2021-11-01
When I was in university, I frequently took the time to do something completely unrelated to studying right before an exam. I did this to encourage myself to not procrastinate and to study earlier, as well as to be more relaxed for the final exam. Usually, I would write code unrelated to my studies. Tomorrow morning, I will be defending my Master's thesis, which is somewhat like an exam, so I'm extending this tradition one more time.
That said, this was not entirely voluntary. When I woke up this morning, I was alerted that some of the services I run on my home server were down. The following sequence of events occurred:
1. Initial investigation suggested a problem with the communication between the physical server and the VM. Since I had to work, I responded by attempting to reboot the server. This turned out to be a bad idea: the server then appeared to have lost its network connectivity entirely.
2. Further investigation showed that the server was indeed connected to the network, just not at the usual IP address the DHCP server reserves for it, because the MAC address it was using was the one on the bridge interface (br0) [1] instead of the one on eth0. This was very strange, but I worked around it by changing the br0 configuration to inet static instead of inet dhcp.
3. This still was not enough to resolve the issue, as one of the services I host (A) refused to boot due to what appeared to be a known bug, in the minor version I had deployed, that affects communication with another service (B). I then proceeded to upgrade service A.
4. However, service A continued to refuse to boot, complaining about a missing dependency that I had installed and had not uninstalled.
5. Glancing at the logs, I noticed that service A was trying to use Python 3.9, when the system Python version should have been 3.7 (Debian buster). This set off alarm bells, as Debian releases don't tend to upgrade versions like this.
6. Inspection of /var/log/apt/history.log showed that large parts of the system had been upgraded to sid as a result of installing rclone from sid [2].
[1] This is necessary because the VM is bridged to the network directly without additional routing.
[2] /var/log/apt/history.log excerpt
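As a rough sketch of how such a log could be scanned for this kind of surprise, the Python below pulls the Upgrade entries out of apt history text and pairs each one with the command line that triggered it. The sample log is made up for illustration and is not the actual excerpt from my server; the names SAMPLE_LOG, UPGRADE_RE and parse_upgrades are hypothetical as well.

```python
import re

# Made-up sample in the general shape of /var/log/apt/history.log.
# The package versions and command line are illustrative only.
SAMPLE_LOG = """\
Start-Date: 2021-10-31  06:25:01
Commandline: apt-get install -t sid rclone
Install: rclone:amd64 (1.53.3-1)
Upgrade: libc6:amd64 (2.28-10, 2.32-4), python3:amd64 (3.7.3-1, 3.9.2-3)
End-Date: 2021-10-31  06:27:12
"""

# Matches entries such as "libc6:amd64 (2.28-10, 2.32-4)" and captures
# the package name, the old version and the new version.
UPGRADE_RE = re.compile(r"([^\s:,()]+)(?::\S+)? \(([^,()]+), ([^()]+)\)")


def parse_upgrades(log_text):
    """Return (commandline, package, old_version, new_version) tuples."""
    results = []
    # Transactions in the log are separated by blank lines.
    for block in log_text.strip().split("\n\n"):
        fields = dict(
            line.split(": ", 1) for line in block.splitlines() if ": " in line
        )
        command = fields.get("Commandline", "")
        for name, old, new in UPGRADE_RE.findall(fields.get("Upgrade", "")):
            results.append((command, name, old, new))
    return results


for command, name, old, new in parse_upgrades(SAMPLE_LOG):
    print(f"{name}: {old} -> {new} (via '{command}')")
```

On a real machine, the text passed to parse_upgrades would come from reading /var/log/apt/history.log rather than from the sample string.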
This now makes perfect sense, as the Python 3.9 installation doesn't have the dependencies I need to run service A. This also possibly explains why the MAC address changed, perhaps due to a change in behaviour in some packages. However, attempting a downgrade went very poorly, as it is not a supported flow for Debian. I ended up hosing the entire system to the point where sudo no longer worked, which left reinstalling as the only option. While I was able to recover the physical server's software fairly quickly, as everything is done via Ansible, the setup of the VM is a bit more tricky, so I have not done this yet. Fortunately, all the most important services are hosted directly on the physical server while the VM hosts only unimportant services, which means I can defer their recovery until at least tomorrow.
The root cause of this problem is installing rclone from sid. I needed the version from sid, as I'm relying on some features not available in the version provided by buster. rclone is a Go program, which means it has very few dependencies as a Debian package. The only dependency it has is libc6. An experienced system administrator may immediately see a problem: if the version of rclone in sid starts to require a higher minimum version of libc6 than what is available in stable/buster, then apt will likely upgrade libc6 right away, along with (some of?) its reverse dependencies. Although I didn't confirm that this is what happened this morning, it basically has to be the root cause [3]. The right solution is to install the version of rclone I need directly from rclone's downloads, which is how I fixed it in my Ansible playbooks. With the root cause determined, I realized that all my other servers were likely also messed up, which indeed they were, except for the ARM-based ones that do not have access to sid and one server that failed to run the daily rclone install (due to failing its daily Ansible run).
[3] Please let me know if I'm wrong about this assessment.
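To make the suspected mechanism concrete, here is a toy model in Python of the check a package manager has to make. It is only a sketch under loud assumptions: the version numbers and minimum requirements are illustrative, the function names are hypothetical, and the comparison looks only at the dotted upstream part of a version, which is far simpler than Debian's real version ordering.

```python
def upstream_version(debian_version):
    """Turn '2.28-10' into (2, 28), dropping any epoch and Debian revision.

    Simplification: real Debian version comparison handles letters, tildes
    and revisions; this only handles plain dotted numbers.
    """
    without_epoch = debian_version.split(":", 1)[-1]
    upstream = without_epoch.rsplit("-", 1)[0]
    return tuple(int(part) for part in upstream.split("."))


def packages_to_upgrade(installed, depends):
    """Return {package: minimum} for every dependency the system can't meet."""
    needed = {}
    for package, minimum in depends.items():
        if upstream_version(installed[package]) < upstream_version(minimum):
            needed[package] = minimum
    return needed


# Illustrative values only, not taken from the actual packages.
installed = {"libc6": "2.28-10"}
old_rclone_depends = {"libc6": "2.3.2"}  # hypothetical older requirement
new_rclone_depends = {"libc6": "2.32"}   # hypothetical newer requirement

print(packages_to_upgrade(installed, old_rclone_depends))
print(packages_to_upgrade(installed, new_rclone_depends))
```

With the older requirement, nothing has to change. With the newer one, libc6 itself has to be pulled from sid, and that is the point where the rest of the system can start to follow.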
Installing software from sid is a mistake I have made before: this is not the first time that programs installed from sid have hosed my system. The last time this happened, I removed all sid packages except rclone from the Ansible playbooks. I thought rclone couldn't cause a problem due to its minimal dependencies. Evidently I was wrong, so lesson learned. Fortunately, I was able to recover the most important stuff quickly and can still prepare for my exam tomorrow.
Tags: software
Comments? Email me at shuhao >at< shuhaowu <dot> com.