[The events described herein actually happened yesterday, but technically spilled over into today, so there you go. —Ed]
Well, that was certainly pleasant.
What was supposed to be a simple upgrade dominoed into a full-scale fiasco.
But first, a bit of setup.
At The Company, we have seven name servers. Two are used by all the computers here to resolve DNS (Domain Name System) queries only; no configuration changes are required on these two machines, so they are not part of this story.
Four of the name servers are authoritative name servers for our domains. These machines will only respond to queries on the domains we host; all other queries (like recursive DNS queries) are ignored.
To make things easier on us, the remaining name server actually hosts all the zone files and pushes them out to the four authoritative name servers (in effect, the four authoritative name servers are slaves of this one server, but the outside world will never see this server). Therefore, we can make changes to the zone files on one server and have the changes automatically pushed out.
That still leaves the problem of new zones being added (which is a sore point with me with regard to bind). When a new zone is added, not only does the configuration of the one master server need changing, but the configuration files of the four authoritative servers (aka (Also Known As) the “slave” servers) need changing too. While I have a script that will generate all five configuration files, we still need to copy four of them out, one to each of the authoritative servers.
So I wanted to automate copying the new configuration files to all the authoritative name servers, and restarting the name servers there, whenever the configuration files are created. And to do that, I needed to set up a trust mechanism so that the server that has all the zones can copy a configuration file and restart the name servers without intervention. Easy enough to do with ssh.
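The automation being aimed at here can be sketched roughly as follows. This is a minimal sketch and not the actual script: the host names (`ns1` through `ns4`), the `./generated` directory, the `named.conf.<host>` naming scheme, the remote `/etc/named.conf` location and the `named` init script path are all assumptions made up for illustration. It also assumes key-based ssh trust is already in place. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: every host name, path and file name below is hypothetical.
SLAVES="${SLAVES-ns1 ns2 ns3 ns4}"   # the four authoritative servers (made-up names)
CONF_DIR="${CONF_DIR-./generated}"   # where the generated configs are assumed to live
RUN="${RUN-echo}"                    # dry run by default; set RUN= (empty) to execute

# Copy one server's generated config into place, then restart its name server.
# The restart is skipped if the copy fails.
push_config() {
    host=$1
    $RUN scp "$CONF_DIR/named.conf.$host" "root@$host:/etc/named.conf" &&
    $RUN ssh "root@$host" "/etc/rc.d/init.d/named restart"
}

# Push to every server in turn, stopping at the first failure.
push_all() {
    for host in $SLAVES; do
        push_config "$host" || { echo "push to $host failed" >&2; return 1; }
    done
}
```

Calling `push_all` as-is prints one `scp` and one `ssh` line per server; running it with `RUN=` set to the empty string would execute them for real.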
But there were some … interoperability issues between the various machines with respect to their various instances of ssh (basically, scp (secure copy) didn't work due to protocol differences). Easily solved by installing the latest version of OpenSSH [1] on each machine. Well, actually, installing the latest version of zlib [2], then OpenSSL [3] and then OpenSSH.
This master server, the one with all the zones, didn't need this upgrade, so I didn't bother with that. The four authoritative name servers, however, needed the upgrades. Now, I should mention at this point that the four authoritative name servers are all Cobalt RaQs. Sure, they're pretty old, but at 1U (rack unit, 1.75″) high and with low power consumption, they're fine for doing the dedicated task of resolving DNS queries.
The upgrade went smoothly on three of the machines—pretty much:
```
# cd zlib-1.2.3
# ./configure
# make
# make install
# cd ../openssl-0.9.8a
# ./config
# make
# make install
# cd ../openssh-4.3p1
# ./configure
# make
# make install
# /etc/rc.d/init.d/sshd stop
# /etc/rc.d/init.d/sshd start
#
```
On the fourth machine (which happened to be the primary of our authoritative name servers) the make install of OpenSSH failed. Which I didn't notice.
Oops.
The result of which was a borked program that refused to run, and no backup of the working version.
Oops.
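In hindsight, both oopses come down to two missing habits: keep a copy of the working binary, and check the exit status of the install step. Here is a minimal sketch of that idea. The helper name `backup_then_install` is made up for illustration, and the `sshd` path in the usage note is an assumption about where the binary lives.

```shell
#!/bin/sh
# Sketch only: backup_then_install is a hypothetical helper, not an existing tool.
# Usage: backup_then_install /path/to/binary install-command [args...]
backup_then_install() {
    bin=$1
    shift
    # Save the currently working binary, if there is one.
    if [ -f "$bin" ]; then
        cp -p "$bin" "$bin.bak" || return 1
    fi
    # Run the install command and check whether it actually succeeded.
    if "$@"; then
        return 0
    fi
    echo "install failed, restoring $bin from backup" >&2
    if [ -f "$bin.bak" ]; then
        cp -p "$bin.bak" "$bin"
    fi
    return 1
}
```

From inside the OpenSSH source directory, this might be invoked as `backup_then_install /usr/local/sbin/sshd make install`, adjusting the path to wherever sshd is actually installed.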
Somehow, I ended up being logged out of the machine. And without a working sshd there was no way I could log back into the machine.
Well, not easily.
You see, the Cobalt RaQs don't have video or keyboard ports. They're designed as servers—they don't really need such devices. They do, however, have a serial port you can log in through.
So I hook up a serial cable from a nearby server to the RaQ in question and that's when I got hit with Murphy's Law [4] yet again—the serial login was disabled.
Hrm.
Okay. Take the machine out of the rack, take the drive out of the machine, hook it up to my workstation, change it so one can log in through the serial port, put the drive back in, power up the machine and log in through the serial port.
So I start to take the machine out of the rack when I get hit with Murphy's Law for a third time: one of the screws is stripped, so I can't get it out of the rack.
Okay, now what?
I know that Linux (which is what runs on this Cobalt RaQ) can support a serial console. Maybe I can boot into single user mode and go from there.
Nice idea, but apparently the Linux kernel for these boxes doesn't support the serial console (as incredible as that may seem). Yes, I can see the shell prompt in single user mode, but everything I type just goes into the bit bucket (and this I try several times, with different arguments to the Linux kernel to try to get it to use a serial console). And each time I do this, I end up having to shut the machine down, which leaves the file system in an inconsistent state, requiring the use of fsck to fix.
Okay, so I really need to get the machine out of the rack. But how to do that? I'm looking at the situation when I get an idea: I'll attack the problem from a different angle. Literally! The Cobalt RaQ has two “wings” (one on either side) which are attached by screws, and it's these “wings” which are then screwed into the rack. I can get access to the screws holding the wing in place. So, I effectively remove the wing from the RaQ and it slides right out.
Then it's to my workstation. Open up the Cobalt RaQ, remove the drive, attach said drive to the external USB (Universal Serial Bus) drive case, turn it on, run fsck on the drive, edit the configuration to allow logins from the serial port, umount the drive, power it down, remove it from the USB drive, put it back into the RaQ, power it on and—
—have it fail to boot.
You see, when I “fixed” the drive using fsck on my workstation, it marked the drive as being a newer version of the filesystem. Which the fsck on the Cobalt RaQ doesn't support (as part of the boot up sequence, it automatically checks the drives using fsck).
Murphy strikes again.
Okay, attach the drive to my workstation, copy over a newer version of fsck, move the drive back to the RaQ and power up—
—only to have it fail yet again. Apparently, the old version of fsck used options that the new version of fsck doesn't like. So back to the workstation, modify the startup scripts to remove the options fsck is bitching about, try again, move the drive back to the workstation because I apparently edited the wrong startup scripts and try again only to find out I mucked it up again, so back to the workstation …
It was about half an hour of moving the drive back and forth before I got the RaQ to finally finish booting and to the point where I could log in successfully through the serial port.
Start the OpenSSH install from scratch.
```
# tar xzvf ../archive/openssh-4.3p1.tar.gz
# cd openssh-4.3p1
# ./configure
```
Only now configure failed!
Huh?
I check, and gcc failed with an internal error!
I try a quick C program and yup, gcc is totally borked now.
At this point, the only thing left is to reinstall the operating system.
Now, the installation procedure for the Cobalt RaQ 3s and 4s requires another PC, which you boot using a special CD (Compact Disc). The PC in question must have a single network port and one (1) CD drive. Anything else will confuse the installation CD. Once this CD is booted, you then force the Cobalt RaQ to do a netboot. This PC will see the netboot request, and feed it an installation program which will install Linux on the Cobalt RaQ.
I have to use P's computer, as it's the one that fits the requirements of the installation CD. But P's computer is on its last legs, sounding much like a dying diesel engine. But it's not dead yet.
Only it hung during the installation, having difficulty reading the CD.
I try it again. Same thing.
I turn off P's computer for half an hour. You know, let it cool down. Try it again. Same thing.
Smirk then suggested another computer in the office.
Same thing—it hangs.
It was then I remembered something from my past experiences with installing Cobalt RaQs: don't use the Cobalt RaQ 3 installation disks! They don't work. Use the Cobalt RaQ 4 installation disks instead (even if you are installing on a Cobalt RaQ 3, which I was).
That worked.
So now I had a fresh install of the operating system.
But no ssh.
Now, how to get files to the box … okay, the Cobalt RaQ has ftp. Okay, compile an FTP server on my workstation. Then use ftp to transfer zlib, openssl, openssh and bind to the Cobalt RaQ. Spend the next couple of hours compiling.
Finally had it back up and running and could finish the job I started some eight hours previously.
Blarg.