I felt like I was in an episode of House [1].
I found myself at The Data Center, waiting for one of our customers, R, to show up so I could let him in (he had forgotten the access code). While there, I was attempting to extricate a KVM (Keyboard, Video, Mouse) cable he could use when I pulled the wrong cable and unplugged a power strip.
The upshot: I took down some of R's equipment that wasn't having problems.
Sigh.
R showed up, and we checked on the equipment that had experienced the unplanned power outage; one of his Linux boxes was in trouble. It was running Asterisk [2] and it had the most amusing problem: it kept dumping core on an illegal instruction and, upon crashing, would restart itself [3].
But in troubleshooting that problem, it became rather apparent something else was terribly wrong:
>
```
GenericUnixRootPrompt# df
Filesystem 1K-blocks Used Available Use% Mounted on
GenericUnixRootPrompt#
```
Nothing mounted, but I could still see files. fdisk showed two partitions, /dev/hda1 and /dev/hda2. fsck worked fine on /dev/hda1 but failed on /dev/hda2 since it didn't know what type of filesystem was on it. Odder still, /dev/hda1 was the boot partition, containing only the kernel and related files required for the initial operating system boot, and yet here I was, in a shell, running Unix commands like fsck and fdisk and more.
Yet fsck and even mount had no idea what type of filesystem was on /dev/hda2.
Yet it had to be the root filesystem, the one I was currently using, because /dev/hda1 didn't have fsck, mount, or more, much less /bin/bash.
Worse still, what I did have, including /tmp, was in “read-only” mode.
The Asterisk crashing problem would have to wait.
I was able to get the box on the network and back up everything to another system. While that was chugging along (it took about an hour) I realized that the system was somehow mounting /dev/hda2, otherwise there'd be nothing to back up. Checking /etc/fstab didn't help much:
>
```
GenericUnixRootPrompt# more /etc/fstab
# This file is edited by fstab-sync - see 'man fstab-sync' for details
/dev/hdb1 /media/cdrom auto user,noauto 0 0
GenericUnixRootPrompt#
```
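An aside for anyone in the same spot: when df comes back empty, one other place to look is /proc/mounts, which is the kernel's own list of what is mounted. One plausible reason for the empty df is that it reads /etc/mtab on systems of this vintage, and that file can't be kept current while the root filesystem is read-only. The sketch below is not something run on R's box; the function name is made up, and it assumes /proc is mounted.

```sh
# Print the device, filesystem type and mount options of whatever
# is mounted on "/", according to a /proc/mounts-style file.
# The function name is hypothetical; $1 defaults to /proc/mounts.
root_mount_info() {
    # Several entries can list "/" (for example an initial "rootfs"
    # line), so keep the last one seen, which is the one in effect.
    awk '$2 == "/" { dev = $1; type = $3; opts = $4 }
         END { if (dev != "") print dev, type, opts }' "${1:-/proc/mounts}"
}
```

Running `root_mount_info` with no argument reads the live /proc/mounts; an "ro" in the options field would match the read-only behavior described above.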
I then checked /boot/grub/grub.conf (since something was being mounted as the root filesystem) and found that the root partition wasn't /dev/hda2 but something like /dev/VolGroup00/LogGroup00. Using that I was able to check the filesystem and remount it as read/write. I was then able to add it to /etc/fstab, reboot the system and have it come up fine, thus saving R from having to nuke-n-pave the system. How /etc/fstab ended up without the root filesystem is something I don't know (but I suspect the system may have been trying to update that file when the power was cut; hey, it's as good a theory as anything), but at least the system was back up and running.
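The grub.conf lookup can be sketched as a small shell function. This assumes GRUB legacy syntax, where each kernel line carries a root= argument naming the root device; the function name is made up for illustration.

```sh
# Print the root= device from every "kernel" line of a GRUB legacy
# config file given as $1. The function name is hypothetical.
grub_root_devices() {
    # awk's default field splitting ignores the leading indentation,
    # and commented-out lines never have "kernel" as their first field.
    awk '$1 == "kernel" {
             for (i = 2; i <= NF; i++)
                 if ($i ~ /^root=/)
                     print substr($i, 6)   # drop the "root=" prefix
         }' "$1"
}
```

Called as `grub_root_devices /boot/grub/grub.conf`, it prints one device per boot entry. With that name in hand, the remount itself is usually along the lines of `mount -o remount,rw /`.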
That just left the little problem of Asterisk continuously dumping core on an illegal instruction. A recompile of the program (since R and I thought maybe the executable was corrupted) didn't solve the problem. A compile of the latest version didn't solve the problem either, but we did notice that there were a few modules for Asterisk installed that don't come with the default install of Asterisk. And one of those modules had pentium4-sse3 in the name.
I checked the box—it was a Pentium IV with SSE2, not a Pentium IV with SSE3.
That would definitely explain the crashing.
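For the curious, here is a sketch of one way to make that check on Linux. It assumes an x86 box with /proc/cpuinfo available, the function name is made up, and note that the kernel lists SSE3 under the flag name "pni" rather than "sse3".

```sh
# Succeed (exit status 0) if the first "flags" line of a
# cpuinfo-style file lists the CPU flag named in $1.
# The function name is hypothetical; $2 defaults to /proc/cpuinfo.
has_cpu_flag() {
    awk -v want="$1" '
        /^flags/ {
            # fields 1 and 2 are "flags" and ":", the rest are flag names
            for (i = 3; i <= NF; i++)
                if ($i == want) found = 1
            exit                      # only the first CPU is examined
        }
        END { exit !found }' "${2:-/proc/cpuinfo}"
}
```

On a box like this one, `has_cpu_flag sse2` would be expected to succeed and `has_cpu_flag pni` to fail.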
It seems that R had hired someone to install a particular codec for Asterisk and they grabbed the wrong version (or rather, the version for the wrong processor). The only reason Asterisk hadn't crashed before was that the module hadn't actually been loaded into Asterisk. Well, until the reboot, that is. We removed that module and Asterisk started up fine.
And then it was time to turn to the problem that R had come to The Data Center to investigate …