If builders built buildings the way programmers wrote programs, then the first woodpecker that came along would destroy civilization.
Sad to say that's the first thing that came to mind at the end of tonight's (or rather, this morning's) adventures.
Around midnight the Data Center In Boca Raton fell off the face of the Internet. I caught it just as it happened (checking things out on the new router I installed at a customer site some six hours earlier) and by the time I left a voice mail message to our upstream and talked to Smirk (he called as I was leaving the voice mail message), the Data Center In Boca Raton was back on the Internet.
Shortly after that, while scanning the logs from snmptrapd (I have all our routers sending SNMP (Simple Network Management Protocol) traps to a central server), I got fed up with seeing stuff like:
```
2009-12-06 06:28:08 XXXXXXXXXXXXXXXXXXXXXXXXX [XXXXXXXXXXXXXX]:
SNMPv2-MIB::sysUpTime.0 = Timeticks: (125022780) 14 days, 11:17:07.80
SNMPv2-MIB::snmpTrapOID.0 = OID: SNMPv2-SMI::mib-2.14.16.2.10
SNMPv2-SMI::mib-2.14.1.1 = IpAddress: XXXXXXXXXXXXXX
SNMPv2-SMI::mib-2.14.7.1.1 = IpAddress: XXXXXXXXXXXXXX
SNMPv2-SMI::mib-2.14.7.1.2 = INTEGER: 0
SNMPv2-SMI::mib-2.14.10.1.3 = IpAddress: XXXXXXXXXXXXXX
SNMPv2-SMI::mib-2.14.16.1.3 = INTEGER: 4
SNMPv2-SMI::mib-2.14.4.1.2 = INTEGER: 1
SNMPv2-SMI::mib-2.14.4.1.3 = IpAddress: XXXXXXXXXXXXXX
SNMPv2-SMI::mib-2.14.4.1.4 = IpAddress: XXXXXXXXXXXXXX
```
(only on one line). That makes it hard to figure out what the heck the router is complaining about, and I wanted to change the format of the MIB (Management Information Base) entries to make them easier to read. I changed the command line options to snmptrapd only to get:
```
/usr/sbin/snmptrapd: symbol lookup error: /usr/lib/libnetsnmpmibs.so.5:
undefined symbol: netsnmp_TCPIPv6Domain
```
Mind you, it took a good ten minutes of scratching my head over why /etc/init.d/snmptrapd start wasn't working before I tried running it at the command line.
All I know—it was running fine a few minutes before, but not now. I guess something changed in the 130 days since the server rebooted (my guess: a new version of snmptrapd without a corresponding new version of some library—did I mention I hate package managers?). No problem, as I had a locally installed copy in /usr/local/sbin/snmptrapd I could use.
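As for the original complaint about one-line trap records: when snmptrapd's own formatting options are out of reach, the records can be reflowed after the fact. What follows is a minimal sketch in Python, not what was done that night. It assumes every varbind starts with a `MODULE::object = ` token, as in the sample above, and the name `split_trap_record` is made up for illustration.

```python
import re

# Assumption: each varbind begins with "MODULE::object = ", as in the sample
# record above. Whitespace directly in front of such a token is treated as a
# record separator; an OID that appears as a *value* is not followed by " = "
# and so is left alone.
VARBIND_START = re.compile(r"\s+(?=[A-Za-z0-9-]+::\S+ = )")


def split_trap_record(line):
    """Split a one-line trap record into its header plus one varbind per item."""
    parts = VARBIND_START.split(line.strip())
    return [part.strip() for part in parts if part.strip()]
```

Printing `"\n    ".join(split_trap_record(line))` for each logged line gives a header followed by indented varbinds, which is at least easier on the eyes than the original.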
I rebooted the server (it's a virtual server—takes less than a minute) when I noticed some odd issues with syslogd.
Okay, I'm not running the default syslog that comes with the distribution—no, I've been testing a homegrown syslog (which I will get around to talking about—it's quite cool) and it was basically hanging when starting up (enough that some program called minilogd was starting up, even though I have no XXXXXXX clue as to what is starting it—I can't find any reference to it in the startup scripts).
Eventually, I figure out it's blocking on a DNS (Domain Name System) lookup (I'm relaying syslog traffic to a centralized server, but that, as Alton Brown [1] says, is another show), which is odd, because DNS hasn't been an issue.
I check, and I see I'm only using one of the two DNS resolvers we have.
I can't resolve.
I can ping the DNS server from the server I'm on.
I can ssh to the DNS server from the server I'm on.
I just can't resolve DNS queries.
Now, the DNS resolver and the server I'm on are both virtual servers.
On the same physical computer.
The other resolver?
That's a virtual server on another physical computer and yes, I can resolve fine using that (so I set the default DNS resolver to be the one that is working while I try to troubleshoot the current issue that shouldn't be happening).
We used to have an issue with some virtual servers using that virtual DNS resolver, but I thought we had that licked months ago.
Maybe it's back?
I check iptables everywhere and no … should be fine.
A couple of hours go by.
I've finally isolated the issue—the resolver itself can't resolve.
But the other one can.
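The "ping works, ssh works, DNS doesn't" check can be scripted so that each resolver is asked directly. The following is a rough sketch in Python using only the standard library. It hand-builds a single A-record query in the RFC 1035 wire format and reports whether the resolver stayed silent or answered with an error code. The function names, the default query name `example.com`, and the fixed query ID are all assumptions made for illustration.

```python
import socket
import struct


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def classify_reply(reply, query_id=0x1234):
    """Turn a raw reply into 'ok', 'rcode N', or a short complaint."""
    if len(reply) < 12:
        return "short reply"
    reply_id, flags = struct.unpack("!HH", reply[:4])
    if reply_id != query_id or not flags & 0x8000:  # wrong ID or not a response
        return "unexpected reply"
    rcode = flags & 0x000F  # 0 = NOERROR, 2 = SERVFAIL, 3 = NXDOMAIN, ...
    return "ok" if rcode == 0 else f"rcode {rcode}"


def probe_resolver(resolver_ip, name="example.com", timeout=2.0):
    """Send one UDP query to resolver_ip and describe what came back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (resolver_ip, 53))
            reply, _ = sock.recvfrom(512)
        except socket.timeout:
            return "no reply"
        except OSError as err:
            return f"socket error: {err}"
    return classify_reply(reply)
```

A resolver that is reachable but cannot reach the rest of the Internet would be expected to show up here as `rcode 2` (SERVFAIL) or `no reply` rather than `ok`, which points at the network beyond the resolver instead of the resolver's own configuration.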
It was then I noticed some odd messages being logged to syslog and coming from our monitoring system [2]:
```
HOST ALERT: XXXXXXXXXXXXXXX;DOWN;SOFT;14;
(No Information Returned From Host Check)
HOST ALERT: XXXXXXXXXXXXXXX;DOWN;SOFT;15;CRITICAL
- Host Unreachable (XXXXXXXXXXXXX)
HOST ALERT: XXXXXXXXXXXXXXX;DOWN;SOFT;16;CRITICAL
- Host Unreachable (XXXXXXXXXXXXX)
```
Hmm … our monitoring system in Charlotte can't reach our resolver … okay, let's do a traceroute from Charlotte to the resolver and—
OH XXXXX XXXXXXX XXXXX ON A XXXXXXX XXXX XXXXX! No wonder I'm having DNS issues—the netblock the resolvers are on isn't being announced! WXXX TXX FXXX‽
That little outage around midnight? Apparently our upstream's upstream had a slightly larger issue and couldn't route (what turned out to be) a few of our netblocks. We do have multiple connections to the Internet, but … well … it's a long story, but basically, just running BGP (Border Gateway Protocol) isn't enough; no, we have to send authorization emails to have the other provider announce our routes that normally go through the one that had (and was still having) issues.
Okay, so the problem(s) at hand. The fact that the netblock our DNS resolvers were on wasn't being announced would explain why the one resolver couldn't even resolve using itself; the other resolver probably had a larger working DNS cache and never had to send a query.
I swear, the number of “moving parts” a modern networked computer has to deal with is amazing, and it's amazing it works at all as well as it does, when it does. But man, when it breaks, it breaks and it's a bitch to troubleshoot (especially when you're doing it remotely—why even suspect the network in such a case?).