Millions of moving parts

In a system of a million parts, if each part malfunctions only one time out of a million, a breakdown is certain.

“—Stanislaw Lem”

In between paying work, I'm getting syslogintr [1] ready for release—cleaning up the Lua [2] scripts, adding licensing information, making sure everything I have actually works, that type of thing. I have quite a few scripts that isolated some aspects of working scripts—for instance, checking for ssh attempts and blocking the offending IP (Internet Protocol) but weren't fully tested. A few were tested (as I'm using them at home), but not all.

I update the code on my private server, rewrite its script to use the new modules (as I'm calling them) only to watch the server seize up tight. After a few hours of debugging, I fixed the issue.

Only it wasn't with my code.

But first, the scenario I'm working with. Every hour, syslogintr will check to see if the webserver and nameserver are still running (why here? Because I can, that's why) and log some stats gathered from those processes. The checks are fairly easy—for the webserver I query mod_status [3] and log the results; for the nameserver, I pull the PID (Process ID) from /var/run/named.pid and from that, check to see if the process exists. If they're both running, everything is fine. It was when both were not running that syslogintr froze.

Now, when the appropriate check determines that the process isn't running it not only logs the situation, but sends me an email to alert me of the situation. If only one of the two processes were down, syslogintr would work fine. It was only when both were down that it froze up solid.

I thought it was another type of syslog deadlock [4]—Postfix [5] spews forth multiple log entries for each email going through the system and it could be that too much data is logged before syslogintr can read it, and thus, Postfix blocks, causing syslotintr to block, and thus, deadlock.

Sure, I could maybe increase the socket buffer size, but that only pushes the problem out a bit, it doesn't fix the issue once and for all. But any real fix would probably have to deal with threads, one to just read data continuously from the sockets and queue them up, and another one to pull the queued results and process them, and that would require a major restructure of the whole program (and I can't stand the pthreads API (Application Programming Interface)). Faced with that, I decide to see what Stevens [6] has to say about socket buffers:

With UDP (User Datagram Protocol), however, when a datagram arrives that will not fit in the socket receive buffer, that datagram is discarded. Recall that UDP has no flow control: It is easy for a fast sender to overwhelm a slower receiver, causing datagrams to be discarded by the receiver's UDP …

Hmm … okay, according to this, I shouldn't get deadlocks because nothing should block. And when I checked the socket receive buffer size, it was way larger than I expected it to be (around 99K if you can believe it) so even if a process could be blocked sending a UDP packet, Postfix (and certainly syslogintr wasn't sending that much data.

And on my side, there wasn't much code to check (around 2300 lines of code for everything). And when a process list showed that sendmail was hanging, I decided to start looking there.

Now, I use Postfix, but Postfix comes with a “sendmail” executable that's compatible (command line wise) with the venerable sendmail [7]. Imagine my surprise then: