I mentioned briefly yesterday about the issue I was having with syslogintr [1] while booting the computer. On my system, the system would hang just after loading syslogintr. I tracked it down to initlog [2] hanging. Further investigation revealed that both syslogintr and initlog were hanging, but the significance of that escaped me until an epiphany I had while sleeping: I was experiencing yet another race condition! [3]
A quick test today proved that yes, it was a race condition. A particularly nasty race condition too, since once again, I wasn't explicitly writing multi-threaded code.
syslogintr creates a local socket (/dev/log for those [DELETED-that-DELETED] who are curious) and then waits to receive logging messages sent to said socket, something like:
>
```
local = socket(...)
while(!interrupted)
{
read(socket,buffer,sizeof(buffer));
process(buffer);
}
```
But in the process of processing the incoming message, syslogintr may itself call syslog():
>
```
while(!interrupted)
{
read(socket,buffer,sizeof(buffer));
process(buffer);
syslog(LOG_DEBUG,"processed message");
}
```
syslog() (which is part of the standard library under Unix) sends the message to the local socket (/dev/log). The data is queued up in the socket, but it's okay because it'll cycle around quickly enough to pick up the new data. Unless there's too much data already queued in the local socket, at which point whoever calls syslog() will block until the backlogged data in the local socket is dealt with.
The startup script (/etc/init.d/syslog for those of you following along at home) starts both syslogintr and klogd. klogd is a program that pulls the data from the kernel logging queue (logging messages the kernel itself generates, but the kernel can't use /dev/log as that's just a Unix convention, not something enforced by the kernel itself) and logs that data via syslog(). And by the time klogd starts up, there's quite a bit of logging data generated by the kernel. So that data gets blasted at syslogintr (and in the process, so much data is being sent that klogd is blocked from running). But syslogintr is still coming up to speed and generating a bunch of internal messages and suddenly, its calls to syslog() are blocking, thus causing a deadlock:
>
```
while(!interrupted)
{
read(socket,buffer,sizeof(buffer));
process(buffer); /* this takes some time */
/*--------------------------------------------------
; meanwhile, something like klogd could be blasting
; data to the local socket, filling it up, thus when
; we get to:
;---------------------------------------------------*/
syslog(LOG_DEBUG,"processed message");
/*-------------------------------------------------
; the call to syslog() blocks, thus blocking the
; program until something else (in this case, *us*)
; drains the data waiting in the socket. But we
; can't drain the data because we're waiting (via
; syslog()) for the data to be drained!
;
; Can you say, *deadlock* boys and girls? I knew
; you could.
;--------------------------------------------------*/
}
```
This also explains why it only happened when booting—because that's about the only time so much data is pushed to syslogintr that socket operations (reading, writing) are blocked. It also explains why I haven't seen it on any other system I'm running it on, since those systems don't run klogd (being virtual hosts, they don't have klogd).
If you've ever wondered why software tends to crash all the time, it's odd interactions like this that are the cause (and this was an easy problem to diagnose, all things considered).
So now I internally queue any logging messages and handle them in the main loop, something along the lines of:
>
```
while(!interrupted)
{
foreach msg in queued_messages
process(msg);
read(socket,buffer,sizeof(buffer));
process(buffer);
queuelog(LOG_DEUBG,"processed message");
}
```