< aquanight> okay in what universe does an average cron job fuck up everything
This universe! Cron failure is my go-to example because it happened to me. One of those get-paged-at-2AM incidents--I forget how much bad sleep I had had prior to that, probably not much--which caused N hours of immediate cleanup and M hours of subsequent work chasing down bugs and kindly asking the devs to checksum what they write, especially the payment batch files. This was a while ago, before the smartphone craze, so there were still pagers. Good times.
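The "checksum what they write" request is only a few lines of shell; a minimal sketch, assuming GNU coreutils sha256sum, with illustrative file names that are not from the original system:

```sh
#!/bin/sh
# Writer side: record a digest alongside the batch file.
# Paths here are illustrative.
printf 'batch data\n' > /tmp/batch-001.dat
sha256sum /tmp/batch-001.dat > /tmp/batch-001.dat.sha256

# Reader side, later: verify the batch before processing it.
if sha256sum -c /tmp/batch-001.dat.sha256 >/dev/null 2>&1; then
    RESULT="intact"
else
    RESULT="corrupt"
fi
echo "batch $RESULT"
```

A corrupted batch then fails loudly at verification time instead of quietly poisoning downstream processing.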
A root cause turned out to be that someone had written shell code along the lines of
    #!/bin/sh
    ...
    $PID_FILE=`cat /who/knows/where.pid`
which, as you know from unix shell 101, expands $PID_FILE before anything else, so the shell will try to run something like
=42
or
=
and regardless the PID_FILE never got set. The PID_FILE code was to prevent multiple instances of the script from running. A good idea, but here somewhat poorly implemented. The shell does warn about this,
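For contrast, a pidfile guard that actually sets the variable might look something like this; a minimal sketch with an illustrative path, and note the lack of a leading $ on the assignment:

```sh
#!/bin/sh
# Minimal pidfile guard sketch; /tmp/example.pid is an illustrative path.
PID_FILE=/tmp/example.pid          # assignment: no leading $

# If the file exists and names a live process, refuse to start.
if [ -e "$PID_FILE" ] && kill -0 "$(cat "$PID_FILE")" 2>/dev/null; then
    echo "already running, pid $(cat "$PID_FILE")" >&2
    exit 1
fi

echo $$ > "$PID_FILE"              # record our own pid
trap 'rm -f "$PID_FILE"' EXIT      # clean up when the script ends
# ... the actual work would go here ...
```

This still has races (two instances can pass the check before either writes the file), which is one reason to reach for a real locking tool instead.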
    $ $BLAH=42
    ksh: =42: not found
but the shell script was on the order of 2,000 to 3,000 lines long, which is several orders of magnitude longer than where I would start looking for a different language to write the task in, and elsewhere in the code there were lines like
    exec 2> error.log
    ...
    if [ $things_are_okay -eq 1 ]; then rm error.log; fi
the effect of which is that if the code thought that everything was fine, the logfile with the error was removed. This made the issue a bit tricky to debug and resulted in various unkind things being said.
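A gentler pattern is to append to the log and record success, rather than deleting the evidence; a sketch with illustrative paths, and with things_are_okay set by hand to stand in for the script's real check:

```sh
#!/bin/sh
# Keep-the-evidence logging sketch; path and variable are illustrative.
LOG=/tmp/process-payments.log
exec 2>> "$LOG"                    # append stderr; never truncate or delete

things_are_okay=1                  # stand-in for the script's real check

if [ "$things_are_okay" -eq 1 ]; then
    echo "run ok at $(date)" >> "$LOG"   # note success; keep prior errors
fi
```

If the "everything is fine" check is itself wrong, the errors are still on disk for the 2AM debugger.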
    DESCRIPTION
        The cron daemon schedules commands to be run at specified dates and
        times. Commands that are to be run periodically are specified within
        crontab(5) files.
so in the language of cron(8) one might have a line something like
0 * * * * /somewhere/process-payment-batches.sh
the gist of which is to run the given script at 11:00, 12:00, 13:00, and so on, every hour of every day, come what may.
The trap is when the script takes more than, here, an hour to run--an hour? It would never take an hour, claims the optimist. Someone did think to add a PID_FILE check, which could be good: if someone starts the script manually, say at 13:59:47, that instance may still be running when 14:00 rolls around.
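On Linux systems the overlap problem can also be handed to util-linux flock(1) rather than a hand-rolled pidfile: with -n, the second invocation exits immediately instead of running the command if the lock is still held. A sketch of the crontab line, with an illustrative lock path:

```
0 * * * * flock -n /tmp/payments.lock /somewhere/process-payment-batches.sh
```

The kernel releases the lock when the script exits, however it exits, so there is no stale-pidfile cleanup to get wrong.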
As it turned out, the shell script could do quite a lot of work, and the PID_FILE code never had worked, and the error message from the $PID_FILE=`...` code was always thrown away. The "lot of work" was some O(horrible) number that depended on how much prior data there was lying around, which in a certain season for a certain retailer could grow to a too large amount.
"Big O notation" is what O(...) represents. "horrible" is not a typical value, but is accurate enough here: the code could not scale beyond a very small amount of input without taking too long to run.
Eventually a cron job did take more than an hour to run, and every hour a new cron job would start, and when I was finally able to log in to the system and actually get a response from `uptime`, the CPU load was something like 5,000 (on a dual-processor linux system), and other code was happily writing random bytes meant for file descriptor X into the random file descriptor Zebra, ¿Qué?, or ROFL!! thus corrupting the loving heck out of the payment batch files.
"Normal Accidents" (1984) might be a good read.
tags #crontab #debug #sh