💾 Archived View for thrig.me › blog › 2023 › 07 › 24 › only-one-script.gmi captured on 2024-09-29 at 00:33:49. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
Sometimes there is a need to run only one instance of a script. Additionally, one may want to know when the script last ran or, if the script is still running, how long ago it started. Implementations of this pattern have various problems: the lock could fail, allowing multiple instances to start, or a stale lock could prevent any instance from starting. Failures involve manual intervention, and there may be additional cleanup work depending on exactly what went awry. Another thing to worry about is a script simply not running because 02:30 AM on a particular Sunday does not exist, or running twice because it exists twice. Thanks, daylight saving time!
There are many ways to implement locking poorly, and decisions to be made: do you keep trying for the lock? Once, or for how many tries? Do you then quietly give up, or fail noisily?
#!/bin/sh
echo "$$ try"
test -e lockfile && exit 1
# whoops, race condition here
touch lockfile
echo "$$ lock"
sleep 99
rm lockfile
echo "$$ unlock"
However, if the script crashes or is killed there will be a stale lockfile, so try to handle that.
#!/bin/sh
echo "$$ try"
cleanup() { rm lockfile; }
test -e lockfile && exit 1
trap cleanup EXIT INT TERM HUP
# whoops, race condition here
touch lockfile
echo "$$ lock"
sleep 10
This is still bad; the operating system could crash, and on boot there would be a stale lockfile. How long might that take to be noticed? Also the script could be killed with the KILL signal, which does not allow for any cleanup. And there is still a race condition between the file check and the creation where another instance could start, not find the lockfile, and now you have two instances running. Rare, but not impossible, and less rare as the system becomes more heavily loaded.
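One common way to close the check-then-create race in plain shell is to let a single operation do both jobs: mkdir(1) either creates the directory or fails, so two instances cannot both succeed. A minimal sketch follows; the function names are made up for illustration, and it does nothing about a stale lock left behind by a crash or a KILL signal.

```shell
#!/bin/sh
# Sketch only: acquire_lock and release_lock are invented names.

# mkdir tests for and creates the lock in one step, so there is no gap
# between "check" and "create" for another instance to slip into.
acquire_lock() {
    mkdir "$1" 2>/dev/null
}

# rmdir only removes the (empty) lock directory
release_lock() {
    rmdir "$1"
}

# small demonstration in a scratch directory
demo=$(mktemp -d)
acquire_lock "$demo/lock" && echo "first: got lock"
acquire_lock "$demo/lock" || echo "second: no lock"
release_lock "$demo/lock"
acquire_lock "$demo/lock" && echo "third: got lock"
rm -rf "$demo"
```

A real script would call release_lock from a trap, and only after acquire_lock has succeeded, so that a failed instance does not remove a lock that belongs to someone else.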
Maybe instead of removing the file the flock(2) interface could be used; this way the lock would go away when the locking process exits, crashes, or even when the operating system crashes.
$ doas pkg_add flock
...
$ flock --nonblock lockfile sleep 3 &
[1] 77203
$ flock --nonblock lockfile echo hi
$ flock --nonblock lockfile echo hi
$ flock --nonblock lockfile echo hi
$ flock --nonblock lockfile echo hi
hi
[1] + Done flock --nonblock lockfile sleep 3
In this case there would be a flock process wrapping the entire script that needs a lock. Anything more complicated probably needs a language that has internal support for locking, ideally without the race conditions that shell code is prone to.
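The wrapping can also live inside the script, by holding the lock on a file descriptor for the life of a subshell. This is a sketch under assumptions: it needs a flock(1) that accepts a file descriptor number (the util-linux one does; other ports may differ), and the name run_locked and the exit code 75 are arbitrary choices made here.

```shell
#!/bin/sh
# run_locked LOCKFILE COMMAND [ARG...]
# Runs COMMAND while holding an exclusive lock on LOCKFILE. Returns 75
# without running COMMAND if the lock could not be taken.
# Assumes flock(1) accepts a file descriptor number as its argument.
run_locked() {
    _lockfile=$1
    shift
    (
        # fd 9 is opened on the lockfile by the redirection below
        flock -n 9 || exit 75
        "$@"
    ) 9>"$_lockfile"
}

# hypothetical use:
#   run_locked /var/run/some-rsync.lock run-some-rsync-command
```

The lock goes away when the subshell exits and the descriptor is closed. Note that COMMAND inherits the descriptor, so anything it leaves running in the background would keep the lock held.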
A Perl script can use the special DATA filehandle to lock itself, though this may not be portable to weird operating systems. A benefit here is that there is no external lockfile to wrangle, nor expensive forks out to flock(1) or some other program.
#!/usr/bin/perl
# lock-self.pl - how to lock the script itself
use 5.36.0;
use Fcntl ':flock';
eval {
    local $SIG{ALRM} = sub { die "timeout\n" };
    alarm 1;
    say "$$\ttry lock";
    flock *DATA, LOCK_EX;
    alarm 0;
    1;
} or do {
    die unless $@ eq "timeout\n";
    say "$$\tno lock";
    exit 0;    # meh
};
say "$$\tgot lock";
sleep 3;
say "$$\trelease lock";
__DATA__
this is for the lock, do not remove
There is a lot of boilerplate here that probably should be hidden off in a library.
$ ./lock-self.pl & ./lock-self.pl & ./lock-self.pl & sleep 5 ; echo
[1] 72753
[2] 39509
[3] 24532
24532 try lock
24532 got lock
72753 try lock
39509 try lock
72753 no lock
39509 no lock
24532 release lock
[3] + Done ./lock-self.pl
[2] - Done ./lock-self.pl
[1] Done ./lock-self.pl
A database may bring atomicity and other features, though possibly with a lot more complexity than fiddling with files on a filesystem. For one, a locking table could easily have more metadata associated with a lock, though one might also write JSON into a lockfile (after properly locking it). A database could also support locks from multiple hosts, though there is other software in this space.
#!/usr/local/bin/tclsh8.6
package require sqlite3
sqlite3 mylock lock.db
mylock eval {
    CREATE TABLE lock (
        lock TEXT,
        epoch INTEGER
    )
}
#!/usr/local/bin/tclsh8.6
package require sqlite3
sqlite3 mylock lock.db
puts "[pid] start"
if { [ catch {
    mylock transaction exclusive {
        after 3333
        mylock eval "INSERT INTO lock VALUES('foo',[clock seconds])"
    }
    puts "[pid] done"
} ] } { puts "[pid] nope" }
$ ./create.tcl
$ ./lock.tcl & ./lock.tcl & ./lock.tcl & sleep 5 ; echo
[1] 51759
[2] 64289
[3] 53458
51759 start
64289 start
64289 nope
51759 nope
53458 start
53458 done
[3] + Done ./lock.tcl
[2] - Done ./lock.tcl
[1] Done ./lock.tcl
$ sqlite3 lock.db
SQLite version 3.41.0 2023-02-21 18:09:37
Enter ".help" for usage hints.
sqlite> select * from lock;
foo|1690168116
sqlite>
Bootstrapping a local lock database might be something good for configuration management such as Ansible to do. Configuration management may also have some means to perform locks itself, but that might tie you too strongly to that particular software. This code locks the whole table; more clever things could probably be done with database locks.
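The lock metadata mentioned earlier (who holds the lock, and since when) does not strictly need a database. The sketch below keeps a plain "pid epoch" line inside a lock directory instead of JSON, purely to stay short. All of the names are invented for this example, and `date +%s` is assumed to be available, which it is on most systems though POSIX does not promise it.

```shell
#!/bin/sh
# Sketch: a lock directory that also records who took the lock and when.
# take_lock, lock_age, drop_lock and the "info" file layout are made up.

take_lock() {
    mkdir "$1" 2>/dev/null || return 1
    # the metadata is written only after the lock is held
    printf '%s %s\n' "$$" "$(date +%s)" > "$1/info"
}

# prints how many seconds ago the current holder took the lock
lock_age() {
    [ -r "$1/info" ] || return 1
    read -r _pid _epoch < "$1/info" || return 1
    echo $(( $(date +%s) - _epoch ))
}

drop_lock() {
    rm -f "$1/info" && rmdir "$1"
}
```

There is a brief moment after the mkdir where the lock exists but the info file does not, so a reader needs to cope with lock_age failing.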
Consider a script that runs rsync every five minutes; this has various risks such as the script taking more than five minutes to complete.
*/5 * * * * run-some-rsync-command
A lockfile is one solution, though it is not the only option; another is to run another rsync command after five minutes. This shifts the problem to starting only one instance of the script that does the run-then-sleep pattern, but that could be launched as a startup daemon.
#!/bin/sh
while :; do
    run-some-rsync-command
    sleep 300
done
That the rsync command may take too long to complete remains a problem; the rsync command may need various timeout options or to be wrapped with timeout(1) so that it does not wedge. There may need to be monitoring here so someone can be notified that the rsync is being slow or is not getting the work done. One method would be to touch a status file, and then a monitoring script could check if the status file is older than, say, 15 minutes and if so notify someone.
#!/bin/sh
while :; do
    timeout 300 run-some-rsync-command || notify-about-timeout
    touch /var/run/rsync-was-done
    sleep 300
done
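The monitoring half of that arrangement might look something like the following. The function name and the notify-someone command are hypothetical, and `find -mmin` is a widely available extension rather than something POSIX guarantees.

```shell
#!/bin/sh
# status_is_stale FILE MINUTES
# Succeeds (returns 0) when FILE is missing or was last modified more
# than MINUTES minutes ago. Relies on find's -mmin, a common extension.
status_is_stale() {
    [ -e "$1" ] || return 0
    [ -n "$(find "$1" -mmin +"$2" 2>/dev/null)" ]
}

# hypothetical use from cron or a monitoring agent:
#   status_is_stale /var/run/rsync-was-done 15 && notify-someone
```

Treating a missing status file as stale is a design choice; it means a job that has never run gets reported rather than silently ignored.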
Service agreements may dictate how tight the monitoring is, and how noisy the notification should be, though too many false positives from monitoring will either burn out the on-call, or train them to ignore the warnings. That's an inclusive or, by the way.
A touch of a file may be too simple; slightly more complex would be to increment a counter on each failure, and zero it on success, or to put the result into a round-robin database so there would be a recent list of successes and failures to review. Monitoring could then fire if there were, say, two failures in a row, or maybe if the failure rate is above 50% in ten attempts. This of course may need tuning to cut down on false positives while still surfacing problems in a reasonable amount of time.
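The counter variant could be sketched as below. The function name and the counter file location are assumptions, and the code expects a single writer (such as one run-then-sleep loop), since the read and the write are not locked.

```shell
#!/bin/sh
# record_result COUNTERFILE STATUS
# Zeroes the counter when STATUS is 0, increments it otherwise, and
# prints the new count of consecutive failures.
record_result() {
    _count=0
    if [ -r "$1" ]; then
        read -r _count < "$1" || :
    fi
    case $_count in
        ''|*[!0-9]*) _count=0 ;;    # treat garbage as "no failures yet"
    esac
    if [ "$2" -eq 0 ]; then
        _count=0
    else
        _count=$(( _count + 1 ))
    fi
    echo "$_count" > "$1"
    echo "$_count"
}

# hypothetical use inside the loop:
#   timeout 300 run-some-rsync-command
#   fails=$(record_result /var/run/rsync-failures $?)
#   [ "$fails" -ge 2 ] && notify-someone
```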
Another pattern is a lock window: a period of time during which the job must not be run again. A database or sticking JSON into a file should work here. Using the mtime on a file may or may not work out.
#!/usr/local/bin/tclsh8.6
package require sqlite3
sqlite3 mylock lock.db
mylock eval {CREATE TABLE lock (lock TEXT UNIQUE, epoch INTEGER)}
#!/usr/local/bin/tclsh8.6
namespace path ::tcl::mathop
package require sqlite3 3.24.0
sqlite3 mylock lock.db
proc with-lock-window {name window body} {
    catch {
        set now [clock seconds]
        mylock transaction exclusive {
            mylock eval {SELECT epoch FROM lock WHERE lock=$name} lock {
                if {$now <= [+ $lock(epoch) $window]} return
            }
            uplevel 1 $body
            mylock eval {
                INSERT INTO lock(lock,epoch) VALUES($name,$now)
                ON CONFLICT(lock) DO UPDATE SET epoch=$now
            }
        }
    }
}
with-lock-window foo 5 { puts ok }
One downside of the above code is that both lock failures and database errors are ignored. How robust is your test suite, and how critical is the code involved?
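Where a database feels heavy, the same window test might be sketched in shell with the epoch of the last run stored in a file. The names here are invented, `date +%s` is assumed to exist, and the caller is assumed to already hold a lock of some kind, because the read-then-write is not safe against concurrent instances on its own.

```shell
#!/bin/sh
# Sketch: the stamp file holds the epoch of the last run.

# window_elapsed STAMPFILE SECONDS
# True if no run is recorded, or the recorded run is more than SECONDS old.
window_elapsed() {
    [ -r "$1" ] || return 0
    read -r _last < "$1" || return 0
    case $_last in
        ''|*[!0-9]*) return 0 ;;    # unreadable stamp: act as if never run
    esac
    [ $(( $(date +%s) - _last )) -gt "$2" ]
}

# window_mark STAMPFILE: record "now" as the last run
window_mark() {
    date +%s > "$1"
}

# hypothetical use, with a lock already held:
#   window_elapsed /var/run/foo.stamp 5 && { do-the-work; window_mark /var/run/foo.stamp; }
```

Unlike the Tcl version, errors here are at least visible to the caller through the exit status, though whether anything checks that status is another matter.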
Can a lock be avoided? There are various things to try here; maybe put items on a queue, and a queue runner deduplicates anything that was added two or more times. There are trade-offs; a queue runner might take slightly longer to get around to doing the work as compared to a script that runs right away, unless something is already running. Or perhaps files can be atomically renamed when the work is complete, like how Maildir writes new messages.
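The rename idea can be sketched as follows. This is a simplification of what Maildir does (which uses separate tmp and new directories); the function name is made up, and it assumes a mktemp(1) that accepts a template path, as the GNU and BSD ones do.

```shell
#!/bin/sh
# publish_atomically DEST
# Writes stdin to a temporary file in the same directory as DEST, then
# renames it into place. A rename within one filesystem is atomic, so a
# reader sees either the old complete file or the new one, never a
# partially written one.
publish_atomically() {
    _tmp=$(mktemp "$1.tmp.XXXXXX") || return 1
    if cat > "$_tmp"; then
        mv -f "$_tmp" "$1"
    else
        rm -f "$_tmp"
        return 1
    fi
}

# hypothetical use:
#   generate-report | publish_atomically /var/spool/reports/latest
```

Files made by mktemp are typically only readable by their owner, so a chmod before the mv may be needed if other users must read the result.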
Or maybe there's a window to raise instead of a program to run?
#!/bin/sh
xdotool search --name brogue windowactivate 2>/dev/null && exit
exec brogue
Be sure to test the heck out of any locking code, and ideally stash that code into a standard library so lots of things can benefit from it. No, a bad lock did not result in a Linux system with a load of 5,000 and corrupted payment batch files. Not at all!
tags #monitoring #scripting