💾 Archived View for thrig.me › blog › 2023 › 09 › 28 › busy-stuck-process.gmi captured on 2024-08-18 at 18:05:03. Gemini links have been rewritten to link to archived content
Obviously, yes, if I'm asking the question. The typical answer is "no" because if a program is stuck it cannot be doing anything, because it is stuck. Logic! So let's first devise a program that "gets stuck". One way is to block on a pipe read. There are other ways this can happen.
#include <err.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    char buf[1];
    int fds[2];
    if (pipe(fds) != 0) err(1, "pipe");
    while (1) {
        read(fds[0], buf, 1);
        fprintf(stderr, "not stuck\n");
    }
}
$ make stuck && ./stuck
cc -O2 -pipe -o stuck stuck.c
...
Elsewhere, top(1) might report something along the lines of:
$ top -d 1 all| awk '/PID/{print}/stuck/{print}'
  PID USERNAME PRI NICE  SIZE   RES STATE    WAIT     TIME     CPU COMMAND
38172 jmates    -6    0  172K  856K idle     piperd   0:00   0.00% stuck
Is this process stuck stuck? WAIT is on "piperd", and it's idle. It's not getting through the while loop, that's for sure (or the "printf debugging" to stderr is broken, which is unlikely but not impossible, but easy to test by removing the blocking read). Not getting through the main loop could be considered stuck. Now we need a process that does work, while also being stuck.
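One possible shape for such a program, as a minimal sketch and not the actual stuckbusy.c: the calling thread blocks in read(2) on a pipe while a few worker threads take very short naps in a loop. The 99999 nanosecond nap matches the timespec seen in the kdump output further down. The names `stuck_while_busy`, `napper`, `MAX_WORKERS`, and the wakeup counter are assumptions made up for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>
#include <unistd.h>

#define MAX_WORKERS 16          /* arbitrary cap for this sketch */

static atomic_int keep_napping;
static atomic_long wakeups;

/* Worker: nap briefly in a loop so that the process keeps calling
 * nanosleep(2) and slowly accumulates CPU time. */
static void *
napper(void *arg)
{
	/* nap length as seen in the kdump output */
	struct timespec nap = { 0, 99999 };

	(void)arg;
	while (atomic_load(&keep_napping)) {
		nanosleep(&nap, NULL);
		atomic_fetch_add(&wakeups, 1);
	}
	return NULL;
}

/* Start nworkers nappers, then block in read(2) on readfd; this is the
 * "stuck" part. If the read ever returns, stop the workers and return
 * how many naps they completed, or -1 if a thread could not be started. */
static long
stuck_while_busy(int readfd, int nworkers)
{
	pthread_t tid[MAX_WORKERS];
	char buf[1];
	ssize_t got;
	int i, started = 0, failed = 0;

	assert(nworkers > 0 && nworkers <= MAX_WORKERS);
	atomic_store(&wakeups, 0);
	atomic_store(&keep_napping, 1);

	for (i = 0; i < nworkers; i++) {
		if (pthread_create(&tid[i], NULL, napper, NULL) != 0) {
			failed = 1;
			break;
		}
		started++;
	}

	if (!failed) {
		got = read(readfd, buf, 1);   /* blocks until a byte arrives */
		(void)got;
	}

	atomic_store(&keep_napping, 0);
	for (i = 0; i < started; i++)
		pthread_join(tid[i], NULL);

	return failed ? -1 : atomic_load(&wakeups);
}
```

A `main` that creates a pipe with pipe(2) and calls `stuck_while_busy(fds[0], 4)` should never return, as nothing writes to the pipe, while the four nappers keep making system calls. The count of four is a guess based on the four distinct timespec addresses in the trace.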
$ CFLAGS=-pthread make stuckbusy && ./stuckbusy
cc -pthread -o stuckbusy stuckbusy.c
...
So this is a strange process according to top(1); it is stuck in wait on "piperd", will (usually) show as being in the "idle" state, but will (sometimes) show CPU usage, and will accumulate CPU time as it runs. (Other process tools on other unix may show different things.) The columns have been collapsed a bit so that they better fit in an 80-column display.
$ top -d 1 all| sed -n '6p;6q'
  PID USERNAME PRI NICE  SIZE   RES STATE    WAIT     TIME     CPU COMMAND
$ for i in `jot 4`; do top -d 1 all| grep stuck; sleep 1; done
45647 jmates    -6    0 2604K 1532K idle     piperd   0:02   0.05% stuckbusy
45647 jmates    -6    0 2604K 1532K idle     piperd   0:02   0.05% stuckbusy
45647 jmates    -6    0 2604K 1532K idle     piperd   0:02   0.00% stuckbusy
45647 jmates    -6    0 2604K 1532K idle     piperd   0:02   0.10% stuckbusy
Threads obviously complicate matters, as one part of a program might be stuck or wedged while other parts continue to work. top(1) and other such tools generally report on wafer-thin snapshots of the process state, so they can show contradictory or confusing information for such processes. More detail may be obtained from process tracing (ktrace, strace, etc) which can be hilariously verbose, but at least will indicate what system calls are being made. Or you may need other forms of debugging, such as taking thread dumps in Java, etc.
$ kdump -f ktrace.out | sed 22q
 45647 stuckbusy RET   nanosleep 0
 45647 stuckbusy CALL  nanosleep(0x2b873a32018,0)
 45647 stuckbusy STRU  struct timespec { 0.000099999 }
 45647 stuckbusy RET   nanosleep 0
 45647 stuckbusy RET   nanosleep 0
 45647 stuckbusy RET   nanosleep 0
 45647 stuckbusy CALL  nanosleep(0x2b841317a78,0)
 45647 stuckbusy CALL  nanosleep(0x2b8a49f8868,0)
 45647 stuckbusy CALL  nanosleep(0x2b917a8ed98,0)
 45647 stuckbusy STRU  struct timespec { 0.000099999 }
 45647 stuckbusy STRU  struct timespec { 0.000099999 }
 45647 stuckbusy STRU  struct timespec { 0.000099999 }
 45647 stuckbusy RET   nanosleep 0
 45647 stuckbusy RET   nanosleep 0
 45647 stuckbusy RET   nanosleep 0
 45647 stuckbusy RET   nanosleep 0
 45647 stuckbusy CALL  nanosleep(0x2b873a32018,0)
 45647 stuckbusy CALL  nanosleep(0x2b8a49f8868,0)
 45647 stuckbusy CALL  nanosleep(0x2b917a8ed98,0)
 45647 stuckbusy CALL  nanosleep(0x2b841317a78,0)
 45647 stuckbusy STRU  struct timespec { 0.000099999 }
 45647 stuckbusy STRU  struct timespec { 0.000099999 }
...
Wacky states like this do occur in production, where there may be a daemon that accepts connections but does not get any work done, or one that uses blistering amounts of CPU but is stuck according to some metric. Debugging may take longer if folks roll to disbelieve the metrics—"a stuck process that uses CPU? that's unpossible!"—instead of digging in to figure out what exactly is going on.
  PID USERNAME PRI NICE  SIZE   RES STATE    WAIT     TIME     CPU COMMAND
84474 jmates    10    0   17M 1740K onproc/9 nanoslp  2:25 519.58% stuckveryb
WAIT says that it is sleeping, but it is also using a bit of CPU. Other unix might be interesting to investigate to see how their process tools report such wacky conditions. Or what other strange process states can you create?