Whist trying out some new terminal IRC client on the grounds that irssi is both long of tooth and short on security I noticed that the new IRC client had a feature to edit a message using the EDITOR environment variable. This is not very notable. That the CPU usage of that editor then went to 100% is somewhat notable. Normally the editor does not use 100% CPU--it was not written at Microsoft--so presumably the issue has to do with the environment that the editor, ex, is being run in.
EDITOR and VISUAL are two different environment variables. An application should check if VISUAL is set, and use that, otherwise check if EDITOR is set, and use that, otherwise the standard editor, ed. In practice, many assume EDITOR is the only variable worth checking. Even more assume that a visual environment is available, which is most likely true. Back in the good old days a low baud modem might not have been fast enough to run a visual editor such as vi, or only a line printer might have been available. Hence, EDITOR and VISUAL being two different things.
Note that Linux vendors may do wacky things such as not installing ed(1) or linking vi(1) to nano by means of a hilariously complicated alternatives system. If you ever have the misfortune of using such a system, try installing busybox and then use `busybox vi`.
Note also that various programs peek at EDITOR and may change their behavior, e.g. ZSH when trying to figure out what bindkey flavor to use. This can make debugging things more exciting.
$ env | egrep 'EDITOR|VISUAL' VISUAL=/home/jmates/usr/OpenBSD7.2-amd64/bin/vi EDITOR=/home/jmates/usr/OpenBSD7.2-amd64/bin/ex
Anyways, an editor launched by an IRC client running in a terminal can be tricky to debug. Various methods are possible; without time pressure a good option is to sit on your hands, or to otherwise prevent the use of the computer. What then? Think about the problem, form a hypothesis, make a plan. If one instead leaps into a favorite tool, there is probably no thought, or not much thought beyond "my code is perfect, how could this be?" That is to say, denial. With a little thought (and some knowledge) one might suspect that the editor is spinning through an event loop. A brief look with a processing tracing tool might help confirm this.
Different problems may point in different directions, like maybe it's a load balancer, or DNS, or something totally unrelated to wading in with a debugger. And how would you know that without sitting on your hands and thinking about the problem a bit?
With a hypothesis--event loop spin--we can then devise various tests and new questions, such as "where the heck is the event loop for ex" which you probably have not memorized, or the test "is this an event loop spin problem, or something else?"
... /home/jmates/tmp/.tmpH4TNLF: unmodified: line 3 :!tty /dev/ttyp4 ! :
Knowing the tty the program runs in helps isolate the exact process, as there might be multiple processes open, which in fact there are, as there is a different test ongoing in another terminal that will be covered later. A backtrace is good to have. This on OpenBSD requires root, and otherwise may not be possible under various conditions. If possible, it does give a good list of files and line numbers to look at.
$ ps auwwx | grep ex | grep p4 jmates 99147 59.3 0.0 1328 2732 p4 R+p/2 8:53AM 0:18.88 /home/jmates/usr/OpenBSD7.2-amd64/bin/ex /home/jmates/tmp/.tmpH4TNLF $ doas egdb -q -p 99147 `which ex` Reading symbols from /home/jmates/usr/OpenBSD7.2-amd64/bin/ex... ... (gdb) bt #0 _thread_sys_read () at /tmp/-:3 #1 0x00000455248401ee in _libc_read_cancel (fd=fd@entry=0, buf=buf@entry=0x454e3424400, nbytes=nbytes@entry=255) at /usr/src/lib/libc/sys/w_read.c:27 #2 0x000004528d90020f in cl_read (blen=256, tp=<optimized out>, nrp=<synthetic pointer>, bp=<optimized out>, flags=<optimized out>, sp=0x454e34267a0) at /home/jmates/src/vi/build/../cl/cl_read.c:65 #3 cl_event (sp=0x454e34267a0, evp=0x7f7ffffc8760, flags=0, ms=0) at /home/jmates/src/vi/build/../cl/cl_read.c:97 #4 0x000004528d91c2dd in v_event_get (sp=sp@entry=0x454e34267a0, argp=argp@entry=0x7f7ffffc8760, timeout=timeout@entry=0, flags=flags@entry=0) at /home/jmates/src/vi/build/../common/key.c:499 #5 0x000004528d916afe in ex_txt (sp=sp@entry=0x454e34267a0, tiqh=tiqh@entry=0x454e3426868, prompt=prompt@entry=58 ':', flags=flags@entry=4196896) at /home/jmates/src/vi/build/../ex/ex_txt.c:100 #6 0x000004528d906a3a in ex (spp=<optimized out>) at /home/jmates/src/vi/build/../ex/ex.c:102 ... (gdb) q
Another method, especially if the program lacks debugging symbols or if debugging is not possible--there may be only a bug report, and certainly no access to the customer's system--is to find where the event loop is. A debug build does help with this, as one can run the program under the debugger, find its PID, then hit it with a STOP signal to drop into the debugger:
$ ln build/vi build/ex $ egdb -q build/ex Reading symbols from build/ex... (gdb) r Starting program: /home/jmates/src/vi/build/ex /tmp/vi.xcb7P3j65k: new file: line 1 : Program received signal SIGSTOP, Stopped (signal). _thread_sys_read () at /tmp/-:3 3 /tmp/-: No such file or directory. (gdb)
A backtrace then shows a similar story, so we probably should look at the files and lines involved to determine what exactly the code is doing.
A slower way to do this is to study the code and trace the execution down from main() to the event loop. In some programs this may be complicated by there being multiple main() calls for different platforms or for SDL verses terminal versions of the code.
Common problems at this stage include debugging the wrong binary (did the build system cache something out of date and not tell you?) or debugging the right program but on the wrong host, etc. One trick is to stick an abort() or similar into code to prove that changes to the code are being recognized and encountered. If you have production systems, please take extra special caution to confirm that you are working in the right environment.
Given that the editor goes awry in one environment but not another, another method would be to discover any differences between those environments (the environment variables, the terminal state via `stty -a`, etc), how the file descriptors are wired up (`fstat -p ...`), etc. One may also be interested in exactly how "forking and execing an editor with a temporary file" is done in the problematic process.
As mentioned somewhere above, process tracing will help show what a process is doing at the system call level.
... 42136 ex RET read -1 errno 35 Resource temporarily unavailable 42136 ex CALL read(0,0x3d94b585400,0xff) 42136 ex RET read -1 errno 35 Resource temporarily unavailable 42136 ex CALL read(0,0x3d94b585400,0xff) 42136 ex RET read -1 errno 35 Resource temporarily unavailable 42136 ex CALL read(0,0x3d94b585400,0xff) ... $ doas pkg_add moreutils ... $ errno 35 EAGAIN 35 Resource temporarily unavailable $ man read | fgrep -1 EAGAIN [EAGAIN] The file was marked for non-blocking I/O, and no data were ready to be read.
ktrace(1) shows (with a lot of output!) that there is a read(2) busy loop caused by EAGAIN again and again and ...
Two lines of attack are possible; first, to fix the terminal IRC client to not setup a problematic environment, and second, to correct ex(1) to behave correctly in the problematic environment. Both approaches have merit; there may be other programs besides ex(1) that misbehave, and ex(1) may misbehave when run under other random programs that setup a problematic environment.
Our hypothesis (remember to sit on your hands, or wash the dishes, or to take that walk!) might be that the socket being read from is usually blocking, but has been set by the terminal IRC client to be non-blocking. A kluge fix in ex would be to check for EAGAIN and sleep before looping, which would at least keep the CPU usage low until a better fix can be devised. Assuming you knew where to put such a delay.
As it turns out, the backtrace does not help much beyond outlining the files and calls involved; progress might be made by sprinkling breakpoints about the tree, continuing the process, and entering commands until a suitable breakpoint is tripped over?
Another problem is that a terminal IRC client may be difficult to iterate on, even if you point it only at a local irc server. Certainly after a provisional fix a full client test is in order, but more rapid iteration could be had with a mock program that runs ex with various filehandles set to blocking or non-blocking. This will also give independent verification of the problem, if we can reproduce it, which may help with a bug report.
This was a much more productive line than messing around in the debugger, as it quickly confirmed the issue as caused by the parent program setting O_NONBLOCK on the STDIN_FILENO. And it created a minimal reproducible test case in about 30 lines of code.
And, based on that, a provisional patch.
However, this patch was rejected by the OpenBSD folks; every tool would need to make a similar check in main() to see if stdin has been set to non-blocking, and to try to restore the previous state, which may not be possible if the program crashes, or exec's something else.
As it turns out, non-blocking was set in the terminal IRC client to workaround a WSL bug; therefore, Microsoft continues to waste my time even if I do not touch their products. Ugh!
https://github.com/osa1/tiny/issues/269
https://github.com/microsoft/WSL/issues/3507
On the plus side I now have some code to test for and warn about this condition.