The lead I thought I had [1] was a red herring. I thought it may have had something to do with reporting errors back to the client as seen from the logs:
Dec 04 21:44:38 daemon info 71.19.142.20 gopher maxco=1 runco=0 toq=0 sc=1 mem=3012577 Dec 04 21:46:23 daemon err 71.19.142.20 gopher stat("/home/spc/gopher/share/MGLNDD_71.19.142.20_70") = No such file or directory Dec 04 21:46:23 daemon info 71.19.142.20 gopher remote=XXXXXXXXXXXXXXX status=false request="MGLNDD_71.19.142.20_70" bytes=68 Dec 04 21:49:38 daemon info 71.19.142.20 gopher maxco=2 runco=0 toq=0 sc=1 mem=2903218 Dec 04 21:50:52 daemon info 71.19.142.20 gopher remote=XXXXXXXXXXXXXXX status=true request="CONNECT api64.ipify.org:443 HTTP/1.1" bytes=562 Dec 04 21:54:38 daemon info 71.19.142.20 gopher maxco=2 runco=0 toq=0 sc=1 mem=3035838
Notice how maxco (total number of coroutines) increments and stays that way after a failed request. And the evidence was pretty convincing too:
Dec 04 22:19:38 daemon info 71.19.142.20 gopher maxco=2 runco=0 toq=0 sc=1 mem=3411531 Dec 04 22:23:44 daemon info 71.19.142.20 gopher remote=XXXXXXXXXXXXXXX status=true request="Phlog:2010/03/08" bytes=189 Dec 04 22:24:38 daemon info 71.19.142.20 gopher maxco=2 runco=0 toq=0 sc=1 mem=3185119 Dec 04 22:24:39 daemon info 71.19.142.20 gopher remote=XXXXXXXXXXXXXXX status=false request="\3\0\0/*?\0\0\0\0\0Cookie: mstshash=Administr" bytes=82 Dec 04 22:25:57 daemon info 71.19.142.20 gopher remote=XXXXXXXXXXXXXXX status=true request="Phlog:2006/11/19.1" bytes=1028 Dec 04 22:29:38 daemon info 71.19.142.20 gopher maxco=3 runco=0 toq=0 sc=1 mem=3242133 Dec 04 22:33:31 daemon info 71.19.142.20 gopher remote=XXXXXXXXXXXXXXX status=true request="Phlog:" bytes=978 Dec 04 22:34:38 daemon info 71.19.142.20 gopher maxco=3 runco=0 toq=0 sc=1 mem=3207881
Error reporting with the gopher protocol is clearly an afterthought. The official RFC (Request For Comments) [2] has two occurances of the word “error” in it—and one of them is redundant. I did read somewhere (that's difficult to find now) that perhaps gopher should simple close the connection upon an error instead of sending an “error” to the client, so I thought I would try that. Instead of sending:
3Selector not foundHTfooHTgopher.conman.orgHT70CRLF
I would just close the connection.
That didn't work. The gopher server was still getting stuck. Attaching gdb to the stuck process didn't show anything, as the Lua [3] executable I was using didn't have debugging symbols. So then I recompiled Lua and the modules used that were written in C to include debugging information and restarted the server, yet again.
So now I think I think I found the root issue. Attaching gdb this time showed the server was stuck in LPEG (Lua Parsing Expression Grammars) [4]. Even better, I could see the text it was trying to parse and well … previously I said, “[t]he code hasn't changed since April.” That's not quite true. The server code hadn't changed since April, but an extension had! Back in late October [5] I modified the code that renders my blog on gopher to use Unicode combining characters to do some typographical tricks, and it seems that the code used to wrap the text just … wasn't up to par (Unicode is hard! Let's go to Mars!).
I also noticed that my Gemini server [6] had finally crashed—hard. And I changed that too, to use Unicode typographical tricks. So out it comes!
Let's see if this was the problem.
It totally was the problem. [7]
[2] https://www.ietf.org/rfc/rfc1436.txt
[4] http://www.inf.puc-rio.br/~roberto/lpeg/