I somehow managed to introduce a bug into this site’s configuration. It manifests as one of the processes serving the site eating all the CPU. There seem to be about zero to three of these every day.
Since I can’t debug such rare events, I have to rely on logging, but my efforts so far haven’t been successful. So now I’m disabling parts of this site’s functionality, piece by piece. Let’s see whether I can pare it down to a set that no longer runs into the problem.
Currently disabled:
No longer disabled:
#Administration #Oddmuse
(Please contact me if you want to remove your comment.)
⁂
Can you not attach a debugger to the hung process to see where it’s stuck?
– Sean Conner 2023-06-28 19:50 UTC
---
I can attach gdb to the running Perl process, but I can’t get a Perl stacktrace, for example. Do you know how I’d do that?
– Alex 2023-06-28 20:38 UTC
---
You can call C functions in the target process from GDB, so if you know how to get a stack trace from some Perl internal function, that could work, but short of that ...
– Sean Conner 2023-06-29 02:43 UTC
---
`App::Stacktrace` sounds like it should do that, but it runs into an error due to inaccessible memory even when running as root.
Maybe I can install a signal handler with `Signal::StackTrace` and write a stack trace to a file… That might be interesting. 🤔
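A minimal hand-rolled sketch of the same idea, using only core Carp rather than the module (the log path is just an example):

```
use strict;
use warnings;
use Carp ();

# Hypothetical trace log; a real config would pick its own location.
my $trace_log = '/home/alex/farm/stacktrace.log';

$SIG{USR1} = sub {
    # Carp::longmess returns the message plus a full Perl stack trace,
    # starting from wherever the signal interrupted the program.
    if (open my $fh, '>>', $trace_log) {
        print $fh scalar(localtime), " pid $$\n",
            Carp::longmess('SIGUSR1 received');
        close $fh;
    }
};
```

Then `kill -USR1 <pid>` should append a trace of whatever the process was doing at that moment.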
– Alex 2023-06-29 06:07 UTC
---
The problem persists.
```
root@sibirocobombus:~# ps aux | grep alexschroeder2
alex 246068 0.0 0.9 80088 58880 ? S Jun28 0:00 /home/alex/farm/alexschroeder2.pl
alex 264093 0.0 0.6 58988 40908 ? S 00:00 0:26 /home/alex/farm/alexschroeder2.pl
alex 278607 0.0 0.9 80052 56708 ? S 00:56 0:00 /home/alex/farm/alexschroeder2.pl
alex 925386 0.0 1.0 80020 61416 ? S 03:06 0:00 /home/alex/farm/alexschroeder2.pl
alex 976388 0.0 1.0 81288 62628 ? S 06:16 0:00 /home/alex/farm/alexschroeder2.pl
alex 988390 0.0 1.0 81456 62604 ? S 07:04 0:00 /home/alex/farm/alexschroeder2.pl
alex 1055069 0.0 1.0 85408 66704 ? S 11:15 0:00 /home/alex/farm/alexschroeder2.pl
alex 1065132 0.0 1.0 80040 61368 ? S 11:57 0:00 /home/alex/farm/alexschroeder2.pl
alex 1150350 0.0 1.0 79896 61044 ? S 17:17 0:00 /home/alex/farm/alexschroeder2.pl
alex 1176184 0.9 0.7 62988 43300 ? S 18:51 0:15 /home/alex/farm/alexschroeder2.pl
alex 1180038 0.9 0.7 62856 43216 ? S 19:05 0:06 /home/alex/farm/alexschroeder2.pl
root 1183930 0.0 0.0 6332 2252 pts/0 S+ 19:17 0:00 grep alexschroeder2
```
The site should have just two workers. Around midnight, the process count started climbing again.
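For reference, the worker count comes from the Hypnotoad settings. In a plain Mojolicious app that would look roughly like the sketch below; the actual farm script may be set up differently, and the listen port here is made up.

```
use Mojolicious::Lite;

# Hypnotoad reads its settings from the app's "hypnotoad" config section.
app->config(hypnotoad => {
    listen  => ['http://127.0.0.1:8080'],  # made-up port
    workers => 2,                          # the two workers expected above
});

get '/' => sub { shift->render(text => 'ok') };

app->start;
```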
– Alex 2023-06-29 17:24 UTC
---
OK, so here’s what I’ve done. I have a separate log, outside of Apache’s access.log. It logs the PID, PATH_INFO and QUERY_STRING as soon as the config file is read, which happens for every request. Should I find a process hanging around, I can check this log to see what request it last served. At the same time, I installed `Signal::StackTrace`, so I’m hoping I can get a stack trace into the log file by sending the process SIGUSR1. I’ve tested it and it works. Let’s see whether it helps identify the tasks these processes get stuck on.
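The logging half boils down to something like this sketch (not the actual config; the format just mimics the log lines quoted further down):

```
# Runs every time the config file is evaluated, i.e. once per request.
if (open my $log, '>>', '/home/alex/farm/alexschroeder.log') {
    my $request = join ' ', grep { defined && length }
        $ENV{REQUEST_METHOD}, $ENV{PATH_INFO} || '/', $ENV{QUERY_STRING};
    printf $log "[%s] hypnotoad: %d %s\n", scalar localtime, $$, $request;
    close $log;
}
```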
– Alex 2023-06-29 20:14 UTC
---
I hope it works.
– Sean Conner 2023-06-29 21:41 UTC
---
OK, so I’m coming back to this. I’m looking at a bunch of processes. The one process with the two children is the Hypnotoad with its two workers, I assume.
```
alex@sibirocobombus:~$ ps axjf | grep alexschroeder2
2055660 2058218 2058217 2055660 pts/0 2058217 S+ 1000 0:00 | \_ grep alexschroeder2
1 1250817 1226551 1226551 ? -1 S 1000 0:00 /home/alex/farm/alexschroeder2.pl
1 1261027 1226551 1226551 ? -1 S 1000 0:11 /home/alex/farm/alexschroeder2.pl
1261027 2049769 1226551 1226551 ? -1 R 1000 0:11 \_ /home/alex/farm/alexschroeder2.pl
1261027 2055621 1226551 1226551 ? -1 R 1000 0:03 \_ /home/alex/farm/alexschroeder2.pl
1 1902088 1226551 1226551 ? -1 S 1000 0:00 /home/alex/farm/alexschroeder2.pl
1 1913690 1226551 1226551 ? -1 S 1000 0:00 /home/alex/farm/alexschroeder2.pl
1 1947468 1226551 1226551 ? -1 R 1000 293:02 /home/alex/farm/alexschroeder2.pl
1 1982503 1226551 1226551 ? -1 S 1000 0:00 /home/alex/farm/alexschroeder2.pl
1 2003231 1226551 1226551 ? -1 S 1000 0:00 /home/alex/farm/alexschroeder2.pl
1 2044564 1226551 1226551 ? -1 S 1000 0:00 /home/alex/farm/alexschroeder2.pl
1 2047229 1226551 1226551 ? -1 S 1000 0:00 /home/alex/farm/alexschroeder2.pl
```
Picking the one that’s been running for hours:
```
alex@sibirocobombus:~/farm$ grep 1947468 alexschroeder.log
[Fri Jun 30 04:28:05 2023] hypnotoad: 1947468 GET / action=rss;full=1;short=0;from=1685413682
alex@sibirocobombus:~/farm$ perl -MDateTime -e '$dt=DateTime->from_epoch(epoch=>1685413682); print $dt->ymd, " ", $dt->hms, "\n";'
2023-05-30 02:28:02
```
OK, so it was fetching the RSS feed for changes going all the way back to the end of May. Why would anybody do that? Surely some bot running wild.
```
root@sibirocobombus:~# grep 1685413682 /var/log/apache2/access.log
alexschroeder.ch:443 178.209.50.237 - - [30/Jun/2023:04:28:04 +0200] "GET /wiki?action=rss;full=1;short=0;from=1685413682 HTTP/1.1" 502 5373 "-" "libwww-perl/6.51"
```
Indeed. Some Perl programmer. Thanks.
Let’s investigate some more, grepping for all the other PIDs.
```
[Fri Jun 30 01:52:09 2023] hypnotoad: 1902088 GET /
[Fri Jun 30 02:34:40 2023] hypnotoad: 1913690 GET /
[Fri Jun 30 06:24:16 2023] hypnotoad: 1982503 GET /
[Fri Jun 30 07:44:13 2023] hypnotoad: 2003231 GET /
[Fri Jun 30 10:15:46 2023] hypnotoad: 2044564 GET /
[Fri Jun 30 10:28:58 2023] hypnotoad: 2047229 GET /
```
Now that’s disappointing. I was hoping they’d all run into the same problem: something about Recent Changes or the RSS feed. But no.
– Alex 2023-06-30 09:32 UTC
---
Weird: after killing some processes, that process from yesterday is still there:
```
root@sibirocobombus:~# ps aux | grep alexschroeder2
alex 1250817 0.0 1.0 80204 60772 ? S Jun29 0:00 /home/alex/farm/alexschroeder2.pl
alex 1261027 0.0 0.6 58964 42236 ? S 00:00 0:11 /home/alex/farm/alexschroeder2.pl
alex 2061815 1.0 0.7 62972 43716 ? S 11:28 0:11 /home/alex/farm/alexschroeder2.pl
alex 2066671 0.8 0.5 58964 35064 ? S 11:47 0:00 /home/alex/farm/alexschroeder2.pl
root 2066695 0.0 0.0 6332 2168 pts/0 S+ 11:47 0:00 grep alexschroeder2
```
Just taking a peek with gdb:
```
root@sibirocobombus:~# gdb -p 1250817
...
(gdb) bt
#0  0x00007f6c4b48e0e0 in __GI___libc_write (fd=1, buf=0x564f20bd9cc0, nbytes=449) at ../sysdeps/unix/sysv/linux/write.c:26
#1  0x0000564f1df0fa9c in PerlIOUnix_write ()
#2  0x0000564f1df121d7 in PerlIOBuf_flush ()
#3  0x0000564f1df1101e in Perl_PerlIO_flush ()
#4  0x00007f6c4ad4b0b6 in PerlIOEncode_flush () from /home/alex/perl5/perlbrew/perls/perl-5.32.0/lib/5.32.0/x86_64-linux/auto/PerlIO/encoding/encoding.so
#5  0x0000564f1df1101e in Perl_PerlIO_flush ()
#6  0x0000564f1de894d0 in Perl_pp_print ()
#7  0x0000564f1de86173 in Perl_runops_standard ()
#8  0x0000564f1ddff13c in perl_run ()
#9  0x0000564f1ddd7292 in main ()
```
Stuck writing?
Whaaaaat.
– Alex 2023-06-30 09:49 UTC
---
Unrelated: Perhaps time to temporarily reduce the mirroring to Gemini?
```
root@sibirocobombus:~# journalctl --unit phoebe.service | sed -n -e 's/.*\[info\] Looking at gemini:\/\/\([^\/:]*\).*/\1/p' | sort | uniq -c
   1126 alexschroeder.ch
      9 campaignwiki.org
   1720 communitywiki.org
     87 emacswiki.org
      8 next.oddmuse.org
     20 toki.transjovian.org
    828 transjovian.org
      8 vault.transjovian.org
      1 xn--vxagggm5c.transjovian.org
```
I somehow feel that perhaps serving communitywiki.org over Gemini is a mistake, for example. For the moment I have removed alexschroeder.ch, communitywiki.org, campaignwiki.org and emacswiki.org from the list.
– Alex 2023-06-30 09:56 UTC
---
I can verify that the RSS request kills it.
The problem is in `OddMuse::StripRollbacks`, line 1515; in the eval on line 1550, to be precise. Interesting. Time to copy that log, including the rollbacks, to a local test environment.
– Alex 2023-06-30 10:13 UTC
---
That rollback issue seems to be fixed in commit a09c84.
– Alex 2023-06-30 11:05 UTC
---
The Gemini hunch… I was just looking at PID 2070991. The log told me it was looking at the main page. The Stacktrace said it was here:
It’s possible that the GeminiRule is borked. Let’s keep an eye out for this.
– Alex 2023-06-30 11:23 UTC
---
I’ve found another one, PID 2145009. No children, PPID 1. Running `kill -s SIGUSR1 2145009` and checking `farm/alexschroeder.log` shows that it’s running:
I’m starting to think that these are processes that aren’t getting a signal, like SIGPIPE or whatever they should be getting when their output no longer goes anywhere.
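As a tiny illustration of that idea in plain Perl (not the hypnotoad code): when SIGPIPE is ignored, writing to a pipe nobody reads merely fails with EPIPE instead of killing the process.

```
use strict;
use warnings;

pipe(my $reader, my $writer) or die "pipe: $!";
close $reader;              # simulate the reader going away

$SIG{PIPE} = 'IGNORE';      # with the default handler, the syswrite below kills the process

my $written = syswrite($writer, "hello\n");
print defined $written
    ? "wrote $written bytes\n"
    : "syswrite failed: $!\n";   # "Broken pipe" (EPIPE)
```

A peer that stays connected but simply stops reading is different again: the write just blocks once the buffer is full, which would match the earlier gdb backtrace stuck in `__GI___libc_write`.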
Looking at the logs:
```
root@sibirocobombus:~# grep 2145009 /home/alex/farm/alexschroeder.log
[Fri Jun 30 16:21:57 2023] hypnotoad: 2145009 GET /
root@sibirocobombus:~# grep "alexschroeder.*16:21:5.*GET /wiki" /var/log/apache2/access.log
alexschroeder.ch:443 35.76.226.70 - - [30/Jun/2023:16:21:51 +0200] "GET /wiki/feed/full/RPG HTTP/1.1" 502 5425 "-" "python-requests/2.28.2"
alexschroeder.ch:443 178.209.50.237 - - [30/Jun/2023:16:21:56 +0200] "GET /wiki HTTP/1.1" 502 5373 "-" "Monit/5.33.0"
```
Sounds like it’s the Monit request that made it crash!
Well, I can definitely have Monit check a URL with a much shorter reply, that’s for sure.
– Alex 2023-06-30 15:12 UTC
---
If you are on Linux, you can check `/proc/<pid>/status` to see whether SIGPIPE is being caught or not (you’ll have to find the number for SIGPIPE, then count bits, but it’s possible). Some code bases will ignore SIGPIPE (like some of mine) because ... signals are a pain to deal with sanely.
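For example, a quick sketch (SIGPIPE is signal 13 on Linux, so it’s bit 12, counting from zero, in the hex masks):

```
use strict;
use warnings;

# Usage: perl sigpipe-check.pl <pid>   (hypothetical helper, nothing official)
my $pid     = shift or die "usage: $0 pid\n";
my $sigpipe = 1 << (13 - 1);   # signal N lives in bit N-1 of the mask

open my $status, '<', "/proc/$pid/status"
    or die "cannot read /proc/$pid/status: $!";
while (<$status>) {
    next unless /^(SigBlk|SigIgn|SigCgt):\s+([0-9a-fA-F]+)/;
    printf "%s: SIGPIPE %s\n", $1, (hex($2) & $sigpipe) ? 'yes' : 'no';
}
```

A set bit under SigIgn means the process ignores SIGPIPE; under SigCgt it means the process installed its own handler.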
– Sean Conner 2023-06-30 22:12 UTC
---
Thanks for the info!
In the meantime: I’m currently on a trip, so there’s not much time.
```
alex@sibirocobombus:~$ ps axjf | grep alexschroeder2
197719 197743 197742 197719 pts/0 197742 S+ 1000 0:00 \_ grep alexschroeder2
1 3134541 1226551 1226551 ? -1 R 1000 1420:28 /home/alex/farm/alexschroeder2.pl
1 3207068 3136967 3136967 ? -1 S 1000 0:00 /home/alex/farm/alexschroeder2.pl
1 3248096 3136967 3136967 ? -1 S 1000 0:00 /home/alex/farm/alexschroeder2.pl
1 4132545 3136967 3136967 ? -1 R 1000 369:25 /home/alex/farm/alexschroeder2.pl
1 4134899 4134899 4134899 ? -1 Ss 1000 0:13 /home/alex/farm/alexschroeder2.pl
4134899 187823 4134899 4134899 ? -1 R 1000 0:13 \_ /home/alex/farm/alexschroeder2.pl
4134899 196398 4134899 4134899 ? -1 R 1000 0:02 \_ /home/alex/farm/alexschroeder2.pl
1 75274 4134899 4134899 ? -1 S 1000 0:02 /home/alex/farm/alexschroeder2.pl
1 101555 4134899 4134899 ? -1 S 1000 0:00 /home/alex/farm/alexschroeder2.pl
```
Clearly, 3134541 is not cool. It’s busy doing something related to recent changes.
```
alex@sibirocobombus:~$ grep 3134541 /home/alex/farm/alexschroeder.log.1
[Sat Jul 1 04:28:02 2023] hypnotoad: 3134541 GET / action=rss;full=1;short=0;from=1685413682
```
It also doesn’t react to SIGUSR1. Or so I thought. But apparently its standard error remains connected to the `alexschroeder.log.1` file instead of following the rotation to the “current” log file. Level 2 of the stack is our friend `OddMuse::StripRollbacks`. Now I’m wondering: did I install the changes I made? I don’t think so.
Time to remedy that.
– Alex 2023-07-02 19:22 UTC
---
The problem is still not solved entirely. A few days ago, the number of processes jumped from 3 to 5.
```
alex@sibirocobombus:~/farm$ ps axjf | grep alexschroeder2
3322360 3323606 3323605 3322360 pts/0 3323605 S+ 1000 0:00 \_ grep alexschroeder2
1 703435 4134899 4134899 ? -1 S 1000 0:00 /home/alex/farm/alexschroeder2.pl
1 703439 4134899 4134899 ? -1 S 1000 0:00 /home/alex/farm/alexschroeder2.pl
1 2457442 4134899 4134899 ? -1 S 1000 0:21 /home/alex/farm/alexschroeder2.pl
2457442 2457448 4134899 4134899 ? -1 S 1000 9:09 \_ /home/alex/farm/alexschroeder2.pl
2457442 2457449 4134899 4134899 ? -1 S 1000 9:10 \_ /home/alex/farm/alexschroeder2.pl
alex@sibirocobombus:~/farm$ zgrep -a 703435 alexschroeder.log*
alexschroeder.log.2.gz:[Sat Jul 15 09:08:55 2023] hypnotoad: 703435 GET /Emacs
alex@sibirocobombus:~/farm$ zgrep -a 703439 alexschroeder.log*
alexschroeder.log.2.gz:[Sat Jul 15 09:10:18 2023] hypnotoad: 703439 GET /Knives
alex@sibirocobombus:~/farm$ kill 703435
alex@sibirocobombus:~/farm$ kill 703439
```
All I can say is that it doesn’t look like there’s anything in particular going on… Emacs and Knives are legitimate pages to request.
These requests also don’t all happen at the same time, so it’s not that there is a particularly vulnerable time period.
Graph showing the jump in number of processes
As we can see in the graph, there was a cleanup on July 9 where I killed all the extra processes. The number stayed at 3 until July 15, when it jumped to 5. Today is July 17 and it is still at 5.
The only thing I know is that July 15 appears to have been a busy day.
July 15 CPU usage is above average
July 15 memory spike with cache all but disappearing at one point
– Alex 2023-07-17 14:13 UTC