I somehow managed to introduce a bug into this site’s configuration. It manifests as one of the processes serving the site eating all the CPU. There seem to be about zero to three of these every day.
Since I can’t debug such rare events, I have to rely on logging, but my efforts so far haven’t been successful. So now I’m disabling parts of this site’s functionality, piece by piece. Let’s see whether I can pare it down to a set that no longer runs into the problem.
Currently disabled:
No longer disabled:
#Administration #Oddmuse
(Please contact me if you want to remove your comment.)
⁂
Can you not attach a debugger to the hung process to see where it’s stuck?
– Sean Conner 2023-06-28 19:50 UTC
---
I can attach gdb to the running Perl process, but I can’t get a Perl stacktrace, for example. Do you know how I’d do that?
– Alex 2023-06-28 20:38 UTC
---
You can call C functions in the target process from GDB, so if you know how to get a stack trace from some Perl internal function, that could work, but short of that ...
– Sean Conner 2023-06-29 02:43 UTC
---
`App::Stacktrace` sounds like it should do that, but it runs into an error due to inaccessible memory even when running as root.
Maybe I can install a signal handler with `Signal::StackTrace` and write a stack trace to a file… That might be interesting. 🤔
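A minimal hand-rolled sketch of the same idea, using only core Carp rather than the module (the log path is just an example):

```
use strict;
use warnings;
use Carp ();

# Hypothetical trace log; a real config would pick its own location.
my $trace_log = '/home/alex/farm/stacktrace.log';

$SIG{USR1} = sub {
    # Carp::longmess returns the message plus a full Perl stack trace,
    # starting from wherever the signal interrupted the program.
    if (open my $fh, '>>', $trace_log) {
        print $fh scalar(localtime), " pid $$\n",
            Carp::longmess('SIGUSR1 received');
        close $fh;
    }
};
```

Then `kill -USR1 <pid>` should append a trace of whatever the process was doing at that moment.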
– Alex 2023-06-29 06:07 UTC
---
The problem persists.
```
root@sibirocobombus:~# ps aux | grep alexschroeder2
alex 246068 0.0 0.9 80088 58880 ? S Jun28 0:00 /home/alex/farm/alexschroeder2.pl
alex 264093 0.0 0.6 58988 40908 ? S 00:00 0:26 /home/alex/farm/alexschroeder2.pl
alex 278607 0.0 0.9 80052 56708 ? S 00:56 0:00 /home/alex/farm/alexschroeder2.pl
alex 925386 0.0 1.0 80020 61416 ? S 03:06 0:00 /home/alex/farm/alexschroeder2.pl
alex 976388 0.0 1.0 81288 62628 ? S 06:16 0:00 /home/alex/farm/alexschroeder2.pl
alex 988390 0.0 1.0 81456 62604 ? S 07:04 0:00 /home/alex/farm/alexschroeder2.pl
alex 1055069 0.0 1.0 85408 66704 ? S 11:15 0:00 /home/alex/farm/alexschroeder2.pl
alex 1065132 0.0 1.0 80040 61368 ? S 11:57 0:00 /home/alex/farm/alexschroeder2.pl
alex 1150350 0.0 1.0 79896 61044 ? S 17:17 0:00 /home/alex/farm/alexschroeder2.pl
alex 1176184 0.9 0.7 62988 43300 ? S 18:51 0:15 /home/alex/farm/alexschroeder2.pl
alex 1180038 0.9 0.7 62856 43216 ? S 19:05 0:06 /home/alex/farm/alexschroeder2.pl
root 1183930 0.0 0.0 6332 2252 pts/0 S+ 19:17 0:00 grep alexschroeder2
```
The site should have just two workers. Around midnight, the process count started climbing again.
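For reference, the worker count comes from the Hypnotoad settings. In a plain Mojolicious app that would look roughly like the sketch below; the actual farm script may be set up differently, and the listen port here is made up.

```
use Mojolicious::Lite;

# Hypnotoad reads its settings from the app's "hypnotoad" config section.
app->config(hypnotoad => {
    listen  => ['http://127.0.0.1:8080'],  # made-up port
    workers => 2,                          # the two workers expected above
});

get '/' => sub { shift->render(text => 'ok') };

app->start;
```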
– Alex 2023-06-29 17:24 UTC
---
OK, so here’s what I’ve done. I have a separate log, outside of Apache’s access.log. It logs the PID, PATH_INFO and QUERY_STRING as soon as the config file is read, which happens for every request. Should I find a process hanging around, I can check this log to see what request it last served. At the same time, I installed `Signal::StackTrace`, so I’m hoping I can get a stack trace into the log file by sending the process SIGUSR1. I’ve tested it and it works. Let’s see whether it helps identify the tasks these processes get stuck on.
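The logging half boils down to something like this sketch (not the actual config; the format just mimics the log lines quoted further down):

```
# Runs every time the config file is evaluated, i.e. once per request.
if (open my $log, '>>', '/home/alex/farm/alexschroeder.log') {
    my $request = join ' ', grep { defined && length }
        $ENV{REQUEST_METHOD}, $ENV{PATH_INFO} || '/', $ENV{QUERY_STRING};
    printf $log "[%s] hypnotoad: %d %s\n", scalar localtime, $$, $request;
    close $log;
}
```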
– Alex 2023-06-29 20:14 UTC
---
I hope it works.
– Sean Conner 2023-06-29 21:41 UTC
---
OK, so I’m coming back to this. I’m looking at a bunch of processes. The one process with the two children is the Hypnotoad with its two workers, I assume.
```
alex@sibirocobombus:~$ ps axjf | grep alexschroeder2
2055660 2058218 2058217 2055660 pts/0 2058217 S+ 1000 0:00 | \_ grep alexschroeder2
1 1250817 1226551 1226551 ? -1 S 1000 0:00 /home/alex/farm/alexschroeder2.pl
1 1261027 1226551 1226551 ? -1 S 1000 0:11 /home/alex/farm/alexschroeder2.pl
1261027 2049769 1226551 1226551 ? -1 R 1000 0:11 \_ /home/alex/farm/alexschroeder2.pl
1261027 2055621 1226551 1226551 ? -1 R 1000 0:03 \_ /home/alex/farm/alexschroeder2.pl
1 1902088 1226551 1226551 ? -1 S 1000 0:00 /home/alex/farm/alexschroeder2.pl
1 1913690 1226551 1226551 ? -1 S 1000 0:00 /home/alex/farm/alexschroeder2.pl
1 1947468 1226551 1226551 ? -1 R 1000 293:02 /home/alex/farm/alexschroeder2.pl
1 1982503 1226551 1226551 ? -1 S 1000 0:00 /home/alex/farm/alexschroeder2.pl
1 2003231 1226551 1226551 ? -1 S 1000 0:00 /home/alex/farm/alexschroeder2.pl
1 2044564 1226551 1226551 ? -1 S 1000 0:00 /home/alex/farm/alexschroeder2.pl
1 2047229 1226551 1226551 ? -1 S 1000 0:00 /home/alex/farm/alexschroeder2.pl
```
Picking the one that’s been running for hours:
```
alex@sibirocobombus:~/farm$ grep 1947468 alexschroeder.log
[Fri Jun 30 04:28:05 2023] hypnotoad: 1947468 GET / action=rss;full=1;short=0;from=1685413682
alex@sibirocobombus:~/farm$ perl -MDateTime -e '$dt=DateTime->from_epoch(epoch=>1685413682); print $dt->ymd, " ", $dt->hms, "\n";'
2023-05-30 02:28:02
```
OK, so it was fetching the RSS feed for changes going all the way back to the end of May. Why would anybody do that? Surely some bot running wild.
```
root@sibirocobombus:~# grep 1685413682 /var/log/apache2/access.log
alexschroeder.ch:443 178.209.50.237 - - [30/Jun/2023:04:28:04 +0200] "GET /wiki?action=rss;full=1;short=0;from=1685413682 HTTP/1.1" 502 5373 "-" "libwww-perl/6.51"
```
Indeed. Some Perl programmer. Thanks.
Let’s investigate some more, grepping for all the other PIDs.
```
[Fri Jun 30 01:52:09 2023] hypnotoad: 1902088 GET /
[Fri Jun 30 02:34:40 2023] hypnotoad: 1913690 GET /
[Fri Jun 30 06:24:16 2023] hypnotoad: 1982503 GET /
[Fri Jun 30 07:44:13 2023] hypnotoad: 2003231 GET /
[Fri Jun 30 10:15:46 2023] hypnotoad: 2044564 GET /
[Fri Jun 30 10:28:58 2023] hypnotoad: 2047229 GET /
```
Now that’s disappointing. I was hoping they’d all run into the same problem: something about Recent Changes or the RSS feed. But no.
– Alex 2023-06-30 09:32 UTC
---
Weird: after killing some processes, that process from yesterday is still there:
```
root@sibirocobombus:~# ps aux | grep alexschroeder2
alex 1250817 0.0 1.0 80204 60772 ? S Jun29 0:00 /home/alex/farm/alexschroeder2.pl
alex 1261027 0.0 0.6 58964 42236 ? S 00:00 0:11 /home/alex/farm/alexschroeder2.pl
alex 2061815 1.0 0.7 62972 43716 ? S 11:28 0:11 /home/alex/farm/alexschroeder2.pl
alex 2066671 0.8 0.5 58964 35064 ? S 11:47 0:00 /home/alex/farm/alexschroeder2.pl
root 2066695 0.0 0.0 6332 2168 pts/0 S+ 11:47 0:00 grep alexschroeder2
```
Just taking a peek with gdb:
```
root@sibirocobombus:~# gdb -p 1250817
...
(gdb) bt
#0  0x00007f6c4b48e0e0 in __GI___libc_write (fd=1, buf=0x564f20bd9cc0, nbytes=449) at ../sysdeps/unix/sysv/linux/write.c:26
#1  0x0000564f1df0fa9c in PerlIOUnix_write ()
#2  0x0000564f1df121d7 in PerlIOBuf_flush ()
#3  0x0000564f1df1101e in Perl_PerlIO_flush ()
#4  0x00007f6c4ad4b0b6 in PerlIOEncode_flush () from /home/alex/perl5/perlbrew/perls/perl-5.32.0/lib/5.32.0/x86_64-linux/auto/PerlIO/encoding/encoding.so
#5  0x0000564f1df1101e in Perl_PerlIO_flush ()
#6  0x0000564f1de894d0 in Perl_pp_print ()
#7  0x0000564f1de86173 in Perl_runops_standard ()
#8  0x0000564f1ddff13c in perl_run ()
#9  0x0000564f1ddd7292 in main ()
```
Stuck writing?
Whaaaaat.
– Alex 2023-06-30 09:49 UTC
---
Unrelated: Perhaps time to temporarily reduce the mirroring to Gemini?
```
root@sibirocobombus:~# journalctl --unit phoebe.service | sed -n -e 's/.*\[info\] Looking at gemini:\/\/\([^\/:]*\).*/\1/p' | sort | uniq -c
   1126 alexschroeder.ch
      9 campaignwiki.org
   1720 communitywiki.org
     87 emacswiki.org
      8 next.oddmuse.org
     20 toki.transjovian.org
    828 transjovian.org
      8 vault.transjovian.org
      1 xn--vxagggm5c.transjovian.org
```
I somehow feel that perhaps serving communitywiki.org over Gemini is a mistake, for example. For the moment I have removed alexschroeder.ch, communitywiki.org, campaignwiki.org and emacswiki.org from the list.
– Alex 2023-06-30 09:56 UTC
---
I can verify that the RSS request kills it.
The problem is in `OddMuse::StripRollbacks`, line 1515; in the eval on line 1550, to be precise. Interesting. Time to copy that log, including the rollbacks, to a local test environment.
– Alex 2023-06-30 10:13 UTC
---
That rollback issue seems to be fixed in commit a09c84.
– Alex 2023-06-30 11:05 UTC
---
The Gemini hunch… I was just looking at PID 2070991. The log told me it was looking at the main page. The Stacktrace said it was here:
It’s possible that the GeminiRule is borked. Let’s keep an eye out for this.
– Alex 2023-06-30 11:23 UTC
---
I’ve found another one, PID 2145009. No children, PPID 1. Running `kill -s SIGUSR1 2145009` and checking `farm/alexschroeder.log` shows that it’s running:
I’m starting to think that these are processes that aren’t getting a signal, like SIGPIPE or whatever they should be getting when their output no longer goes anywhere.
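As a tiny illustration of that idea in plain Perl (not the hypnotoad code): when SIGPIPE is ignored, writing to a pipe nobody reads merely fails with EPIPE instead of killing the process.

```
use strict;
use warnings;

pipe(my $reader, my $writer) or die "pipe: $!";
close $reader;              # simulate the reader going away

$SIG{PIPE} = 'IGNORE';      # with the default handler, the syswrite below kills the process

my $written = syswrite($writer, "hello\n");
print defined $written
    ? "wrote $written bytes\n"
    : "syswrite failed: $!\n";   # "Broken pipe" (EPIPE)
```

A peer that stays connected but simply stops reading is different again: the write just blocks once the buffer is full, which would match the earlier gdb backtrace stuck in `__GI___libc_write`.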
Looking at the logs:
```
root@sibirocobombus:~# grep 2145009 /home/alex/farm/alexschroeder.log
[Fri Jun 30 16:21:57 2023] hypnotoad: 2145009 GET /
root@sibirocobombus:~# grep "alexschroeder.*16:21:5.*GET /wiki" /var/log/apache2/access.log
alexschroeder.ch:443 35.76.226.70 - - [30/Jun/2023:16:21:51 +0200] "GET /wiki/feed/full/RPG HTTP/1.1" 502 5425 "-" "python-requests/2.28.2"
alexschroeder.ch:443 178.209.50.237 - - [30/Jun/2023:16:21:56 +0200] "GET /wiki HTTP/1.1" 502 5373 "-" "Monit/5.33.0"
```
Sounds like it’s the Monit request that made it crash!
Well, I can definitely have Monit check a URL with a much shorter reply, that’s for sure.
– Alex 2023-06-30 15:12 UTC
---
If you are on Linux, you can check `/proc/<pid>/status` to see whether SIGPIPE is being caught or not (you’ll have to find the number for SIGPIPE, then count bits, but it’s possible). Some code bases will ignore SIGPIPE (like some of mine) because ... signals are a pain to deal with sanely.
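For example, a quick sketch (SIGPIPE is signal 13 on Linux, so it’s bit 12, counting from zero, in the hex masks):

```
use strict;
use warnings;

# Usage: perl sigpipe-check.pl <pid>   (hypothetical helper, nothing official)
my $pid     = shift or die "usage: $0 pid\n";
my $sigpipe = 1 << (13 - 1);   # signal N lives in bit N-1 of the mask

open my $status, '<', "/proc/$pid/status"
    or die "cannot read /proc/$pid/status: $!";
while (<$status>) {
    next unless /^(SigBlk|SigIgn|SigCgt):\s+([0-9a-fA-F]+)/;
    printf "%s: SIGPIPE %s\n", $1, (hex($2) & $sigpipe) ? 'yes' : 'no';
}
```

A set bit under SigIgn means the process ignores SIGPIPE; under SigCgt it means the process installed its own handler.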
– Sean Conner 2023-06-30 22:12 UTC
---
Thanks for the info!
In the meantime: I’m currently on a trip, so there’s not much time.
```
alex@sibirocobombus:~$ ps axjf | grep alexschroeder2
197719 197743 197742 197719 pts/0 197742 S+ 1000 0:00 \_ grep alexschroeder2
1 3134541 1226551 1226551 ? -1 R 1000 1420:28 /home/alex/farm/alexschroeder2.pl
1 3207068 3136967 3136967 ? -1 S 1000 0:00 /home/alex/farm/alexschroeder2.pl
1 3248096 3136967 3136967 ? -1 S 1000 0:00 /home/alex/farm/alexschroeder2.pl
1 4132545 3136967 3136967 ? -1 R 1000 369:25 /home/alex/farm/alexschroeder2.pl
1 4134899 4134899 4134899 ? -1 Ss 1000 0:13 /home/alex/farm/alexschroeder2.pl
4134899 187823 4134899 4134899 ? -1 R 1000 0:13 \_ /home/alex/farm/alexschroeder2.pl
4134899 196398 4134899 4134899 ? -1 R 1000 0:02 \_ /home/alex/farm/alexschroeder2.pl
1 75274 4134899 4134899 ? -1 S 1000 0:02 /home/alex/farm/alexschroeder2.pl
1 101555 4134899 4134899 ? -1 S 1000 0:00 /home/alex/farm/alexschroeder2.pl
```
Clearly, 3134541 is not cool. It’s busy doing something related to recent changes.
```
alex@sibirocobombus:~$ grep 3134541 /home/alex/farm/alexschroeder.log.1
[Sat Jul 1 04:28:02 2023] hypnotoad: 3134541 GET / action=rss;full=1;short=0;from=1685413682
```
It also doesn’t react to SIGUSR1. Or so I thought. But apparently its standard error remains connected to the `alexschroeder.log.1` file instead of following the rotation to the “current” log file. Level 2 of the stack is our friend `OddMuse::StripRollbacks`. Now I’m wondering: did I install the changes I made? I don’t think so.
Time to remedy that.
– Alex 2023-07-02 19:22 UTC
---
The problem is still not solved entirely. A few days ago, the number of processes jumped from 3 to 5.
```
alex@sibirocobombus:~/farm$ ps axjf | grep alexschroeder2
3322360 3323606 3323605 3322360 pts/0 3323605 S+ 1000 0:00 \_ grep alexschroeder2
1 703435 4134899 4134899 ? -1 S 1000 0:00 /home/alex/farm/alexschroeder2.pl
1 703439 4134899 4134899 ? -1 S 1000 0:00 /home/alex/farm/alexschroeder2.pl
1 2457442 4134899 4134899 ? -1 S 1000 0:21 /home/alex/farm/alexschroeder2.pl
2457442 2457448 4134899 4134899 ? -1 S 1000 9:09 \_ /home/alex/farm/alexschroeder2.pl
2457442 2457449 4134899 4134899 ? -1 S 1000 9:10 \_ /home/alex/farm/alexschroeder2.pl
alex@sibirocobombus:~/farm$ zgrep -a 703435 alexschroeder.log*
alexschroeder.log.2.gz:[Sat Jul 15 09:08:55 2023] hypnotoad: 703435 GET /Emacs
alex@sibirocobombus:~/farm$ zgrep -a 703439 alexschroeder.log*
alexschroeder.log.2.gz:[Sat Jul 15 09:10:18 2023] hypnotoad: 703439 GET /Knives
alex@sibirocobombus:~/farm$ kill 703435
alex@sibirocobombus:~/farm$ kill 703439
```
All I can say is that it doesn’t look like there’s anything in particular going on… Emacs and Knives are legitimate pages to request.
These requests also don’t all happen at the same time, so it’s not that there is a particularly vulnerable time period.
Graph showing the jump in number of processes
As we can see in the graph, there was a cleanup on July 9 where I killed all the extra processes. The number stayed at 3 until July 15, when it jumped to 5. Today is July 17 and it is still at 5.
The only thing I know is that July 15 appears to have been a busy day.
July 15 CPU usage is above average
July 15 memory spike with cache all but disappearing at one point
– Alex 2023-07-17 14:13 UTC