2017-02-15 Load

What the hell happened around 18:00 yesterday?

Processes going up

Threads making a big jump

Munin taking minutes to run

CPU 100%

Load over 30

Was it an unusual amount of requests? I don’t think so:

alex@sibirocobombus:~$ sudo cat /var/log/apache2/access.log.1 | bin/time-grouping
      15/Feb/2017:06        426     1%
      15/Feb/2017:05        949     3%
      15/Feb/2017:04       1050     3%
      15/Feb/2017:03       1320     4%
      15/Feb/2017:02       1292     4%
      15/Feb/2017:01       1272     4%
      15/Feb/2017:00       1253     4%
      14/Feb/2017:23       1205     4%
      14/Feb/2017:22       1181     4%
      14/Feb/2017:21       1195     4%
      14/Feb/2017:20       1161     4%
      14/Feb/2017:19       1015     3%
      14/Feb/2017:18        965     3%
      14/Feb/2017:17       1121     4%
      14/Feb/2017:16       1129     4%
      14/Feb/2017:15       1022     3%
      14/Feb/2017:14        852     3%
      14/Feb/2017:13       1085     4%
      14/Feb/2017:12       1094     4%
      14/Feb/2017:11       1154     4%
      14/Feb/2017:10       1115     4%
      14/Feb/2017:09       1198     4%
      14/Feb/2017:08       1408     5%
      14/Feb/2017:07        973     3%
      14/Feb/2017:06        531     1%

Nothing unusual in the Toadfarm logs.

I noticed something weird in my Apache logs, though:

alex@sibirocobombus:~$ sudo cat /var/log/apache2/access.log.1 | bin/leech-detector | head
           IP Number       hits bandw. hits% interv. status code distrib.
           127.0.0.1       1590     4K   5%   54.3s  200 (99%), 301 (0%), 404 (0%)
      62.210.215.100       1255     2K   4%   68.7s  301 (33%), 200 (33%), 302 (32%), 503 (0%)
...

Who is 62.210.215.100 and what are they doing?

communitywiki.org:80 62.210.215.100 - - [15/Feb/2017:06:38:14 +0100] "GET /de?action=rss;rcidonly=OpenWLAN HTTP/1.1" 301 487 "-" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.66 Safari/537.36"
communitywiki.org:443 62.210.215.100 - - [15/Feb/2017:06:38:14 +0100] "GET /de?action=rss;rcidonly=OpenWLAN HTTP/1.1" 302 4026 "-" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.66 Safari/537.36"
communitywiki.org:443 62.210.215.100 - - [15/Feb/2017:06:38:14 +0100] "GET /wiki?action=rss;rcidonly=OpenWLAN HTTP/1.1" 200 4687 "-" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.66 Safari/537.36"
communitywiki.org:80 62.210.215.100 - - [15/Feb/2017:06:39:45 +0100] "GET /en?action=rss;rcidonly=ConservationOfRationality HTTP/1.1" 301 521 "-" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.66 Safari/537.36"
communitywiki.org:443 62.210.215.100 - - [15/Feb/2017:06:39:45 +0100] "GET /en?action=rss;rcidonly=ConservationOfRationality HTTP/1.1" 302 4060 "-" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.66 Safari/537.36"
communitywiki.org:443 62.210.215.100 - - [15/Feb/2017:06:39:45 +0100] "GET /wiki?action=rss;rcidonly=ConservationOfRationality HTTP/1.1" 200 4698 "-" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.66 Safari/537.36"
...

What we see here is that somebody is requesting various pages via HTTP (port 80) and getting redirected permanently (status 301) to HTTPS (port 443) and getting redirected temporarily (status 302) to a different URL (`/wiki` instead of `/en`). We also see that they are getting the URL in various languages.

Just to make sure, I added the following to my `.htaccess` file:

1. idiots downloading every single RSS file
order allow,deny
deny from 62.210.215.100
allow from all

And now they’re getting something else:

communitywiki.org:80 62.210.215.100 - - [15/Feb/2017:08:51:57 +0100] "GET /cw-fr?action=rss;rcidonly=LackOfReworking HTTP/1.1" 301 507 "-" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.66 Safari/537.36"
communitywiki.org:443 62.210.215.100 - - [15/Feb/2017:08:51:57 +0100] "GET /cw-fr?action=rss;rcidonly=LackOfReworking HTTP/1.1" 403 3930 "-" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.66 Safari/537.36"

Their HTTP request gets redirected to HTTPS and then they’re getting an access denied (status 403).

But wait, are they perhaps the cause for all these problems?

alex@sibirocobombus:~$ sudo cat /var/log/apache2/access.log.1 | grep 62.210.215.100 | bin/time-grouping
      15/Feb/2017:06         24     1%
      15/Feb/2017:05         60     4%
      15/Feb/2017:04         54     4%
      15/Feb/2017:03         44     3%
      15/Feb/2017:02         48     3%
      15/Feb/2017:01         48     3%
      15/Feb/2017:00         57     4%
      14/Feb/2017:23         78     6%
      14/Feb/2017:22         38     3%
      14/Feb/2017:21         41     3%
      14/Feb/2017:20         45     3%
      14/Feb/2017:19         42     3%
      14/Feb/2017:18         33     2%
      14/Feb/2017:17         65     5%
      14/Feb/2017:16         46     3%
      14/Feb/2017:15         69     5%
      14/Feb/2017:14         24     1%
      14/Feb/2017:13         45     3%
      14/Feb/2017:12         68     5%
      14/Feb/2017:11         62     4%
      14/Feb/2017:10         69     5%
      14/Feb/2017:09         63     5%
      14/Feb/2017:08         45     3%
      14/Feb/2017:07         48     3%
      14/Feb/2017:06         39     3%

Nope, they didn’t start yesterday at around 18:00.

Oh well. 😟

Restarted Toadfarm. Manually killed all the farm processes I found via `top`. Restarted Apache for good measure. I think we’re good.

I also checked `/var/log/syslog.1` and found nothing surprising around 18:00. Spamd is checking mails, Munin is running, more spam filtering, and that basically covers the time from 17:00 to 19:00.

Well, things are still not good.

top - 10:39:14 up 29 days, 23:18,  1 user,  load average: 5.26, 5.60, 5.69
Tasks:  66 total,   6 running,  57 sleeping,   0 stopped,   3 zombie
%Cpu(s): 98.7 us,  0.7 sy,  0.0 ni,  0.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:   4194304 total,  2157836 used,  2036468 free,        0 buffers
KiB Swap:  1048576 total,   480444 used,   568132 free.  1456564 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
18835 alex      20   0  217104  71452   2284 R  19.0  1.7  20:59.77 /home/alex/farm
17130 alex      20   0  216768  71140   2268 R  18.6  1.7  37:40.49 /home/alex/farm
18286 alex      20   0  217932  72364   2308 R  18.6  1.7  23:40.29 /home/alex/farm
19937 alex      20   0  218128  72652   2332 R  18.6  1.7  16:56.54 /home/alex/farm
22862 alex      20   0  218780  73208   2332 R  18.6  1.7   7:35.41 /home/alex/farm
16345 alex      20   0  209860  68980   3432 S   1.7  1.6   0:52.01 /home/alex/farm
16344 alex      20   0  212272  69720   2180 S   0.7  1.7   0:52.34 /home/alex/farm
16346 alex      20   0  208112  65936   2012 S   0.7  1.6   0:55.12 /home/alex/farm
16347 alex      20   0  210904  68176   2120 S   0.7  1.6   0:52.76 /home/alex/farm

And they keep coming back. I’m slowly taking sites offline to where the trouble lies.

​#Apache ​#Toadfarm ​#Administration

Comments

(Please contact me if you want to remove your comment.)

OK, I started disabling sites in my Toadfarm configuration until I only had two sites left: campaignwiki.org and oddmuse.org. A few hours I checked back and I had ten farm processes running using 10% CPU each. Still a problem!

Then I disabled campaignwiki.org and tried calling it from the command line. Mojolicious allows me to do that, just like the CGI module does, too. It turns out that I can run a few requests, but when I load RecentChanges, I get a timeout.

Here’s me getting a regular page:

alex@sibirocobombus:~/farm$ perl campaignwiki.pl get '/?action=browse&id=Hauptseite'
[Wed Feb 15 22:39:27 2017] [debug] GET "/"
[Wed Feb 15 22:39:27 2017] [debug] Routing to a callback
[Wed Feb 15 22:39:27 2017] [debug] [CGI:CODE(0x1506388):11783] START /home/alex/farm/wiki.pl
[Wed Feb 15 22:39:28 2017] [debug] 200 OK (0.32674s, 3.061/s)
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
...
</body>
</html>

Here’s me getting RecentChanges:

alex@sibirocobombus:~/farm$ perl campaignwiki.pl get '/RecentChanges'
[Wed Feb 15 22:39:42 2017] [debug] GET "/RecentChanges"
[Wed Feb 15 22:39:42 2017] [debug] Routing to a callback
[Wed Feb 15 22:39:43 2017] [debug] [CGI:CODE(0x1552388):11786] START /home/alex/farm/wiki.pl
[Wed Feb 15 22:39:43 2017] [debug] 200 OK (0.350602s, 2.852/s)
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
...
<div class="rc"><hr />[Wed Feb 15 22:39:58 2017] [debug] Inactivity timeout
[Wed Feb 15 22:40:03 2017] campaignwiki.pl: Problem loading URL "http://127.0.0.1:40810/RecentChanges": Inactivity timeout

When I check top, I see a perl process running eating all available CPU for every time I try to view RecentChanges and got an inactivity timeout. That’s suspiciously like the problem I’ve been experiencing.

So here’s the current status: All the web apps are back online except for campaignwiki.org.

– Alex Schroeder 2017-02-15 21:50 UTC

---

I added some debug warning messages and discovered that the infinite loop was somehow caused by StripRollbacks for the Gridmapper namespace. I looked at the rc.log file and found a single rollback entry. I removed the entire section and everything seems to work. So campaignwiki.org is up again and the bug remains a mystery.

It took me a while to get load back down

– Alex Schroeder 2017-02-15 22:16 UTC