2018-08-24 Out of Memory

I’m not sure what to make of this. At some point this evening I was alerted to the fact that all my sites were down. When I came back home, I checked the log files and saw that the first `oom-killer` (OOM means *out of memory*) invocation in `syslog` happened at 17:00:

Aug 24 17:00:18 sibirocobombus kernel: [12013817.356181] /home/alex/farm invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=0, order=0, oom_score_adj=0

And from then on, it just kept bombing. The last one was killed about an hour later:

Aug 24 18:10:25 sibirocobombus kernel: [12018023.798769] Killed process 10775 (/home/alex/farm) total-vm:338508kB, anon-rss:127416kB, file-rss:0kB, shmem-rss:0kB

And from then on, everything was down. Apache was still running but the backend was not:

[Fri Aug 24 23:08:51.927171 2018] [proxy:error] [pid 3594:tid 139759988967168] (111)Connection refused: AH00957: HTTP: attempt to connect to 178.209.50.237:8080 (campaignwiki.org) failed

The backend consists of my wikis and web applications, written using the *Mojolicious* framework and running under *Toadfarm*. That’s the `/home/alex/farm` in those out of memory messages above.

When I checked using `ps aux`, I saw many `farm` processes. I ended up using something like this:

for pid in $(ps -xo pid,command | grep 'farm' | cut -d ' ' -f 1); do kill -9 $pid; done

And for those lines starting with a space (why are you doing this to me, `ps`?), one more time:

for pid in $(ps -xo pid,command | grep 'farm' | cut -d ' ' -f 2); do kill -9 $pid; done
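In hindsight, the two passes (and the fight with `ps` padding) could be avoided by letting `awk` do the field splitting; a sketch, where the `[f]arm` bracket trick keeps the pattern from matching the `awk` process itself:

```shell
# awk splits on runs of whitespace, so the leading spaces that ps pads
# short pids with don't matter: $1 is always the pid.
for pid in $(ps -eo pid,args | awk '/[f]arm/ {print $1}'); do
    kill -9 "$pid"
done
```

(`pkill -9 -f farm` would do the same in one command, at the cost of matching anything with `farm` anywhere in its arguments.)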

And I still don’t see anything suspicious in my log files.

(Munin graphs: Memory, CPU, Load.)

Clearly, something happened between 16:00 and 18:00! Let’s check IP numbers:

$ sudo grep 24/Aug/2018:1[67]: /var/log/apache2/access.log|bin/leech-detector|head
      IP Number  hits bandw. hits% interv. status code distrib.

OK, now we’re getting closer. My `robots.txt` says, amongst many other things:

User-agent: *
Crawl-delay: 20

So an average interval of less than a second between requests is clearly the work of an idiot. Who might it be? `whois` says “Vultr Holdings, LLC”. An ominous name!
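Without `leech-detector`, a rough per-IP tally over the same window works with plain `awk`, too; a sketch, assuming the vhost-prefixed log format shown further down (client IP in the second field):

```shell
# Tally requests per client IP between 16:00 and 17:59 on Aug 24 and
# show the busiest ones first.
sudo awk '/24\/Aug\/2018:1[67]:/ {hits[$2]++}
          END {for (ip in hits) print hits[ip], ip}' \
    /var/log/apache2/access.log | sort -rn | head
```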

Grepping the access log for the IP number in question shows that they were basically trying to download `communitywiki.org`.

Their first request?

[24/Aug/2018:16:59:55 +0200] "GET /wiki/BannedHosts HTTP/1.1" 200 7180 "https://communitywiki.org/wiki/BannedHosts" "Pcore-HTTP/v0.44.0"

Notice that this was just seconds before it all came crashing down!

The first HTTP 503 status code (“Service Unavailable”) is right there, in the same second:

[24/Aug/2018:16:59:55 +0200] "GET /wiki?action=reindex HTTP/1.1" 503 5573 "https://communitywiki.org/wiki?action=reindex" "Pcore-HTTP/v0.44.0"

From then on, there were more and more server errors. But they didn’t care...
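Bucketing the 503s per minute shows how quickly it escalated; a sketch, again assuming the vhost-prefixed log format, with the status code in field 10:

```shell
# Field 5 is "[24/Aug/2018:16:59:55"; substr($5, 2, 17) strips the
# bracket and the seconds, leaving the day-hour-minute bucket.
sudo awk '$10 == 503 {print substr($5, 2, 17)}' \
    /var/log/apache2/access.log | uniq -c
```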

Pcore is a Perl framework.

Stop gap measure for Apache:

RewriteCond "%{HTTP_USER_AGENT}" "Pcore"
RewriteRule ".*" "-" [redirect=403,last]

I am still astonished at the speed. Something about my setup is not cool. I am particularly annoyed because my Monit setup didn’t allow me to recover from this situation. And I’m annoyed because these idiots stopped hitting the site at 17:05:33 and yet my setup wasn’t able to recover: it went ballistic and died *an hour later*. Not good.
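For the record, this is the kind of Monit rule that might have caught it; a sketch only, since the pidfile and the start/stop commands here are made up:

```
check process farm with pidfile /var/run/farm.pid
    start program = "/etc/init.d/farm start"
    stop program  = "/etc/init.d/farm stop"
    # restart when the backend stops answering HTTP on port 8080
    if failed port 8080 protocol http for 3 cycles then restart
    # and when the process (with children) eats too much memory
    if totalmem > 512 MB for 2 cycles then restart
```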

Any ideas? What would you do?

​#Web ​#Administration

Comments

(Please contact me if you want to remove your comment.)

I guess if I wanted to have a solution that also worked for Gopher (see 2018-07-15 Russian Gopher Idiots at Work), I cannot rely on solutions based on Apache, tempting as they are. It would have to be some sort of firewall. @sheogorath recommended taking another look at `fail2ban` (which I already use), so that’s promising. Must look at `/etc/fail2ban/filter.d/`, tomorrow.

Here’s their toot on it:

The `fail2ban` daemon watches specified logs and log files for changes (like you can do with `journalctl -f`) and passes the content to a regex. These regexes are specified in `/etc/fail2ban/filter.d/`.
You only have to enable them in `/etc/fail2ban/jail.local` or `/etc/fail2ban/jail.d`.
When a regex matches the specified number of times within the specified time window, it triggers an action, like banning the IP in the firewall. These actions are specified in `/etc/fail2ban/action.d`.

– Alex Schroeder 2018-08-24 22:45 UTC

---

OK, first attempt...

This is the filter for my Apache access.log file, stored in `/etc/fail2ban/filter.d/alex-apache.conf`:

# Author: Alex Schroeder <alex@gnu.org>

[Definition]
# ANY match in the logfile counts!
# communitywiki.org:443 000.000.000.000 - - [24/Aug/2018:16:59:55 +0200] "GET /wiki/BannedHosts HTTP/1.1" 200 7180 "https://communitywiki.org/wiki\
/BannedHosts" "Pcore-HTTP/v0.44.0"
failregex = ^[^:]+:[0-9]+ <HOST>

# Except cgit!
# alexschroeder.ch:443 0:0:0:0:0:0:0:0 - - [28/Aug/2018:09:14:39 +0200] "GET /cgit/bitlbee-mastodon/objects/9b/ff0c237ace5569aa348f6b12b3c2f95e07fd0d HTTP/1.1" 200 3308 "-" "git/2.18.0"
ignoreregex = ^[^"]*"GET /cgit
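Filters like this are best checked before enabling them: `fail2ban-regex /var/log/apache2/access.log /etc/fail2ban/filter.d/alex-apache.conf` reports how many lines the failregex and the ignoreregex hit. The anchoring can also be simulated with plain `grep`, substituting a rough address class for fail2ban’s `<HOST>` placeholder:

```shell
# <HOST> stands for an IP address in fail2ban filters; [0-9a-fA-F.:]+
# is a crude stand-in that covers both IPv4 and IPv6.
line='communitywiki.org:443 10.0.0.1 - - [24/Aug/2018:16:59:55 +0200] "GET /wiki/BannedHosts HTTP/1.1" 200 7180'
echo "$line" | grep -E '^[^:]+:[0-9]+ [0-9a-fA-F.:]+ ' && echo matched
```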

This is the filter for the log file of my gopher server, stored in `/etc/fail2ban/filter.d/alex-gopher.conf`:

# Author: Alex Schroeder <alex@gnu.org>

[Init]
# 2018/08/25-09:08:55 CONNECT TCP Peer: "[000.000.000.000]:56281" Local: "[000.000.000.000]:70"
datepattern = ^%%Y/%%m/%%d-%%H:%%M:%%S

[Definition]
# ANY match in the logfile counts!
failregex = CONNECT TCP Peer: "\[<HOST>\]:\d+"

And this sets up a few jails using these filters, stored in `/etc/fail2ban/jail.d/alex.conf`:

[alex-apache]
enabled = true
port    = http,https
logpath = %(apache_access_log)s
findtime = 40
maxretry = 20

[alex-gopher]
enabled = true
port    = 70
logpath = /home/alex/farm/gopher-server.log
findtime = 40
maxretry = 20

[alex-gopher-ssl]
enabled = true
filter  = alex-gopher
port    = 7443
logpath = /home/alex/farm/gopher-server-ssl.log
findtime = 40
maxretry = 20

Reload and show status:

# fail2ban-client reload
# fail2ban-client status
Status
|- Number of jail:	4
`- Jail list:	alex-apache, alex-gopher, alex-gopher-ssl, sshd

We’ll see how it goes!

– Alex Schroeder 2018-08-25 07:35 UTC

---

Well, the situation is pretty much under control, now. I see that `fail2ban` has between zero and four hosts banned. That looks reasonable, actually.

– Alex Schroeder 2018-08-26 08:07 UTC

---

The old `fail2ban` that comes with Debian stable doesn’t support IPv6 addresses, but the newer one does.

1. search for fail2ban package

2. follow the links to a version 0.10 or better

3. download file (in my case that meant `wget http://ftp.ch.debian.org/debian/pool/main/f/fail2ban/fail2ban_0.10.2-2_all.deb`)

4. install (in my case using `dpkg -i fail2ban_0.10.2-2_all.deb`)

I have made one change to my `/etc/fail2ban/fail2ban.conf`: I just want to know about banning and unbanning, not about all the IP numbers seen.

loglevel = NOTICE

– Alex Schroeder 2018-08-27 11:28 UTC

---

I’ve had to fiddle with the filter again and again. Loading the Munin page, for example, generated enough requests for an instant ban. So now I’m trying to find a good balance where requests for most simple files aren’t considered for banning. This is my current `/etc/fail2ban/filter.d/alex-apache.conf`:

# Author: Alex Schroeder <alex@gnu.org>

[Definition]
# ANY match in the logfile counts!
# communitywiki.org:443 000.000.000.000 - - [24/Aug/2018:16:59:55 +0200] "GET /wiki/BannedHosts HTTP/1.1" 200 7180 "https://communitywiki.org/wiki/BannedHosts" "Pcore-HTTP/v0.44.0"
failregex = ^[^:]+:[0-9]+ <HOST>

# Except cgit, css files, images...
# alexschroeder.ch:443 0:0:0:0:0:0:0:0 - - [28/Aug/2018:09:14:39 +0200] "GET /cgit/bitlbee-mastodon/objects/9b/ff0c237ace5569aa348f6b12b3c2f95e07fd0d HTTP/1.1" 200 3308 "-" "git/2.18.0"
ignoreregex = ^[^"]*"GET /(robots\.txt |favicon\.ico |[^/ ]+\.(css|js) |cgit/|css/|fonts/|pics/|1pdc/|gallery/|static/|munin/)
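A quick sanity check of what this ignoreregex skips, using `grep -E` on two sample request lines (note the escaped dot in `[^/ ]+\.(css|js)`, needed to match a literal dot):

```shell
# The first line (a css file) should be ignored, the second (a wiki
# page) should still count toward banning; only the first is printed.
printf '%s\n' \
  'a.ch:443 10.0.0.1 - - [x] "GET /css/default.css HTTP/1.1" 200 10' \
  'a.ch:443 10.0.0.1 - - [x] "GET /wiki/Home HTTP/1.1" 200 10' |
grep -E '^[^"]*"GET /(robots\.txt |favicon\.ico |[^/ ]+\.(css|js) |cgit/|css/|fonts/|pics/|1pdc/|gallery/|static/|munin/)'
```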

– Alex Schroeder 2018-09-04 10:25 UTC