The mysterious problems with my website continue. What the hell happened?
Memory for “apps” went up to 16G at around 21:00 and then almost everything was killed.
Let’s see what `grep oom-killer /var/log/syslog` has to say:
    Nov 6 21:05:03 sibirocobombus kernel: [18425702.447121] /home/alex/farm invoked oom-killer: ...
    Nov 6 21:05:09 sibirocobombus kernel: [18425708.244062] munin-update invoked oom-killer: ...
    Nov 6 21:05:09 sibirocobombus kernel: [18425708.925423] /home/alex/farm invoked oom-killer: ...
    Nov 6 21:05:13 sibirocobombus kernel: [18425712.660969] apache2 invoked oom-killer: ...
    Nov 6 21:05:14 sibirocobombus kernel: [18425713.857753] fail2ban-server invoked oom-killer: ...
    Nov 6 21:05:15 sibirocobombus kernel: [18425714.904769] fw_conntrack invoked oom-killer: ...
    Nov 6 21:05:18 sibirocobombus kernel: [18425717.168963] apache_accesses invoked oom-killer: ...
    Nov 6 21:05:18 sibirocobombus kernel: [18425717.985385] apache_accesses invoked oom-killer: ...
    Nov 6 21:05:20 sibirocobombus kernel: [18425719.907638] vmstat invoked oom-killer: ...
    Nov 6 21:05:23 sibirocobombus kernel: [18425722.207184] jbd2/vda1-8 invoked oom-killer: ...
    Nov 6 21:05:24 sibirocobombus kernel: [18425723.800850] memory invoked oom-killer: ...
    Nov 6 21:05:27 sibirocobombus kernel: [18425726.582062] exiqgrep invoked oom-killer: ...
    Nov 6 21:05:29 sibirocobombus kernel: [18425728.479563] fail2ban-server invoked oom-killer: ...
    Nov 6 21:05:31 sibirocobombus kernel: [18425730.862151] fail2ban-client invoked oom-killer: ...
    Nov 6 21:05:31 sibirocobombus kernel: [18425731.049454] /home/alex/camp invoked oom-killer: ...
OK, and what happened back then?
Grepping `/var/log/apache2/access.log` for this minute does not reveal anything interesting.
My gopher logs show just two hits, no issues.
Checking the toadfarm log file shows the first problem at 21:05:00: “fork failed: Cannot allocate memory”. Hm. The other problem is at 21:05:11: “Worker 4166 has no heartbeat (30 seconds), restarting”. Could be anything? Grepping for “Inactivity timeout” gives me 36 hits from 21:04:56 to 21:05:12. The end must have come quickly.
Hm. Looking at the toadfarm log file, the number of hits to the website of my Arabic teacher in just two minutes is pretty amazing.
     arabisch-lernen.org        388  56%
           emacswiki.org        273  40%
        campaignwiki.org         14   2%
                                  6   0%
This is `grep '21:0[45]:.*' /home/alex/farm/farm*.log | /home/alex/bin/site-grouping` and the second script is very simple:
    #!/usr/bin/env perl
    # count requests per site in the log lines read from stdin
    while (<STDIN>) {
        m!GET https?://([^:]+):!;   # extract the host name from the request line
        $site{$1}++;
        $total++;
    }
    # print the sites sorted by number of hits, with a percentage of the total
    @result = sort { $site{$b} <=> $site{$a} } keys %site;
    foreach $label (@result) {
        printf "%20s %10d %3d%%\n", $label, $site{$label}, 100 * $site{$label} / $total;
    }
OK, so who’s calling these two sites?
Note that this is toadfarm reporting. At the same time, Apache isn’t reporting as much: `grep "21:0[45]" /var/log/apache2/access.log | wc -l` says only 183 requests. When I look at the IP numbers, I don’t see a single IP number doing anything wrong. After a while, though, I notice that a lot of the IP numbers start with a common prefix, so I’m going to count them: 62 hits in two minutes from ChinaNetCenter Ltd. Gaaaah. Looking at the URLs these IP numbers are hitting, it’s full of admin actions, diffs, and edit actions. Clearly a misbehaving spider. I think I’m going to ban them all, for now.
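Something like this would do the counting; a quick sketch, assuming the default combined log format with the client IP in the first field:

    grep "21:0[45]" /var/log/apache2/access.log \
      | awk '{print $1}' | cut -d. -f1-2 \
      | sort | uniq -c | sort -rn | head

Grouping on the first two octets is only a crude stand-in for the actual networks involved, but it’s enough to make a shared prefix stand out.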
Soon, I’m going to follow these instructions. Grrrrr! Unexpected DDOS: Blocking China with ipset and iptables. (Just kidding. I hate that!)
Gah. Problem. `ipset add blacklist 220.243.0.0/16` ended with an error. The list is full! And I cannot just `ipset del blacklist 220.243.0.0/16` because that gives me another error saying the element isn’t added. Not even `for ip in $(ipset list blacklist|grep 220); do ipset del blacklist $ip; done` seems to work. Or at least, judging from a few calls to `ipset list blacklist | wc -l` it’s not very fast. What now?
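Two things worth trying, going by the ipset man page: `-exist` makes `del` ignore entries that aren’t actually in the set (probably the reason the plain `del` bailed out), and `flush` throws away the entire set in one call, if losing all the existing entries is acceptable. A sketch, not tested on this server:

    # don't abort on "element isn't added" errors
    ipset -exist del blacklist 220.243.0.0/16
    # or give up on selective deletion and empty the whole set at once
    ipset flush blacklist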
I guess I will have to look at my setup again. I probably need a different type? Its type is `hash:ip` and its maxelem is 65536. That would have to increase, of course. Oh well. Perhaps I’ll leave it for some other time. We’ll see how far we’ll get by not doing anything.
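If I ever do get around to it, the rough plan would look something like this: a `hash:net` set stores whole networks as single entries (unlike `hash:ip`, which has to store every address separately), and a bigger maxelem leaves room to spare. The iptables rule here is an assumption about how the set is hooked in, not a copy of my actual firewall script:

    # create a roomier set that stores networks instead of individual addresses
    ipset create blacklist2 hash:net family inet maxelem 131072
    ipset add blacklist2 220.243.0.0/16
    # point the firewall at the new set, then retire the old rule and set
    iptables -I INPUT -m set --match-set blacklist2 src -j DROP
    iptables -D INPUT -m set --match-set blacklist src -j DROP
    ipset destroy blacklist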
#Toadfarm #Administration
(Please contact me if you want to remove your comment.)
⁂
Over on Mastodon, @oliof has suggested taking a look at ulimit.
– Alex Schroeder 2018-05-09 19:07 UTC
---
How is 388 requests in 2 minutes a problem? That’s no more than 4 requests per second - shouldn’t the webserver be able to handle that?
Is there some kind of memory leak or pathological case that the Chinese crawlers happen to hit? The sudden spike in memory use is not just because of 4 hits/s, is it?
Can you reproduce the memory use, maybe on a local machine replaying the requests from the log around the time of the problem (or just bombarding it with requests)?
Just random questions from my armchair 😃
– Adam 2018-11-09 21:46 UTC
---
Heh. 😄 The main problem is that we’re not talking about static HTML pages. If somebody does a search on this wiki, there’s a grep over 7000 pages, and then the wiki code looks at the matches, renders the pages, highlights the phrases, and so on. That means the process takes a while. If a search takes 10 seconds and I get 4 searches per second (I doubt this was the explanation, just trying to explain how the server is prone to accidental denial of service attacks), then at any given moment I have 40 processes grepping through 7000 files each. And then it dies, because this is a tiny 2G virtual machine I’m renting somewhere, managed by myself and nobody else. At least that’s what I suspect.
That’s why my `/robots.txt` says `Crawl-delay: 20`.
Replaying the requests from the log is a good idea. I should do it the next time this happens. And I should make sure to compare the memory used with the number of processes running at the time.
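A sketch of what that replay might look like, assuming the standard combined log format and a test copy of the site listening on localhost:8080 (both of these are assumptions, not my actual setup):

    # pull the request paths from the two critical minutes and replay them,
    # eight at a time, against the test instance
    grep "06/Nov/2018:21:0[45]" /var/log/apache2/access.log \
      | awk '{print $7}' \
      | xargs -P 8 -I{} curl -s -o /dev/null "http://localhost:8080{}"

Watching `vmstat 5` (or the Munin graphs) next to it should show whether memory climbs with the number of concurrent requests.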
– Alex Schroeder 2018-11-10 00:30 UTC
---
More armchair comments: Could you exclude the search URL in robots.txt?
A more laborious solution, if searching turns out to be the problem, is to index the pages using, say, Xapian (the Perl bindings work nicely).
Anyway, good luck in finding the cause!
– Adam 2018-11-10 13:49 UTC
---
As they are already ignoring the crawl delay, I think it is safe to say that they either don’t know or don’t care. My `robots.txt` contains the line `Disallow: /wiki?`, which should take care of anything that is not a page view. But it’s being ignored.
A long time ago I used a Perl module that indexed the text and allowed scoring of results - but eventually it turned out that brute forcing search using grep was actually faster. But that was years ago and on a different setup. Perhaps I should revisit this.
These days I am more interested in blocking misbehaving agents. The old solution limits visits from the same IP number (surge protection), and I have `fail2ban` watching my log files to keep an eye on all the other services (`cgit` and the like). But if the requests come from all over a subnet, these defenses don’t work, unfortunately.
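One thing I might look into, sketched here with made-up numbers: the `hashlimit` match for iptables can rate-limit per source network instead of per address, which is exactly the case the per-IP surge protection misses:

    # drop port 80 traffic when a whole /16 source network goes above the limit
    iptables -I INPUT -p tcp --dport 80 -m hashlimit \
      --hashlimit-name subnet-surge --hashlimit-mode srcip --hashlimit-srcmask 16 \
      --hashlimit-above 5/second --hashlimit-burst 20 -j DROP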
– Alex Schroeder 2018-11-10 18:09 UTC