2012-05-12 Stupid Leeches

By chance, I run my leech-detector script and find the following:

aschroeder@thinkmo:~$ leech-detector < logs/access.log | head
           IP Number       hits bandw. hits% interv. status code distrib.
      184.82.236.206      14368   118K  17%    4.8s  301 (49%), 200 (49%), 404 (0%), 302 (0%), 403 (0%)
      125.199.78.207       3419    11K   4%   20.2s  200 (52%), 404 (43%), 302 (2%), 400 (1%), 301 (0%), 304 (0%)
                 ...

What the hell is this guy doing, causing 17% of all my hits?
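For anybody without the script at hand: the 17% is simply that IP number's share of all the requests in the log. The following is a rough Python sketch of that kind of tally, not the actual leech-detector. The names `summarize` and `format_report` are made up for illustration, the regular expression assumes Apache's common or combined log format, and it reports total bytes per IP number, which may not be how the real script defines its bandwidth column. The interval column is left out.

```python
import re
from collections import Counter, defaultdict

# Start of an Apache common/combined log line:
# IP, identity, user, [timestamp], "request", status, size.
# Assumption: the request string contains no embedded double quotes.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) (\d+|-)')

def summarize(lines):
    """Return {ip: (hits, bytes_sent, share_of_hits, Counter of status codes)}."""
    hits = Counter()
    sent = Counter()
    statuses = defaultdict(Counter)
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip anything that does not look like a log entry
        ip, status, size = m.groups()
        hits[ip] += 1
        sent[ip] += 0 if size == "-" else int(size)  # "-" means no body was sent
        statuses[ip][status] += 1
    total = sum(hits.values())
    return {ip: (n, sent[ip], n / total, statuses[ip]) for ip, n in hits.items()}

def format_report(report, top=10):
    """Format the busiest IP numbers, most hits first, one line each."""
    rows = sorted(report.items(), key=lambda kv: kv[1][0], reverse=True)[:top]
    out = []
    for ip, (n, nbytes, share, codes) in rows:
        dist = ", ".join(f"{code} ({100 * k // n}%)" for code, k in codes.most_common())
        out.append(f"{ip:>20} {n:>10} {share:>5.0%}  {dist}")
    return out
```

Calling `format_report(summarize(open("logs/access.log")))` would give one line per IP number, with the heaviest hitters at the top.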

aschroeder@thinkmo:~$ tail -f logs/access.log | grep 184.82.236.206
184.82.236.206 - - [13/May/2012:01:44:12 +0200] "GET /emacs?action=browse;id=icicles.el;revision=835 HTTP/1.0" 301 447 "http://www.emacswiki.org/emacs/?action=rc&all=1&showedit=1&from=1&rcuseronly=DrewAdams" "Wget/1.12 (linux-gnu)"
184.82.236.206 - - [13/May/2012:01:44:16 +0200] "GET /emacs/?action=browse;id=icicles.el;revision=835 HTTP/1.0" 200 127350 "http://www.emacswiki.org/emacs/?action=rc&all=1&showedit=1&from=1&rcuseronly=DrewAdams" "Wget/1.12 (linux-gnu)"
184.82.236.206 - - [13/May/2012:01:44:21 +0200] "GET /emacs?action=browse;diff=2;id=icicles-cmd2.el;diffrevision=55 HTTP/1.0" 301 462 "http://www.emacswiki.org/emacs/?action=rc&all=1&showedit=1&from=1&rcuseronly=DrewAdams" "Wget/1.12 (linux-gnu)"
184.82.236.206 - - [13/May/2012:01:44:25 +0200] "GET /emacs/?action=browse;diff=2;id=icicles-cmd2.el;diffrevision=55 HTTP/1.0" 200 482243 "http://www.emacswiki.org/emacs/?action=rc&all=1&showedit=1&from=1&rcuseronly=DrewAdams" "Wget/1.12 (linux-gnu)"

Ahhh! A stupid leech using *wget* to pull the entire site, following all the links, ignoring the rel="nofollow" rules... Maybe a dude who didn’t read the WikiDownload page. It also looks to me as if the links are listed in the site’s robots.txt file.

robots.txt

Oh well. The solution, unfortunately, seems to involve editing `cgi-bin/.htaccess` and adding the following:

# using wget to get everything including actions, old stuff, etc.
Deny from 184.82.236.206

​#Wikis ​#Bots