By chance, I run my leech-detector script and find the following:
    aschroeder@thinkmo:~$ leech-detector < logs/access.log | head
          IP Number   hits  bandw.  hits%  interv.  status code distrib.
     184.82.236.206  14368    118K    17%     4.8s  301 (49%), 200 (49%), 404 (0%), 302 (0%), 403 (0%)
     125.199.78.207   3419     11K     4%    20.2s  200 (52%), 404 (43%), 302 (2%), 400 (1%), 301 (0%), 304 (0%)
    ...
What the hell is this guy doing causing 17% of all my hits?
    aschroeder@thinkmo:~$ tail -f logs/access.log | grep 184.82.236.206
    184.82.236.206 - - [13/May/2012:01:44:12 +0200] "GET /emacs?action=browse;id=icicles.el;revision=835 HTTP/1.0" 301 447 "http://www.emacswiki.org/emacs/?action=rc&all=1&showedit=1&from=1&rcuseronly=DrewAdams" "Wget/1.12 (linux-gnu)"
    184.82.236.206 - - [13/May/2012:01:44:16 +0200] "GET /emacs/?action=browse;id=icicles.el;revision=835 HTTP/1.0" 200 127350 "http://www.emacswiki.org/emacs/?action=rc&all=1&showedit=1&from=1&rcuseronly=DrewAdams" "Wget/1.12 (linux-gnu)"
    184.82.236.206 - - [13/May/2012:01:44:21 +0200] "GET /emacs?action=browse;diff=2;id=icicles-cmd2.el;diffrevision=55 HTTP/1.0" 301 462 "http://www.emacswiki.org/emacs/?action=rc&all=1&showedit=1&from=1&rcuseronly=DrewAdams" "Wget/1.12 (linux-gnu)"
    184.82.236.206 - - [13/May/2012:01:44:25 +0200] "GET /emacs/?action=browse;diff=2;id=icicles-cmd2.el;diffrevision=55 HTTP/1.0" 200 482243 "http://www.emacswiki.org/emacs/?action=rc&all=1&showedit=1&from=1&rcuseronly=DrewAdams" "Wget/1.12 (linux-gnu)"
Ahhh! A stupid leech using *wget* to pull the entire site, following all the links and ignoring the rel="nofollow" attributes... Maybe a dude who didn’t read the WikiDownload page. It also looks to me as if these links are listed in the site’s robots.txt file.
Oh well. The solution, unfortunately, seems to involve editing `cgi-bin/.htaccess` and adding the following:
    # using wget to get everything including actions, old stuff, etc.
    Deny from 184.82.236.206
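To check whether the block is biting, something like the following might do. It is only a rough sketch, not part of leech-detector: `status_by_ip` is a name I made up for illustration, and it assumes the usual Apache combined log format, where the client IP is the first field and the status code the ninth (as in the log lines above).

```shell
# Hypothetical helper: count responses per status code for one client IP.
# Assumes Apache combined log format: field 1 = client IP, field 9 = status.
status_by_ip() {
    ip=$1
    log=$2
    # Tally the status codes of matching lines, then list the most frequent first.
    awk -v ip="$ip" '$1 == ip { count[$9]++ }
        END { for (s in count) print count[s], s }' "$log" | sort -rn
}
```

Calling `status_by_ip 184.82.236.206 logs/access.log` prints one line per status code with its count. Since the log also contains the requests from before the block, it makes sense to run this on a fresh log or on the tail of the current one; new requests from that address should then show up as 403.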
#Wikis #Bots