How many hits can be bot hits? I wrote a little Perl script (bot-analyze) to figure out what I’m currently getting:
    ---------------------------------Hits-------Actions
    Total                           50709  100%    1%
    ---------------------------------------------------
    Googlebot                        6600   13%    0%
    msnbot                           3368    6%    1%
    robots                           1342    2%   50%
    robot                             134    0%   24%
    BecomeJPBot                        89    0%    0%
    VoilaBot                           83    0%    0%
    MSRBOT                             76    0%    5%
Is that good or bad?
I’m actually trying to figure out if any bots are misbehaving or if my wiki script is providing good enough “guide posts” to point the bots in the right direction.
The column *Actions* refers to bots hitting URLs of the form `http://example.org/wiki?action=foo`. Those pages have HTML headers saying `<meta name="robots" content="NOINDEX,FOLLOW" />`. The intent was that bots should a. not index them and b. waste little time crawling them. Apparently one of the bots is misbehaving... 😄
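The bot-analyze script itself is not shown here, so the following is only a rough Python sketch of the same kind of tally, not the original Perl. It assumes the Apache combined log format and guesses that a bot is named by the first word in the user agent that contains "bot" (which would explain why Yeti shows up as `robots`, from its help URL). The names `tally` and `report` are made up for this illustration.

```python
import re
from collections import Counter

# Combined log format tail: "request" status bytes "referer" "user agent"
LOG_LINE = re.compile(
    r'"(?P<request>[^"]*)" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)
# Assumption: the bot name is the first word containing "bot" in the user agent.
BOT_NAME = re.compile(r"\w*bot\w*", re.IGNORECASE)


def tally(lines):
    """Return (total hits, hits per bot, action hits per bot)."""
    total = 0
    hits = Counter()
    actions = Counter()
    for line in lines:
        match = LOG_LINE.search(line.rstrip("\n"))
        if not match:
            continue  # skip lines that are not in the expected format
        total += 1
        bot = BOT_NAME.search(match.group("agent"))
        if not bot:
            continue  # not a bot by this (crude) definition
        name = bot.group(0)
        hits[name] += 1
        if "action=" in match.group("request"):
            actions[name] += 1
    return total, hits, actions


def report(total, hits, actions):
    """Format one row per bot: hits, share of all hits, share of action hits."""
    rows = []
    for name, n in hits.most_common():
        share = 100 * n // total          # percentage of all hits
        action_share = 100 * actions[name] // n  # percentage of this bot's hits
        rows.append(f"{name:<30}{n:>7}{share:>6}%{action_share:>6}%")
    return rows
```

Something like `tally(open("logs/access.log"))` would feed it the log. Percentages are truncated rather than rounded in this sketch.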
Here is a sample of `grep action=.*robots logs/access.log` – apparently history and contrib actions!
    61.247.222.53 - - [10/Oct/2008:06:41:54 +0200] "GET /cgi-bin/wiki?action=history;id=EmacsChannel HTTP/1.1" 200 6979 "-" "Yeti/1.0 (NHN Corp.; http://help.naver.com/robots/)"
    61.247.222.53 - - [10/Oct/2008:06:42:39 +0200] "GET /cgi-bin/wiki?action=history;id=BooksAboutEmacs HTTP/1.1" 200 7158 "-" "Yeti/1.0 (NHN Corp.; http://help.naver.com/robots/)"
    61.247.222.54 - - [10/Oct/2008:06:42:53 +0200] "GET /cgi-bin/wiki?action=contrib;id=timid.el HTTP/1.1" 200 5772 "-" "Yeti/1.0 (NHN Corp.; http://help.naver.com/robots/)"
    61.247.222.56 - - [10/Oct/2008:06:43:33 +0200] "GET /cgi-bin/wiki?action=contrib;id=info%2B.el HTTP/1.1" 200 5858 "-" "Yeti/1.0 (NHN Corp.; http://help.naver.com/robots/)"
    61.247.222.55 - - [10/Oct/2008:06:44:21 +0200] "GET /cgi-bin/wiki?action=history;id=CollectionOfEmacsDevelopmentEnvironmentTools HTTP/1.1" 200 7827 "-" "Yeti/1.0 (NHN Corp.; http://help.naver.com/robots/)"
All the links *to* such history actions use `rel="nofollow"`! I think this is enough of a reason to ban this Yeti bot.
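Before banning a bot on these grounds, it may be worth double-checking that every action link really carries `rel="nofollow"`. Here is a minimal Python sketch of such an audit, assuming you have the HTML of a wiki page as a string. The class `ActionLinkAudit` and the function `unprotected_action_links` are hypothetical names for this example and are not part of the wiki script.

```python
from html.parser import HTMLParser


class ActionLinkAudit(HTMLParser):
    """Collect links to action URLs that lack rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.unprotected = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href") or ""
        # rel may hold several space-separated tokens, e.g. "nofollow noopener"
        rel = (attributes.get("rel") or "").lower().split()
        if "action=" in href and "nofollow" not in rel:
            self.unprotected.append(href)


def unprotected_action_links(html):
    """Return the hrefs of action links a well-behaved bot may still follow."""
    audit = ActionLinkAudit()
    audit.feed(html)
    audit.close()
    return audit.unprotected
```

An empty result for a page means that, as far as this check can tell, a bot honoring `rel="nofollow"` has no reason to request the action URLs linked from it.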
In my `.htaccess` file:
    RewriteCond %{HTTP_USER_AGENT} ^Yeti
    RewriteRule ./ /banned_user_agent.html
Check it out using `curl -A "Yeti/1.0 (NHN Corp.; http://help.naver.com/robots/)" http://www.emacswiki.org/emacs/test`. 😄
#Web