2008-10-09 How Many Bot Hits

How many hits can be bot hits? I wrote a little Perl script (bot-analyze) to figure out what I’m currently getting:

bot-analyze

    ---------------------------------Hits-------Actions
                         Total      50709   100%     1%
    ---------------------------------------------------
                     Googlebot       6600    13%     0%
                        msnbot       3368     6%     1%
                        robots       1342     2%    50%
                         robot        134     0%    24%
                   BecomeJPBot         89     0%     0%
                      VoilaBot         83     0%     0%
                        MSRBOT         76     0%     5%

Is that good or bad?

I’m actually trying to figure out if any bots are misbehaving or if my wiki script is providing good enough “guide posts” to point the bots in the right direction.

The column *Actions* refers to bots hitting URLs of the form `http://example.org/wiki?action=foo`. Those pages have HTML headers saying `<meta name="robots" content="NOINDEX,FOLLOW" />`. The intent was that bots should (a) not index them and (b) waste little time crawling them. Apparently one of the bots is misbehaving... 😄
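
To make the two percentage columns concrete, here is a minimal sketch of this kind of counting. It is written in Python and is not the actual bot-analyze script (which isn't shown here). It assumes the Apache combined log format, uses a hypothetical heuristic for bot names (the first word in the user agent containing "bot"), and assumes the *Actions* percentage is relative to each bot's own hits. Percentages are truncated, which appears consistent with the table above (3368 of 50709 shows as 6%).

```python
import re
from collections import Counter

# Combined log format: the request is followed by status, size, referrer,
# and finally the user agent as the last quoted field on the line.
LINE_RE = re.compile(
    r'"(?P<request>[^"]*)" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"\s*$')

# Hypothetical heuristic: the first word in the user agent containing "bot".
BOT_RE = re.compile(r'[A-Za-z]*bot[A-Za-z]*', re.IGNORECASE)


def pct(part, whole):
    """Integer percentage, truncated; 0 if there is nothing to divide by."""
    return 100 * part // whole if whole else 0


def analyze(lines):
    """Return (total hits, total action hits, hits per bot, action hits per bot)."""
    hits, actions = Counter(), Counter()
    total = total_actions = 0
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # skip lines that are not in the expected format
        total += 1
        is_action = "action=" in m.group("request")
        total_actions += is_action
        bot = BOT_RE.search(m.group("agent"))
        if bot:
            hits[bot.group(0)] += 1
            actions[bot.group(0)] += is_action
    return total, total_actions, hits, actions


def format_report(lines):
    """Format one row per bot: name, hits, share of all hits, share of action hits."""
    total, total_actions, hits, actions = analyze(lines)
    row = "%30s %10d %5d%% %5d%%"
    out = [row % ("Total", total, pct(total, total), pct(total_actions, total))]
    for name, n in hits.most_common():
        out.append(row % (name, n, pct(n, total), pct(actions[name], n)))
    return out
```

Calling `format_report(open("logs/access.log"))` would return the rows as strings. Note that with this heuristic, a user agent that merely contains a URL such as `http://help.naver.com/robots/` ends up in a row called "robots".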

Here's a sample of the output of `grep action=.*robots logs/access.log`. Apparently these are history and contrib actions!

    61.247.222.53 - - [10/Oct/2008:06:41:54 +0200] "GET /cgi-bin/wiki?action=history;id=EmacsChannel HTTP/1.1" 200 6979 "-" "Yeti/1.0 (NHN Corp.; http://help.naver.com/robots/)"
    61.247.222.53 - - [10/Oct/2008:06:42:39 +0200] "GET /cgi-bin/wiki?action=history;id=BooksAboutEmacs HTTP/1.1" 200 7158 "-" "Yeti/1.0 (NHN Corp.; http://help.naver.com/robots/)"
    61.247.222.54 - - [10/Oct/2008:06:42:53 +0200] "GET /cgi-bin/wiki?action=contrib;id=timid.el HTTP/1.1" 200 5772 "-" "Yeti/1.0 (NHN Corp.; http://help.naver.com/robots/)"
    61.247.222.56 - - [10/Oct/2008:06:43:33 +0200] "GET /cgi-bin/wiki?action=contrib;id=info%2B.el HTTP/1.1" 200 5858 "-" "Yeti/1.0 (NHN Corp.; http://help.naver.com/robots/)"
    61.247.222.55 - - [10/Oct/2008:06:44:21 +0200] "GET /cgi-bin/wiki?action=history;id=CollectionOfEmacsDevelopmentEnvironmentTools HTTP/1.1" 200 7827 "-" "Yeti/1.0 (NHN Corp.; http://help.naver.com/robots/)"
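
Instead of eyeballing the grep output, the action types can be tallied. The following is a rough Python sketch, assuming the combined log format shown in the sample; the function name `actions_by_agent` and its `agent_substring` parameter are made up for illustration. It is slightly stricter than the grep: it only counts lines where `action=` is in the request and the substring is in the user agent field.

```python
import re
from collections import Counter

# Matches the action name in URLs like /cgi-bin/wiki?action=history;id=Foo
ACTION_RE = re.compile(r'[?;&]action=([A-Za-z0-9_-]+)')
# In the combined log format the quoted fields are: request, referrer, user agent.
QUOTED_RE = re.compile(r'"([^"]*)"')


def actions_by_agent(lines, agent_substring="robots"):
    """Count requested actions for user agents containing agent_substring."""
    counts = Counter()
    for line in lines:
        quoted = QUOTED_RE.findall(line)
        if len(quoted) < 3:
            continue  # not a combined log format line
        request, agent = quoted[0], quoted[-1]
        if agent_substring not in agent:
            continue
        m = ACTION_RE.search(request)
        if m:
            counts[m.group(1)] += 1
    return counts
```

For the five sample lines, this counts three `history` and two `contrib` requests.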

All the links *to* such history actions carry `rel="nofollow"`! I think this is enough of a reason to ban this Yeti bot.

In my `.htaccess` file:

    RewriteCond %{HTTP_USER_AGENT} ^Yeti
    RewriteRule ./ /banned_user_agent.html

Check it out using `curl -A "Yeti/1.0 (NHN Corp.; http://help.naver.com/robots/)" http://www.emacswiki.org/emacs/test`. 😄
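
The `^` in the condition anchors the match to the start of the user agent string, and without an `[NC]` flag the match is case-sensitive. A small Python sketch of just that condition (not of mod_rewrite itself) shows which strings it would catch; the function name `is_banned` and the second user agent string are made up for illustration.

```python
import re

# Same pattern as in the RewriteCond: "Yeti" at the very start, case-sensitive.
BANNED = re.compile(r'^Yeti')


def is_banned(user_agent):
    """True if the user agent would satisfy the RewriteCond pattern."""
    return BANNED.search(user_agent) is not None


# The user agent from the log sample is caught ...
print(is_banned("Yeti/1.0 (NHN Corp.; http://help.naver.com/robots/)"))
# ... but a hypothetical variant that does not start with "Yeti" is not.
print(is_banned("Mozilla/5.0 (compatible; Yeti/1.0)"))
```

So a bot that changes the start of its user agent string would slip past this rule; dropping the `^` would match "Yeti" anywhere in the string.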


​#Web