I have this little script called leech-detector that I run on log files every now and then:
```
aschroeder@thinkmo:~$ leech-detector < logs/access.log | head
66.249.65.101    2654  11K  9%   8.7s  200 (86%), 404 (7%), 302 (2%), 304 (2%), 301 (0%), 501 (0%)
72.30.142.82     1055  14K  3%  22.0s  200 (91%), 304 (2%), 404 (2%), 301 (2%), 302 (0%)
87.250.253.241    979  20K  3%  23.7s  200 (96%), 404 (1%), 301 (1%)
67.218.116.134    868  12K  2%  26.7s  200 (85%), 301 (9%), 404 (3%), 302 (1%), 403 (0%)
72.14.199.225     413  15K  1%  56.1s  200 (50%), 304 (41%), 301 (7%)
67.195.110.173    293  18K  1%  78.6s  200 (76%), 304 (7%), 301 (7%), 404 (6%), 302 (2%)
```
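Roughly speaking, the script just aggregates the access log per client IP. A stripped-down Perl sketch of that idea (not the actual script; it assumes Apache's common log format and only reproduces the hit count, the share of all hits, and the status code distribution) might look like this:

```perl
#!/usr/bin/perl
# Sketch: summarize an Apache access log per client IP.
# Columns: IP, hits, share of all hits, status code distribution.
use strict;
use warnings;

my (%hits, %status, $total);
while (<STDIN>) {
  next unless m/^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) /;
  my ($host, $code) = ($1, $2);
  $hits{$host}++;
  $status{$host}{$code}++;
  $total++;
}

exit unless $total;
for my $host (sort { $hits{$b} <=> $hits{$a} } keys %hits) {
  my $codes = join ', ',
    map  { sprintf '%s (%d%%)', $_, 100 * $status{$host}{$_} / $hits{$host} }
    sort { $status{$host}{$b} <=> $status{$host}{$a} } keys %{ $status{$host} };
  printf "%-16s %6d %3d%% %s\n",
    $host, $hits{$host}, 100 * $hits{$host} / $total, $codes;
}
```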
So, what do we have? Running whois on these IP numbers shows that they all belong to search engines.
Those six crawlers account for about 19% of all hits, so I guess that assuming 20% search engine traffic is still a good estimate.
Actually, when grouping by user agent instead of IP number using bot-analyze, we get an even higher number. Apparently the Microsoft bot is spread over so many different IP numbers that it doesn’t show up in the list above:
```
aschroeder@thinkmo:~$ bot-analyze < logs/access.log | head
---------------------------------Hits-------Actions
Total                       29979     100%        0%
---------------------------------------------------
msnbot                       7437      24%        1%
Googlebot                    2817       9%        0%
robot                         950       3%        3%
Exabot                        137       0%        2%
robots                         49       0%        0%
qihoobot                       26       0%       30%
MSNBOT                         22       0%        0%
```
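bot-analyze is equally simple: group the hits by the bot’s user agent, and for the Actions column count how many of those hits were for an `action=` URL, i.e. the expensive stuff. A rough sketch of the idea (again, not the actual script; it assumes the combined log format with the user agent in the last quoted field):

```perl
#!/usr/bin/perl
# Sketch: group hits by a bot name taken from the user agent string and
# count how many of them are "actions", i.e. requests with action= in the URL.
use strict;
use warnings;

my (%hits, %actions);
my ($total, $action_total) = (0, 0);
while (<STDIN>) {
  # combined log format: ... "METHOD URL PROTO" status bytes "referrer" "agent"
  next unless m/"(?:GET|HEAD|POST) (\S+)[^"]*" \d{3} \S+ "[^"]*" "([^"]*)"/;
  my ($url, $agent) = ($1, $2);
  next unless $agent =~ /(\w*bot\w*)/;   # crude: anything calling itself a bot
  my $bot = $1;
  $hits{$bot}++;
  $total++;
  if ($url =~ /\baction=/) {
    $actions{$bot}++;
    $action_total++;
  }
}

exit unless $total;
printf "%-20s %6d %4d%% %4d%%\n", 'Total', $total, 100, 100 * $action_total / $total;
for my $bot (sort { $hits{$b} <=> $hits{$a} } keys %hits) {
  printf "%-20s %6d %4d%% %4d%%\n", $bot, $hits{$bot},
    100 * $hits{$bot} / $total, 100 * ($actions{$bot} || 0) / $hits{$bot};
}
```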
That’s a full 33% of my hits going to search engines. Then again, I pay for bandwidth, not hits. Should I rewrite this to show bandwidth?
I wonder how bigger sites handle this. Perhaps I could decide that 3% of hits is enough for bots, and therefore serve a 503 Service Unavailable response to 10 out of 11 bot requests? To implement this, my script would still have to start up, however, eating CPU cycles and disk accesses.
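Something like this at the very top of the CGI script, before any expensive work, would do it. This is just a sketch, and the user agent test is as crude as it looks:

```perl
# Sketch: turn away 10 out of 11 bot requests before doing any real work.
# The CGI still has to start up for this, of course.
if (($ENV{HTTP_USER_AGENT} || '') =~ /bot|crawler|spider/i
    and rand() < 10 / 11) {
  print "Status: 503 Service Unavailable\r\n";
  print "Retry-After: 600\r\n";
  print "Content-Type: text/plain; charset=UTF-8\r\n";
  print "\r\n";
  print "Too many requests, please slow down.\n";
  exit;
}
```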
Or is 33% a sign of bad design decisions I made? Am I sending the bots around in loops, changing the URLs all the time, making them think they have discovered new pages they have not indexed yet?
#Web #Search
⁂
I can definitely identify with the experience. There’s nothing easier to get wrong than how robots harvest a CGI application. I remember looking at someone’s Perl-based Web calendar that had “next” and “previous” buttons that went on to infinity. There’s a lot of fear about SQL injection, but Web interfaces that fail to halt can wreak havoc.
*Should I rewrite this to show bandwidth?*
Probably, because there’s a chance that Web robots DTRT and use the HTTP HEAD method to see whether page updates are available. According to RFC 2616, section 9.4:
*The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. The metainformation contained in the HTTP headers in response to a HEAD request SHOULD be identical to the information sent in response to a GET request. This method can be used for obtaining metainformation about the entity implied by the request without transferring the entity-body itself. This method is often used for testing hypertext links for validity, accessibility, and recent modification.*
Do the logs suggest they use this method in their hits?
Tooling around with Wget I noticed that timestamping doesn’t work in Oddmuse. Perhaps bots do more sophisticated things with the ETag, but it seems like it would be smart to send the Last-Modified header for Oddmuse:Caching purposes.
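The CGI side of it is not much code. Here is a sketch of the simple case (not Oddmuse code; it assumes an English locale for the date names and only recognizes a date it handed out itself):

```perl
# Sketch: send Last-Modified and answer a matching If-Modified-Since with 304.
use POSIX qw(strftime);

my $ts = time - 3600;   # placeholder: use the page's real modification time
my $last_modified = strftime '%a, %d %b %Y %H:%M:%S GMT', gmtime $ts;

if (($ENV{HTTP_IF_MODIFIED_SINCE} || '') eq $last_modified) {
  print "Status: 304 Not Modified\r\n";
  print "\r\n";
  exit;
}

print "Last-Modified: $last_modified\r\n";
print "Content-Type: text/html; charset=UTF-8\r\n";
print "\r\n";
# ... regular page output follows
```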
– AaronHawley 2009-10-15 16:46 UTC
---
And of course the Oddmuse calendar also goes to infinity. But I hope I’m using robot meta tags and nofollow attributes to prevent well-behaved bots from following the links. Maybe I should check, just to make sure this works as intended.
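What I mean is markup along these lines on the action pages and links that bots should leave alone (just an illustration, not a verbatim copy of what Oddmuse prints):

```html
<!-- in the head of action pages that should not be crawled: -->
<meta name="robots" content="noindex,nofollow">
<!-- and on individual links, e.g. to the page history: -->
<a href="/emacs?action=history;id=SomePage" rel="nofollow">View other revisions</a>
```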
As far as I understand it, the ETag is actually superior to the Last-Modified logic. But perhaps misbehaved bots implement only half the RFC. Also something to check!
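The logic itself is not hard. A sketch of what the CGI side would have to do (not the actual Oddmuse code, and it ignores the fact that If-None-Match may list several tags):

```perl
# Sketch: derive an ETag from whatever identifies the current page version
# and answer matching conditional requests with 304 Not Modified.
my ($id, $revision, $ts) = ('SomePage', 42, time);  # placeholders
my $etag = qq{"$id-$revision-$ts"};

if (($ENV{HTTP_IF_NONE_MATCH} || '') eq $etag) {
  print "Status: 304 Not Modified\r\n";
  print "ETag: $etag\r\n";
  print "\r\n";
  exit;
}

print "ETag: $etag\r\n";
print "Content-Type: text/html; charset=UTF-8\r\n";
print "\r\n";
# ... regular page output follows
```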
As for HEAD requests, I have sad news. Here’s some perspective:
```
aschroeder@thinkmo:~/logs$ grep "HEAD .*bot" access.log | wc -l
9
aschroeder@thinkmo:~/logs$ grep "GET .*bot" access.log | wc -l
33913
```
I don’t think I need to look at these numbers any more closely.
Hm, there’s also the crawl-delay directive… Interesting.
Here is the result for about 19 hours of service – 2009-10-15 06:27 to 2009-10-16 01:43.
```
----------------------------Bandwidth-------Hits-------Actions
Everybody                       1365M      94959
All Bots                         370M      34894      100%        1%
--------------------------------------------------------------
search.msn.com                241198K      22674       64%        1%
www.google.com                 89422K       7710       22%        1%
www.cuil.com                   31306K       2609        7%        0%
www.majestic12.co.uk            5060K        903        2%        0%
help.naver.com                  1480K        348        0%        1%
www.exabot.com                  2966K        135        0%        2%
```
Thus: 37% of all hits are bots, 27% of all bandwidth is bots. Yikes!
Strangely enough the numbers seem to indicate that bots prefer shorter pages – there might still be an opportunity to save some bandwidth, somewhere.
I think this gives me enough reason to explore bot behaviour some more.
The low percentage of actions is good news. All the stuff I’m afraid of – infinite recent changes, infinite calendars, eternal digging in the page history – requires an action. A high percentage of actions is an indication of bots that ignore nofollow attributes and robot meta information.
The numbers also seem to say that Microsoft is hitting the site once every three seconds, whereas Google only hits us once every nine seconds. Perhaps a crawl delay of 10 would be appropriate?
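In robots.txt that would be just a couple of lines. Crawl-delay is a non-standard extension; msnbot and Yahoo’s Slurp honour it, Google ignores it:

```
# ask Microsoft's crawler to wait ten seconds between requests
User-agent: msnbot
Crawl-delay: 10
```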
---
I bet those HEAD requests were my own from today using Wget with a user-agent set to impersonate a bot.
I can’t believe that bots don’t use the HEAD method. What a sad state of affairs. Perhaps, if they notice that a Web server’s HEAD responses are unhelpful, they drop into a different mode that doesn’t bother with HEAD requests. But I couldn’t find any documentation about Web crawlers using HEAD. More likely I’ve completely misled you: search engines probably send GET requests but rely on the If-Modified-Since header.
I’m not able to check Oddmuse’s behavior on this front with Wget: it doesn’t send the If-Modified-Since or ETag headers in its requests. The Wget manual itself says, “Arguably, HTTP time-stamping should be implemented using the `If-Modified-Since' request.”
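curl can send the header by hand, though, which is enough to check what the server does. The URL is just a stand-in; if the second command prints 304, conditional GET requests are honoured:

```
curl -s -o /dev/null -w '%{http_code}\n' http://example.org/wiki/SomePage
curl -s -o /dev/null -w '%{http_code}\n' \
  -H 'If-Modified-Since: Thu, 15 Oct 2009 06:27:00 GMT' \
  http://example.org/wiki/SomePage
```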
As I wrote in the Comments on 2009-06-12 Referrers, I think the cost of crawling needs to be justified by the visitor traffic it brings. I could understand a decade ago, when there were many search engines, that a lot of Web server load would come from them. Now it is 2009, and there really are only a couple of search engines. Having a third of your traffic come from them is absurd. Potentially, a Crawl-delay rule in your Robots.txt file could cut your server load from bots by two-thirds, since the setting affects MSN (and not Google). I suggest you try it for that reason, and as an excuse just to see whether it really works. 😄
*I think this gives me enough reason to explore bot behaviour some more.*
Obviously, studying your access logs reveals how these bots work. However, I found the best way to check how a bot treats a dynamic site is to operate your own bot against your site. In the old days, everyone maintained a search engine for their own site. This is no longer the case, with content management systems having searchable database backends and most people just relying on Google. People can use Google Custom Search, Google Webmaster Tools, or ignore the problem altogether and pray that Google tunes things for them.
I worked with Mnogo and Htdig long ago and used them to crawl a site but never published the search interface. Their crawlers actually follow the standards closely enough that they may not glean anything useful if Oddmuse does everything it should correctly. I bet the exercise would provide a few insights to improve Oddmuse, though.
– AaronHawley 2009-10-16 05:49 UTC
---
Oddmuse:Download Using Curl And Caching
Apparently search.msn.com is misbehaved!
Big surprise…
```
----------------------------Bandwidth-------Hits-------Actions
Everybody                       1093M      72604
All Bots                         296M      28262      100%        1%
--------------------------------------------------------------
search.msn.com                195956K      19092       67%        1%
```
Today I looked at the bot-analyze output again and noticed an increase in actions:
```
help.naver.com                  1791K        505        1%        9%
www.shopwiki.com                2267K        172        0%       69%
www.wasalive.com                5475K         43        0%       18%
robot                            177K         30        0%      100%
www.plazoo.com)                  948K         23        0%       82%
www.kalooga.com                    0K         16        0%        6%
```
I decided to investigate. *robot* is in fact *YandexBlog*, which only fetches RSS feeds. Ok, a feed reader. *shopwiki* and *naver* were fetching a lot of pointless `GET /emacs-de?action=` stuff. I noticed that I was missing appropriate Disallow rules in my `robots.txt` and added them for the various languages. Let’s see whether this continues.
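For the record, the missing rules amount to one Disallow line per language wiki, covering the action URLs, something like this (the exact paths here are just an illustration):

```
User-agent: *
Disallow: /emacs?action=
Disallow: /emacs-de?action=
Disallow: /emacs-fr?action=
```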
The remaining numbers are really small...