I have this little script called leech-detector that I run on log files every now and then:
```
aschroeder@thinkmo:~$ leech-detector < logs/access.log | head
66.249.65.101    2654  11K  9%   8.7s  200 (86%), 404 (7%), 302 (2%), 304 (2%), 301 (0%), 501 (0%)
72.30.142.82     1055  14K  3%  22.0s  200 (91%), 304 (2%), 404 (2%), 301 (2%), 302 (0%)
87.250.253.241    979  20K  3%  23.7s  200 (96%), 404 (1%), 301 (1%)
67.218.116.134    868  12K  2%  26.7s  200 (85%), 301 (9%), 404 (3%), 302 (1%), 403 (0%)
72.14.199.225     413  15K  1%  56.1s  200 (50%), 304 (41%), 301 (7%)
67.195.110.173    293  18K  1%  78.6s  200 (76%), 304 (7%), 301 (7%), 404 (6%), 302 (2%)
```
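Roughly speaking, the script just aggregates the access log per client IP. A stripped-down Perl sketch of that idea (not the actual script; it assumes Apache's common log format and only reproduces the hit count, the share of all hits, and the status code distribution) might look like this:

```perl
#!/usr/bin/perl
# Sketch: summarize an Apache access log per client IP.
# Columns: IP, hits, share of all hits, status code distribution.
use strict;
use warnings;

my (%hits, %status, $total);
while (<STDIN>) {
  next unless m/^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) /;
  my ($host, $code) = ($1, $2);
  $hits{$host}++;
  $status{$host}{$code}++;
  $total++;
}

exit unless $total;
for my $host (sort { $hits{$b} <=> $hits{$a} } keys %hits) {
  my $codes = join ', ',
    map  { sprintf '%s (%d%%)', $_, 100 * $status{$host}{$_} / $hits{$host} }
    sort { $status{$host}{$b} <=> $status{$host}{$a} } keys %{ $status{$host} };
  printf "%-16s %6d %3d%% %s\n",
    $host, $hits{$host}, 100 * $hits{$host} / $total, $codes;
}
```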
So, what do we have? Running whois on these IP numbers shows that they all belong to search engines.
Those six crawlers account for about 19% of all hits, so I guess that assuming 20% search engine traffic is still a good estimate.
Actually, when grouping by user agent instead of IP number using bot-analyze, we get an even higher number. Apparently the Microsoft bot is spread over so many different IP numbers that it doesn’t show up in the list above:
```
aschroeder@thinkmo:~$ bot-analyze < logs/access.log | head
---------------------------------Hits-------Actions
Total                       29979     100%        0%
---------------------------------------------------
msnbot                       7437      24%        1%
Googlebot                    2817       9%        0%
robot                         950       3%        3%
Exabot                        137       0%        2%
robots                         49       0%        0%
qihoobot                       26       0%       30%
MSNBOT                         22       0%        0%
```
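bot-analyze is equally simple: group the hits by the bot’s user agent, and for the Actions column count how many of those hits were for an `action=` URL, i.e. the expensive stuff. A rough sketch of the idea (again, not the actual script; it assumes the combined log format with the user agent in the last quoted field):

```perl
#!/usr/bin/perl
# Sketch: group hits by a bot name taken from the user agent string and
# count how many of them are "actions", i.e. requests with action= in the URL.
use strict;
use warnings;

my (%hits, %actions);
my ($total, $action_total) = (0, 0);
while (<STDIN>) {
  # combined log format: ... "METHOD URL PROTO" status bytes "referrer" "agent"
  next unless m/"(?:GET|HEAD|POST) (\S+)[^"]*" \d{3} \S+ "[^"]*" "([^"]*)"/;
  my ($url, $agent) = ($1, $2);
  next unless $agent =~ /(\w*bot\w*)/;   # crude: anything calling itself a bot
  my $bot = $1;
  $hits{$bot}++;
  $total++;
  if ($url =~ /\baction=/) {
    $actions{$bot}++;
    $action_total++;
  }
}

exit unless $total;
printf "%-20s %6d %4d%% %4d%%\n", 'Total', $total, 100, 100 * $action_total / $total;
for my $bot (sort { $hits{$b} <=> $hits{$a} } keys %hits) {
  printf "%-20s %6d %4d%% %4d%%\n", $bot, $hits{$bot},
    100 * $hits{$bot} / $total, 100 * ($actions{$bot} || 0) / $hits{$bot};
}
```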
That’s a full 33% of my hits going to search engines. Then again, I pay for bandwidth, not hits. Should I rewrite this to show bandwidth?
I wonder how bigger sites handle this. Perhaps I could decide that 3% of hits is enough for bots, and therefore serve a 503 Service Unavailable response to 10 out of 11 bot requests? To implement this, my script would still have to start up, however, eating CPU cycles and disk accesses.
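Something like this at the very top of the CGI script, before any expensive work, would do it. This is just a sketch, and the user agent test is as crude as it looks:

```perl
# Sketch: turn away 10 out of 11 bot requests before doing any real work.
# The CGI still has to start up for this, of course.
if (($ENV{HTTP_USER_AGENT} || '') =~ /bot|crawler|spider/i
    and rand() < 10 / 11) {
  print "Status: 503 Service Unavailable\r\n";
  print "Retry-After: 600\r\n";
  print "Content-Type: text/plain; charset=UTF-8\r\n";
  print "\r\n";
  print "Too many requests, please slow down.\n";
  exit;
}
```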
Or is 33% a sign of bad design decisions I made? Am I sending the bots around in loops, changing the URLs all the time, making them think they have discovered new pages they have not indexed yet?
#Web #Search
⁂
I can definitely identify with the experience. There’s nothing easier to get wrong than how robots harvest a CGI application. I remember looking at someone’s Perl-based Web calendar that had “next” and “previous” buttons that went on to infinity. There’s a lot of fear about SQL injection, but Web interfaces that fail to halt can wreak havoc.
*Should I rewrite this to show bandwidth?*
Probably, because there’s a chance that Web robots DTRT and use the HTTP HEAD method to see whether page updates are available. According to RFC 2616, section 9.4:
*The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. The metainformation contained in the HTTP headers in response to a HEAD request SHOULD be identical to the information sent in response to a GET request. This method can be used for obtaining metainformation about the entity implied by the request without transferring the entity-body itself. This method is often used for testing hypertext links for validity, accessibility, and recent modification.*
Do the logs suggest they use this method in their hits?
Tooling around with Wget I noticed that timestamping doesn’t work in Oddmuse. Perhaps bots do more sophisticated things with the ETag, but it seems like it would be smart to send the Last-Modified header for Oddmuse:Caching purposes.
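The CGI side of it is not much code. Here is a sketch of the simple case (not Oddmuse code; it assumes an English locale for the date names and only recognizes a date it handed out itself):

```perl
# Sketch: send Last-Modified and answer a matching If-Modified-Since with 304.
use POSIX qw(strftime);

my $ts = time - 3600;   # placeholder: use the page's real modification time
my $last_modified = strftime '%a, %d %b %Y %H:%M:%S GMT', gmtime $ts;

if (($ENV{HTTP_IF_MODIFIED_SINCE} || '') eq $last_modified) {
  print "Status: 304 Not Modified\r\n";
  print "\r\n";
  exit;
}

print "Last-Modified: $last_modified\r\n";
print "Content-Type: text/html; charset=UTF-8\r\n";
print "\r\n";
# ... regular page output follows
```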
– AaronHawley 2009-10-15 16:46 UTC
---
And of course the Oddmuse calendar also goes to infinity. But I hope I’m using robot meta tags and nofollow attributes to prevent well-behaved bots from following the links. Maybe I should check, just to make sure this works as intended.
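What I mean is markup along these lines on the action pages and links that bots should leave alone (just an illustration, not a verbatim copy of what Oddmuse prints):

```html
<!-- in the head of action pages that should not be crawled: -->
<meta name="robots" content="noindex,nofollow">
<!-- and on individual links, e.g. to the page history: -->
<a href="/emacs?action=history;id=SomePage" rel="nofollow">View other revisions</a>
```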
As far as I understand it, the ETag is actually superior to the Last-Modified logic. But perhaps misbehaved bots implement only half the RFC. Also something to check!
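The logic itself is not hard. A sketch of what the CGI side would have to do (not the actual Oddmuse code, and it ignores the fact that If-None-Match may list several tags):

```perl
# Sketch: derive an ETag from whatever identifies the current page version
# and answer matching conditional requests with 304 Not Modified.
my ($id, $revision, $ts) = ('SomePage', 42, time);  # placeholders
my $etag = qq{"$id-$revision-$ts"};

if (($ENV{HTTP_IF_NONE_MATCH} || '') eq $etag) {
  print "Status: 304 Not Modified\r\n";
  print "ETag: $etag\r\n";
  print "\r\n";
  exit;
}

print "ETag: $etag\r\n";
print "Content-Type: text/html; charset=UTF-8\r\n";
print "\r\n";
# ... regular page output follows
```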
As for HEAD requests, I have sad news. Here’s some perspective:
```
aschroeder@thinkmo:~/logs$ grep "HEAD .*bot" access.log | wc -l
9
aschroeder@thinkmo:~/logs$ grep "GET .*bot" access.log | wc -l
33913
```
I don’t think I need to look at these numbers any more closely.
Hm, there’s also the crawl-delay directive… Interesting.
Here is the result for about 19 hours of service – 2009-10-15 06:27 to 2009-10-16 01:43.
```
----------------------------Bandwidth-------Hits-------Actions
Everybody                       1365M      94959
All Bots                         370M      34894      100%        1%
--------------------------------------------------------------
search.msn.com                241198K      22674       64%        1%
www.google.com                 89422K       7710       22%        1%
www.cuil.com                   31306K       2609        7%        0%
www.majestic12.co.uk            5060K        903        2%        0%
help.naver.com                  1480K        348        0%        1%
www.exabot.com                  2966K        135        0%        2%
```
Thus: 37% of all hits are bots, 27% of all bandwidth is bots. Yikes!
Strangely enough the numbers seem to indicate that bots prefer shorter pages – there might still be an opportunity to save some bandwidth, somewhere.
I think this gives me enough reason to explore bot behaviour some more.
The low percentage of actions is good news. All the stuff I’m afraid of – infinite recent changes, infinite calendars, eternal digging in the page history – requires an action. A high percentage of actions is an indication of bots that ignore nofollow attributes and robot meta information.
The numbers also seem to say that Microsoft is hitting the site once every three seconds, whereas Google only hits us once every nine seconds. Perhaps a crawl delay of 10 would be appropriate?
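In robots.txt that would be just a couple of lines. Crawl-delay is a non-standard extension; msnbot and Yahoo’s Slurp honour it, Google ignores it:

```
# ask Microsoft's crawler to wait ten seconds between requests
User-agent: msnbot
Crawl-delay: 10
```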
---
I bet those HEAD requests were my own from today using Wget with a user-agent set to impersonate a bot.
I can’t believe that bots don’t use the HEAD method. What a sad state of affairs. Perhaps, if they notice that a Web server’s HEAD responses are unhelpful, they drop into a different mode that doesn’t bother with HEAD requests. But I couldn’t find any documentation about Web crawlers using HEAD. More likely I’ve completely misled you: search engines probably send GET requests but rely on the If-Modified-Since header.
I’m not able to check Oddmuse’s behavior on this front with Wget: it doesn’t send the If-Modified-Since or ETag headers in its requests. The Wget manual itself says, “Arguably, HTTP time-stamping should be implemented using the `If-Modified-Since' request.”
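curl can send the header by hand, though, which is enough to check what the server does. The URL is just a stand-in; if the second command prints 304, conditional GET requests are honoured:

```
curl -s -o /dev/null -w '%{http_code}\n' http://example.org/wiki/SomePage
curl -s -o /dev/null -w '%{http_code}\n' \
  -H 'If-Modified-Since: Thu, 15 Oct 2009 06:27:00 GMT' \
  http://example.org/wiki/SomePage
```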
As I wrote in the Comments on 2009-06-12 Referrers, I think the cost of crawling needs to be justified by the visitor traffic it brings. I could understand a decade ago, when there were many search engines, that a lot of Web server load would come from them. Now it is 2009, and there really are only a couple of search engines. Having a third of your traffic come from them is absurd. Potentially, a Crawl-delay rule in your Robots.txt file could cut your server load from bots by two-thirds, since the setting affects MSN (and not Google). I suggest you try it for that reason, and as an excuse just to see whether it really works. 😄
*I think this gives me enough reason to explore bot behaviour some more.*
Obviously, studying your access logs reveals how these bots work. However, I found the best way to check how a bot treats a dynamic site is to operate your own bot against your site. In the old days, everyone maintained a search engine for their own site. This is no longer the case, with content management systems having searchable database backends and most people just relying on Google. People can use Google Custom Search, Google Webmaster Tools, or ignore the problem altogether and pray that Google tunes things for them.
I worked with Mnogo and Htdig long ago and used them to crawl a site but never published the search interface. Their crawlers actually follow the standards closely enough that they may not glean anything useful if Oddmuse does everything it should correctly. I bet the exercise would provide a few insights to improve Oddmuse, though.
– AaronHawley 2009-10-16 05:49 UTC
---
Oddmuse:Download Using Curl And Caching
Apparently search.msn.com is misbehaved!
Big surprise…
```
----------------------------Bandwidth-------Hits-------Actions
Everybody                       1093M      72604
All Bots                         296M      28262      100%        1%
--------------------------------------------------------------
search.msn.com                195956K      19092       67%        1%
```
Today I looked at the bot-analyze output again and noticed an increase in actions:
```
help.naver.com                  1791K        505        1%        9%
www.shopwiki.com                2267K        172        0%       69%
www.wasalive.com                5475K         43        0%       18%
robot                            177K         30        0%      100%
www.plazoo.com)                  948K         23        0%       82%
www.kalooga.com                    0K         16        0%        6%
```
I decided to investigate. *robot* is in fact *YandexBlog*, which only fetches RSS feeds. Ok, a feed reader. *shopwiki* and *naver* were fetching a lot of pointless `GET /emacs-de?action=` stuff. I noticed that I was missing appropriate Disallow rules in my `robots.txt` and added them for the various languages. Let’s see whether this continues.
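For the record, the missing rules amount to one Disallow line per language wiki, covering the action URLs, something like this (the exact paths here are just an illustration):

```
User-agent: *
Disallow: /emacs?action=
Disallow: /emacs-de?action=
Disallow: /emacs-fr?action=
```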
The remaining numbers are really small...