2019-06-25 Bots rule the web

I looked at my web logs again. I have a script that looks at the user agent entry in my logs and uses the following regular expression to figure out what sort of bot we’re looking at:

/([a-z0-9@+.]*bot[a-z0-9@+.]*)/i

That is, it captures the first run of letters, digits, `@`, `+` and `.` that contains “bot”, ignoring case: some sort of word, domain name or email address. Given a user agent like the following, the script counts a hit for “serpstatbot”.

"serpstatbot/1.0 (advanced backlink tracking bot; http://serpstatbot.com/; abuse@serpstatbot.com)"
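To see what the regular expression captures, here is a minimal sketch that applies it to the user agent above. The helper name `bot_name` is made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the first "word" containing "bot",
# or undef if the user agent contains no such word.
sub bot_name {
  my ($agent) = @_;
  my ($name) = $agent =~ /([a-z0-9@+.]*bot[a-z0-9@+.]*)/i;
  return $name;
}

my $agent = 'serpstatbot/1.0 (advanced backlink tracking bot; http://serpstatbot.com/; abuse@serpstatbot.com)';
print bot_name($agent), "\n"; # prints "serpstatbot"
```

The match stops at the slash because `/` is not in the character class, so the version number is not part of the name.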

Here’s data showing that 21% of my hits are bots (18253 / 88862). Of these, 20% are by the Google bot, 19% are by the Bing bot, 16% are by serpstatbot, 10% are by the Yandex bot, 5% are by the Apple bot, and so on. And that is despite a long `robots.txt` file!
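As a quick check of the arithmetic, using the hit counts from the table further down: the 21% is a rounded figure, while the script prints its percentages with `%d`, which truncates.

```perl
use strict;
use warnings;

# Hit counts taken from the table: everybody, all bots, Googlebot.
my ($all_hits, $bot_hits, $google_hits) = (88862, 18253, 3818);

# Share of all hits that come from bots: about 20.5%, or roughly 21%.
printf "%.1f%%\n", 100 * $bot_hits / $all_hits;

# Googlebot's share of the bot hits. %d truncates 20.9 to 20,
# which is why the table says 20%.
printf "%d%%\n", 100 * $google_hits / $bot_hits;
```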

robots.txt

Stupid bots! 😠

I’m going to add `serpstatbot` to my `robots.txt` files.
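A minimal sketch of such an entry, assuming the bot answers to the token `serpstatbot` (taken from its user agent string) and that it should be kept off the entire site:

```
# Ask serpstatbot to stay away from the whole site
User-agent: serpstatbot
Disallow: /
```

This only helps if the bot actually reads and obeys `robots.txt`.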

Notice the entry in the data below that just says `bot`. How stupid is that? One possible culprit is the following user agent: the only part of it that contains “bot” is the last path segment of the help URL, so that is all the regular expression captures. Thanks for nothing, Qwant.

Mozilla/5.0 (compatible; Qwantify/Bleriot/1.1; +https://help.qwant.com/bot)
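A small sketch of what happens with this user agent. The second half is a hypothetical refinement, not something the script does: it treats a bare “bot” as uninformative and falls back to the host name of the first URL in the user agent.

```perl
use strict;
use warnings;

my $agent = 'Mozilla/5.0 (compatible; Qwantify/Bleriot/1.1; +https://help.qwant.com/bot)';

# The slash before "bot" is not in the character class, so the
# capture is just the last path segment of the URL.
my ($name) = $agent =~ /([a-z0-9@+.]*bot[a-z0-9@+.]*)/i;
print "$name\n"; # prints "bot"

# Hypothetical refinement: if the capture is a bare "bot",
# use the host name of the URL instead.
if (lc $name eq 'bot' and $agent =~ m{https?://([^ /()]+)}) {
  $name = $1;
}
print "$name\n"; # prints "help.qwant.com"
```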

And now for the data:

# /home/alex/bin/bot-detector < /var/log/apache2/access.log
    ----------------------------Bandwidth-------Hits-------Actions--Delay
                     Everybody      3417M      88862
                      All Bots       767M      18253   100%     9%
    ---------------------------------------------------------------------
                     Googlebot     62588K       3818    20%    21%    15s
                       bingbot    493730K       3540    19%     6%    17s
                   serpstatbot     15398K       2965    16%     3%    20s
                     YandexBot    148204K       1949    10%     0%    30s
                      Applebot      5840K        916     5%     0%    66s
                         CCBot      6880K        769     4%    21%    78s
                           bot     14858K        755     4%     5%    79s
        +centurybot9@gmail.com      5086K        567     3%     2%   106s
                        DotBot       959K        351     1%     0%   171s
                      chimebot     14805K        266     1%     0%   228s
                       Gigabot      2246K        245     1%     0%   248s
                        Exabot      1994K        238     1%    29%   254s
                    SemrushBot       633K        196     1%     0%   304s
                      Slackbot      1203K        179     0%    95%   338s
                        robots      1643K        146     0%     0%   408s
                   ZoominfoBot      1110K        134     0%     0%   447s
                      Cliqzbot       577K        133     0%     0%   436s
                       BLEXBot       189K        131     0%     0%   435s
                    robot.html       947K        111     0%     0%   417s
                           Bot       843K        104     0%     0%   579s
                    istellabot       283K         88     0%    14%   285s
           trendictionbot0.5.0       301K         61     0%    39%   982s
                    PaperLiBot       455K         52     0%    50%   1087s
                DomainStatsBot      1199K         47     0%    14%   137s
                       MagiBot       400K         46     0%     0%   1309s
                    Twitterbot       256K         43     0%     0%   1264s
                     MojeekBot        88K         41     0%     2%   1183s
                      rogerbot       197K         40     0%     0%   1314s
                          bots       207K         40     0%     0%   1396s
                    SEMrushBot       105K         36     0%     0%   1594s
                     coccocbot       149K         27     0%     0%   1748s
                       yacybot        78K         20     0%     5%   2276s
                     RSSingBot       131K         19     0%     0%   2561s
                       MJ12bot        43K         16     0%     0%   3706s
                     BoxcarBot         6K         14     0%   100%   4479s
                        SMTBot       129K         13     0%     0%     8s
                         ICBot        40K         12     0%     0%   2256s
           bot@linkfluence.com        85K         11     0%     0%   4936s
                     Uptimebot        45K         10     0%     0%   5505s
                   SurdotlyBot        30K          8     0%     0%     0s
        YandexAccessibilityBot        63K          7     0%     0%   5470s
                  TweetmemeBot        37K          6     0%     0%   4566s
                       feedbot       639K          6     0%    33%   7430s
                  Laserlikebot        19K          5     0%     0%   9075s
                        ZumBot        72K          5     0%     0%     1s
                          oBot        15K          5     0%     0%   3495s
               Mediatoolkitbot        49K          4     0%     0%   10673s
                    startmebot        38K          4     0%     0%   1991s
                    toot.robot        16K          4     0%     0%     1s
               YandexMobileBot        46K          4     0%     0%   13973s
                     AhrefsBot         9K          4     0%     0%   4874s
                   DuckDuckBot        17K          4     0%     0%     1s
                     SabsimBot        60K          4     0%    25%   3773s
                       ZoomBot        17K          4     0%     0%     0s
                 wiederfreibot        36K          3     0%    66%   21665s
                  OutclicksBot         1K          3     0%     0%   6324s
                   TelegramBot        28K          3     0%     0%   9789s
                      bot.html        46K          3     0%     0%   15600s
                      bitlybot        11K          2     0%     0%     0s
                  botsin.space         7K          2     0%     0%   2114s
                     BublupBot         8K          2     0%     0%   23512s
                    robots.txt         9K          2     0%     0%     0s
                  AwarioRssBot        28K          2     0%   100%   3831s
                    Discordbot        50K          2     0%     0%   9774s
                     redditbot         4K          2     0%     0%     1s
                   newsbots.eu         4K          1     0%     0%     0s
                  OnalyticaBot         4K          1     0%     0%     0s
                    LivelapBot        10K          1     0%     0%     0s
                       Facebot       219K          1     0%     0%     0s

Remember, these numbers already exclude all the bots that stay away from my sites because of `robots.txt`.

“Actions” is the share of a bot’s requests whose URL contains a query parameter called “action”. This is an indication of a misbehaving bot that follows links it shouldn’t follow.
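Here is a sketch of how that share can be computed. The request lines are made up for illustration; only the test for `action=` is the one the script uses.

```perl
use strict;
use warnings;

# Hypothetical request lines, as they appear between quotes in the log.
my @requests = (
  'GET /emacs/SiteMap HTTP/1.1',
  'GET /emacs?action=edit;id=SiteMap HTTP/1.1',
  'GET /emacs?action=history;id=SiteMap HTTP/1.1',
  'GET /images/logo218x38.png HTTP/1.1',
);

my ($hits, $actions) = (0, 0);
for my $request (@requests) {
  # The request line is "METHOD URI PROTOCOL"; we only need the URI.
  my ($method, $uri) = split ' ', $request;
  $hits++;
  $actions++ if $uri =~ /action=/i; # same test as in the script
}
printf "%d%% of %d hits are actions\n", 100 * $actions / $hits, $hits;
```

Note that the test matches `action=` anywhere in the URL, so it would also count a parameter such as `reaction=`. A stricter pattern like `/[?&;]action=/i` is one possible alternative.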

“Delay” is the average number of seconds between two hits by the same bot. It is there to show whether the bot observes the crawl delay I specify in my `robots.txt`.
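The delay is computed as the time between a bot’s first and last hit, divided by the number of gaps between hits. A minimal sketch with made-up timestamps:

```perl
use strict;
use warnings;

# Hypothetical hit times for one bot, in seconds since the epoch.
my @times = (1561464000, 1561464010, 1561464025, 1561464040, 1561464060);

# Average gap between consecutive hits: the time between the first
# and the last hit, divided by the number of gaps. A single hit has
# no gap, so the delay is reported as 0.
my $delay = @times > 1 ? ($times[-1] - $times[0]) / (@times - 1) : 0;
printf "%ds\n", $delay; # prints "15s"
```

An average at or above the configured crawl delay is consistent with a well-behaved bot, although a burst of hits followed by a long pause can hide in an average.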

If you look at the source code, you’ll see that my log files no longer contain any IP addresses. That’s how I try to protect the privacy of my visitors, even against myself. 🙄
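One way to get such log files with Apache is a custom log format that simply leaves out the client address. This is a guess reconstructed from the example log line in the source code below, not my actual configuration; the nickname `privacy` is made up.

```
# The combined log format, prefixed with virtual host and port,
# with %h (client address) and %l (identd user) left out
LogFormat "%v:%p %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" privacy
CustomLog ${APACHE_LOG_DIR}/access.log privacy
```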

Source code:

#!/usr/bin/perl
use strict;
use Time::ParseDate;

my $all = grep /--all/, @ARGV;
my %agent;
my %action;
my %bandwidth;
my $actions;
my $hits;
my $bandwidth;
my $bot_bandwidth;
my $bot_hits;
my %first;
my %last;
while (<STDIN>) {
  # example line from my log file
  # www.emacswiki.org:443 - [25/Jun/2019:14:01:04 +0200] "GET /images/logo218x38.png HTTP/1.1" 200 3919 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/75.0.3770.90 Chrome/75.0.3770.90 Safari/537.36"

  #  parse line
  unless (m/^(\S+:\d+) (-|admin) \[(.*?)\] "(.*?)" (\d+) (\d+|-) "(.*?)" "(.*?)"/) {
    warn "Cannot parse:\n$_";
    next;
  }
  # the pattern has eight capture groups, one per variable
  my ($host, $user, $time, $request, $code, $bytes, $referrer, $agent) = ($1, $2, $3, $4, $5, $6, $7, $8);

  # determine the value of $uri
  my ($method, $uri, $junk) = split(' ', $request, 3);

  $bytes = 0 if $bytes eq '-'; # "-" is logged when no response body was sent
  $hits++;
  $bandwidth += $bytes;
  my $domain;
  next unless $all or ($domain) = $agent =~ /([a-z0-9@+.]*bot[a-z0-9@+.]*)/i;
  my $key = $domain;
  $key = $1 if not $key and $agent =~ /https?:\/\/([^ \/()]+)/; # prefer just the domain of the bot
  $key ||= $agent; # fallback: everything
  $agent{$key}++;
  $bandwidth{$key} += $bytes;
  $bot_hits++;
  $bot_bandwidth += $bytes;

  my $date = parsedate($time);
  $first{$key} = $date unless $first{$key};
  $last{$key} = $date;

  if ($uri =~ /action=/i) {
    $actions++;
    $action{$key}++;
  }
}
my @result = sort {$agent{$b} <=> $agent{$a}} keys %agent;

print "    ----------------------------Bandwidth-------Hits-------Actions--Delay\n";
printf "%30s %9dM %10d\n", 'Everybody',
  $bandwidth / 1024 / 1024, $hits;
printf "%30s %9dM %10d   %3d%%   %3d%%\n", 'All Bots',
  ($bot_bandwidth // 0) / 1024 / 1024, $bot_hits // 0, 100,
  $bot_hits ? 100 * ($actions // 0) / $bot_hits : 0; # avoid dividing by zero when no bot was seen
print "    ---------------------------------------------------------------------\n";
foreach my $key (@result) {
  # average delay between hits; bots with a single hit get 0
  my $avg = 0;
  if ($first{$key} and $last{$key} and $agent{$key} > 1) {
    $avg = ($last{$key} - $first{$key}) / ($agent{$key} - 1);
  }
  printf "%30s %9dK %10d   %3d%%   %3d%%   %3ds\n", $key,
      $bandwidth{$key} / 1024,
      $agent{$key},
      100 * $agent{$key} / $bot_hits,
      100 * ($action{$key} // 0) / $agent{$key}, # no entry means no actions
      $avg;
}

#Bots #Administration #Butlerian Jihad

@wion

Blocking robots on your web page – the list of 1800 bad bots

On the one hand, wow! So much block! On the other hand, reading the comments, it is obvious that there will be the occasional false positive in that list, and that makes me wary, and weary.

Apache config file to block user agents
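For a sense of what such a config does, here is a minimal sketch using `mod_rewrite`. It is not taken from the linked file, and the two bot names are just examples picked from the table above.

```
# Answer 403 Forbidden to any user agent containing one of these names
<IfModule mod_rewrite.c>
  RewriteEngine On
  RewriteCond %{HTTP_USER_AGENT} (serpstatbot|SemrushBot) [NC]
  RewriteRule ^ - [F]
</IfModule>
```

The `[NC]` flag makes the comparison case-insensitive, which matters because the same bot shows up in my logs as both `SemrushBot` and `SEMrushBot`.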

49.6% of all internet traffic came from bots in 2023, a 2% increase over the previous year, and the highest level Imperva has reported since it began monitoring automated traffic in 2013. – Bots dominate internet activity, account for nearly half of all traffic
