2018-08-27 Bot Traffic

I guess machines aren’t ruling the world, yet. But bots are close to eating our web. Check out my robots.txt file. I want crawlers to use a long delay. I disallow many of them. And I still have 20% bot traffic in a random 24h period (from `26/Aug/2018:06:25:11 +0200` to `27/Aug/2018:06:25:07 +0200`, so mostly on a Sunday).

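The file isn’t reproduced here; just to illustrate what such rules look like, a sketch along those lines (the 20s crawl delay is the one mentioned below; the two disallowed bots are examples, not my actual list):

    User-agent: *
    Crawl-delay: 20

    User-agent: MJ12bot
    Disallow: /

    User-agent: SemrushBot
    Disallow: /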

# cat /var/log/apache2/access.log.1 | /home/alex/bin/bot-detector | head
    ----------------------------Bandwidth-------Hits-------Actions--Delay
                     Everybody      6728M      86836
                      All Bots       729M      17908   100%     4%
    ---------------------------------------------------------------------
                      Applebot     27431K       5068    28%     0%    17s
                       bingbot    510446K       5015    28%     1%    17s
                     Googlebot     34651K       3113    17%    15%    27s
                     YandexBot     75768K       2753    15%     1%    31s
                        DotBot      1266K        444     2%     0%   193s
                     SeznamBot       226K        252     1%     1%   334s

Notice that, across all IP numbers, Applebot and bingbot stay below the 20s of crawl delay I ask for in my `robots.txt` (the last column is the average delay between hits: 17s for both).

The bot detector is very simple: basically, I’m looking for a word containing “bot” in the user agent field (there’s a small test of the regex right after the script).

#!/usr/bin/perl
use strict;
use Time::ParseDate;

my $all = grep /--all/, @ARGV;
my %agent;
my %action;
my %bandwidth;
my $actions;
my $hits;
my $bandwidth;
my $bot_bandwidth;
my $bot_hits;
my %first;
my %last;
while (<STDIN>) {
  # condense one or more whitespace characters into a single space
  s/\s+/ /go;

  # break each apache access_log record into ten variables
  my ($host, $address, $rfc1413, $username, $time, $request,
      $status, $bytes, $referer, $agent) =
      /^(\S+) (\S+) (\S+) (\S+) \[(.+)\] \"(.*)\" (\S+) (\S+) \"(.*)\" \"(.*)\"/;

  # determine the value of $uri
  my ($method, $uri, $junk) = split(' ', $request, 3);

  # campaignwiki.org:80 000.000.000.000 - - [14/Apr/2018:14:21:14 +0200] "GET /robots.txt HTTP/1.0" 301 462 "-" ""
  # alexschroeder.ch:443 000.000.000.000 - - [14/Apr/2018:09:21:43 +0200] "" 400 3963 "-" "-"
  # alexschroeder.ch:443 000.000.000.000 - - [14/Apr/2018:06:27:27 +0200] "-" 408 3973 "-" "-"
  # www.emacswiki.org:443 000.000.000.000 - - [14/Apr/2018:16:42:12 +0200] "GET /images/logo218x38.png HTTP/1.1" 200 3872 "-" "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:52.0) Gecko/20100101 Firefox/52.0"
  warn "Cannot parse: $_\n" unless $host;
  $hits++;
  $bandwidth += $bytes;
  # only keep bot requests: look for a word containing "bot" in the user agent
  my $domain;
  next unless $all or ($domain) = $agent =~ /([a-z0-9@+.]*bot[a-z0-9@+.]*)/i;
  my $key = $domain;
  $key = $1 if not $key and $agent =~ /https?:\/\/([^ \/()]+)/; # prefer just the domain of the bot
  $key ||= $agent; # fallback: everything
  $agent{$key}++;
  $bandwidth{$key} += $bytes;
  $bot_hits++;
  $bot_bandwidth += $bytes;

  # remember the first and the last request time per bot so that the average
  # delay between hits can be computed later
  my $date = parsedate($time);
  $first{$key} = $date unless $first{$key};
  $last{$key} = $date;

  # count requests for URIs with an action parameter
  if ($uri =~ /action=/i) {
    $actions++;
    $action{$key}++;
  }
}
# sort the bots by number of hits, busiest first
my @result = sort {$agent{$b} <=> $agent{$a}} keys %agent;

print "    ----------------------------Bandwidth-------Hits-------Actions--Delay\n";
printf "%30s %9dM %10d\n", 'Everybody',
  $bandwidth / 1024 / 1024, $hits;
printf "%30s %9dM %10d   %3d%%   %3d%%\n", 'All Bots',
  $bot_bandwidth / 1024 / 1024, $bot_hits, 100, 100 * $actions / $bot_hits;
print "    ---------------------------------------------------------------------\n";
foreach my $key (@result) {
  # average delay between hits: the time span from the first to the last
  # request, divided by the number of intervals
  my $avg = "";
  if ($first{$key} and $last{$key} and $agent{$key} > 1) {
    $avg = ($last{$key} - $first{$key}) / ($agent{$key} - 1);
  }
  printf "%30s %9dK %10d   %3d%%   %3d%%   %3ds\n", $key,
      $bandwidth{$key} / 1024,
      $agent{$key},
      100 * $agent{$key} / $bot_hits,
      100 * $action{$key} / $agent{$key},
      $avg;
}
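
To see what the regex extracts, here is a tiny test script (just an illustration, not part of bot-detector) run against two sample user agent strings:

#!/usr/bin/perl
use strict;

# two sample user agent strings
my @agents = (
  'Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)',
  'Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)',
);

for my $agent (@agents) {
  # same regex as in bot-detector: a word containing "bot"
  my ($key) = $agent =~ /([a-z0-9@+.]*bot[a-z0-9@+.]*)/i;
  print "$key\n"; # prints "bingbot" and then "YandexBot"
}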

Ten years ago: 2008-10-09 How Many Bot Hits.

#Administration #Web

Comments

(Please contact me if you want to remove your comment.)

@dredmorbius said they had „long found clustering by ASN & CIDR block to be far more revealing.” So, time to learn about the Route Views Project?
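
A crude first approximation would be to bucket requests by /24 prefix instead of by user agent; real ASN clustering would need the Route Views data or some other IP-to-ASN mapping. A sketch along those lines, reading the same access log format as above:

#!/usr/bin/perl
use strict;

# count hits per /24 network block instead of per user agent
my %hits;
while (<STDIN>) {
  # the second field of the access log is the client IP address
  my ($prefix) = /^\S+ (\d+\.\d+\.\d+)\.\d+ /;
  next unless $prefix;
  $hits{"$prefix.0/24"}++;
}

# print the ten busiest blocks
my @blocks = sort { $hits{$b} <=> $hits{$a} } keys %hits;
my $n = 0;
for my $block (@blocks) {
  printf "%20s %10d\n", $block, $hits{$block};
  last if ++$n >= 10;
}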

– Alex Schroeder 2018-08-27 18:24 UTC

---

People often don’t understand that all systems online are under constant automated attack until they start looking at their web server logs and their ssh logins (if they run the service on the default port). My web server access log for the last 24h shows 336 requests for the non-existent WordPress login page, for example.

# awk '/wp/ {print $8}' < /var/log/apache2/access.log.1 | sort | uniq --count | sort --numeric | tail
      5 //web/wp-includes/wlwmanifest.xml
      5 //wordpress/wp-includes/wlwmanifest.xml
      5 //wp1/wp-includes/wlwmanifest.xml
      5 //wp2/wp-includes/wlwmanifest.xml
      5 /wp-content/plugins/wp-file-manager/readme.txt
      5 //wp-includes/wlwmanifest.xml
      5 //wp/wp-includes/wlwmanifest.xml
      8 /wp-content/plugins/iva-business-hours-pro/assets/fontello/LICENSE.txt
     24 /wp-admin/css/
    336 /wp-login.php

Eight requests for some “iva business hours pro” plugin file? I guess they do this in order to find vulnerable servers?

– Alex 2021-08-13 13:23 UTC

---

Note that “wp” stands for “WordPress” here, which shows another lesson: whatever you run your blog on, don’t use the most popular software for it.

– deshipu 2021-08-13 14:06 UTC