I guess machines aren’t ruling the world, yet. But bots are close to eating our web. Check out my robots.txt file: I ask crawlers to use a long delay, and I disallow many of them outright. And I still get 20% bot traffic in a random 24h period (from `26/Aug/2018:06:25:11 +0200` to `27/Aug/2018:06:25:07 +0200`, so mostly on a Sunday).
    # cat /var/log/apache2/access.log.1 | /home/alex/bin/bot-detector | head
     ----------------------------Bandwidth-------Hits-------Actions--Delay
                         Everybody     6728M      86836
                          All Bots      729M      17908 100%   4%
     ---------------------------------------------------------------------
                          Applebot    27431K       5068  28%   0%  17s
                           bingbot   510446K       5015  28%   1%  17s
                         Googlebot    34651K       3113  17%  15%  27s
                         YandexBot    75768K       2753  15%   1%  31s
                            DotBot     1266K        444   2%   0% 193s
                         SeznamBot      226K        252   1%   1% 334s
Notice that, summed across all their IP addresses, Applebot and bingbot use less than the 20s of crawl delay I ask for in my `robots.txt`: about 5000 hits spread over 24 hours works out to roughly one request every 17 seconds.
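The relevant part of the robots.txt looks something like this; treat it as a sketch with a made-up bot name rather than my actual file:

    # a sketch, not my actual robots.txt; "ExampleBot" is a placeholder
    User-agent: ExampleBot
    Disallow: /

    User-agent: *
    Crawl-delay: 20

Crawl-delay is a non-standard extension, so honouring it is entirely up to the individual crawler.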
The bot detector is very simple. Basically I’m looking for a word containing “bot” in the user agent field.
    #!/usr/bin/perl
    use strict;
    use Time::ParseDate;

    my $all = grep /--all/, @ARGV;

    my %agent;
    my %action;
    my %bandwidth;
    my $actions;
    my $hits;
    my $bandwidth;
    my $bot_bandwidth;
    my $bot_hits;
    my %first;
    my %last;

    while (<STDIN>) {
      # condense one or more whitespace character to one single space
      s/\s+/ /go;
      # break each apache access_log record into ten variables
      my ($host, $address, $rfc1413, $username, $time, $request, $status,
          $bytes, $referer, $agent) =
          /^(\S+) (\S+) (\S+) (\S+) \[(.+)\] \"(.*)\" (\S+) (\S+) \"(.*)\" \"(.*)\"/;
      # determine the value of $uri
      my ($method, $uri, $junk) = split(' ', $request, 3);
      # campaignwiki.org:80 000.000.000.000 - - [14/Apr/2018:14:21:14 +0200] "GET /robots.txt HTTP/1.0" 301 462 "-" ""
      # alexschroeder.ch:443 000.000.000.000 - - [14/Apr/2018:09:21:43 +0200] "" 400 3963 "-" "-"
      # alexschroeder.ch:443 000.000.000.000 - - [14/Apr/2018:06:27:27 +0200] "-" 408 3973 "-" "-"
      # www.emacswiki.org:443 000.000.000.000 - - [14/Apr/2018:16:42:12 +0200] "GET /images/logo218x38.png HTTP/1.1" 200 3872 "-" "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:52.0) Gecko/20100101 Firefox/52.0"
      warn "Cannot parse: $_\n" unless $host;
      $hits++;
      $bandwidth += $bytes;
      # only keep user agents containing "bot" (unless --all was given)
      my $domain;
      next unless $all or ($domain) = $agent =~ /([a-z0-9@+.]*bot[a-z0-9@+.]*)/i;
      my $key = $domain;
      $key = $1 if not $key and $agent =~ /https?:\/\/([^ \/()]+)/; # prefer just the domain of the bot
      $key ||= $agent; # fallback: everything
      $agent{$key}++;
      $bandwidth{$key} += $bytes;
      $bot_hits++;
      $bot_bandwidth += $bytes;
      # remember first and last hit per bot for the delay column
      my $date = parsedate($time);
      $first{$key} = $date unless $first{$key};
      $last{$key} = $date;
      if ($uri =~ /action=/i) {
        $actions++;
        $action{$key}++;
      }
    }

    my @result = sort {$agent{$b} <=> $agent{$a}} keys %agent;

    print " ----------------------------Bandwidth-------Hits-------Actions--Delay\n";
    printf "%30s %9dM %10d\n", 'Everybody', $bandwidth / 1024 / 1024, $hits;
    printf "%30s %9dM %10d %3d%% %3d%%\n", 'All Bots',
        $bot_bandwidth / 1024 / 1024, $bot_hits,
        100, 100 * $actions / $bot_hits;
    print " ---------------------------------------------------------------------\n";
    foreach my $key (@result) {
      # average delay between hits: time covered divided by number of intervals
      my $avg = "";
      if ($first{$key} and $last{$key} and $agent{$key} > 1) {
        $avg = ($last{$key} - $first{$key}) / ($agent{$key} - 1);
      }
      printf "%30s %9dK %10d %3d%% %3d%% %3ds\n", $key,
          $bandwidth{$key} / 1024, $agent{$key},
          100 * $agent{$key} / $bot_hits,
          100 * $action{$key} / $agent{$key},
          $avg;
    }
Ten years ago: 2008-10-09 How Many Bot Hits.
#Administration #Web
(Please contact me if you want to remove your comment.)
⁂
@dredmorbius said they had “long found clustering by ASN & CIDR block to be far more revealing.” So, time to learn about the Route Views Project?
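I haven’t tried that yet; I guess a first step, without any ASN data, would be to count hits per /24 block instead of per user agent. A rough sketch, assuming the same access log format as above:

    #!/usr/bin/perl
    # rough sketch: tally hits per /24 block; assumes the client IP is the
    # second whitespace-separated field, as in the access log format above
    use strict;
    my %hits;
    while (<STDIN>) {
      my ($host, $address) = split ' ';
      next unless $address and $address =~ /^(\d+\.\d+\.\d+)\.\d+$/; # IPv4 only
      $hits{"$1.0/24"}++;
    }
    foreach my $block (sort { $hits{$b} <=> $hits{$a} } keys %hits) {
      printf "%20s %6d\n", $block, $hits{$block};
    }

Mapping those blocks to the AS that announces them is where the Route Views data would come in, I assume.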
– Alex Schroeder 2018-08-27 18:24 UTC
---
People often don’t realize that all systems online are under constant automated attack until they start looking at their web server logs and their ssh logins (if they run the service on the default port). My web server access log for the last 24h period shows 336 requests for the non-existent WordPress login page, for example.
    # awk '/wp/ {print $8}' < /var/log/apache2/access.log.1 | sort | uniq --count | sort --numeric | tail
          5 //web/wp-includes/wlwmanifest.xml
          5 //wordpress/wp-includes/wlwmanifest.xml
          5 //wp1/wp-includes/wlwmanifest.xml
          5 //wp2/wp-includes/wlwmanifest.xml
          5 /wp-content/plugins/wp-file-manager/readme.txt
          5 //wp-includes/wlwmanifest.xml
          5 //wp/wp-includes/wlwmanifest.xml
          8 /wp-content/plugins/iva-business-hours-pro/assets/fontello/LICENSE.txt
         24 /wp-admin/css/
        336 /wp-login.php
Eight requests for some “iva business hours pro” plugin file? I guess they do this in order to find vulnerable servers?
– Alex 2021-08-13 13:23 UTC
---
Note that “wp” stands for “WordPress” here. That shows another lesson: whatever you run your blog on, don’t use the most popular software for it.
– deshipu 2021-08-13 14:06 UTC