I looked at my web logs again. I have a script that examines the user agent of every request and uses the following regular expression to figure out what sort of bot we’re looking at:
/([a-z0-9@+.]*bot[a-z0-9@+.]*)/i
Thus, some sort of word or email address containing the word “bot”. Given a user agent like the following, the script counts that as a hit for “serpstatbot”.
"serpstatbot/1.0 (advanced backlink tracking bot; http://serpstatbot.com/; abuse@serpstatbot.com)"
Here’s data showing that 21% of my hits are bots (18253 of 88862). Of these, 20% are by the Google bot, 19% by the Bing bot, 10% by the Yandex bot, 5% by the Apple bot, and so on. And that is with a long `robots.txt` file already in place!
Stupid bots! 😠
I’m going to add `serpstatbot` to my `robots.txt` files.
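I haven’t written the entry yet, but I assume it will be the usual two lines, something like this:

```
# deny serpstatbot everything
User-agent: serpstatbot
Disallow: /
```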
Notice the entry that just says `bot`. How stupid is that? One possible culprit is this one, since the only part of its user agent that matches the expression is the bare “bot” at the end of the URL. Thanks for nothing, Qwant.
Mozilla/5.0 (compatible; Qwantify/Bleriot/1.1; +https://help.qwant.com/bot)
And now for the data:
```
# /home/alex/bin/bot-detector < /var/log/apache2/access.log
 ----------------------------Bandwidth-------Hits-------Actions--Delay
                     Everybody     3417M      88862
                      All Bots      767M      18253 100%   9%
 ---------------------------------------------------------------------
                     Googlebot    62588K       3818  20%  21%  15s
                       bingbot   493730K       3540  19%   6%  17s
                   serpstatbot    15398K       2965  16%   3%  20s
                     YandexBot   148204K       1949  10%   0%  30s
                      Applebot     5840K        916   5%   0%  66s
                         CCBot     6880K        769   4%  21%  78s
                           bot    14858K        755   4%   5%  79s
        +centurybot9@gmail.com     5086K        567   3%   2% 106s
                        DotBot      959K        351   1%   0% 171s
                      chimebot    14805K        266   1%   0% 228s
                       Gigabot     2246K        245   1%   0% 248s
                        Exabot     1994K        238   1%  29% 254s
                    SemrushBot      633K        196   1%   0% 304s
                      Slackbot     1203K        179   0%  95% 338s
                        robots     1643K        146   0%   0% 408s
                   ZoominfoBot     1110K        134   0%   0% 447s
                      Cliqzbot      577K        133   0%   0% 436s
                       BLEXBot      189K        131   0%   0% 435s
                    robot.html      947K        111   0%   0% 417s
                           Bot      843K        104   0%   0% 579s
                    istellabot      283K         88   0%  14% 285s
           trendictionbot0.5.0      301K         61   0%  39% 982s
                    PaperLiBot      455K         52   0%  50% 1087s
                DomainStatsBot     1199K         47   0%  14% 137s
                       MagiBot      400K         46   0%   0% 1309s
                    Twitterbot      256K         43   0%   0% 1264s
                     MojeekBot       88K         41   0%   2% 1183s
                      rogerbot      197K         40   0%   0% 1314s
                          bots      207K         40   0%   0% 1396s
                    SEMrushBot      105K         36   0%   0% 1594s
                     coccocbot      149K         27   0%   0% 1748s
                       yacybot       78K         20   0%   5% 2276s
                     RSSingBot      131K         19   0%   0% 2561s
                       MJ12bot       43K         16   0%   0% 3706s
                     BoxcarBot        6K         14   0% 100% 4479s
                        SMTBot      129K         13   0%   0%   8s
                         ICBot       40K         12   0%   0% 2256s
           bot@linkfluence.com       85K         11   0%   0% 4936s
                     Uptimebot       45K         10   0%   0% 5505s
                   SurdotlyBot       30K          8   0%   0%   0s
        YandexAccessibilityBot       63K          7   0%   0% 5470s
                  TweetmemeBot       37K          6   0%   0% 4566s
                       feedbot      639K          6   0%  33% 7430s
                  Laserlikebot       19K          5   0%   0% 9075s
                        ZumBot       72K          5   0%   0%   1s
                          oBot       15K          5   0%   0% 3495s
               Mediatoolkitbot       49K          4   0%   0% 10673s
                    startmebot       38K          4   0%   0% 1991s
                    toot.robot       16K          4   0%   0%   1s
               YandexMobileBot       46K          4   0%   0% 13973s
                     AhrefsBot        9K          4   0%   0% 4874s
                   DuckDuckBot       17K          4   0%   0%   1s
                     SabsimBot       60K          4   0%  25% 3773s
                       ZoomBot       17K          4   0%   0%   0s
                 wiederfreibot       36K          3   0%  66% 21665s
                  OutclicksBot        1K          3   0%   0% 6324s
                   TelegramBot       28K          3   0%   0% 9789s
                      bot.html       46K          3   0%   0% 15600s
                      bitlybot       11K          2   0%   0%   0s
                  botsin.space        7K          2   0%   0% 2114s
                     BublupBot        8K          2   0%   0% 23512s
                    robots.txt        9K          2   0%   0%   0s
                  AwarioRssBot       28K          2   0% 100% 3831s
                    Discordbot       50K          2   0%   0% 9774s
                     redditbot        4K          2   0%   0%   1s
                   newsbots.eu        4K          1   0%   0%   0s
                  OnalyticaBot        4K          1   0%   0%   0s
                    LivelapBot       10K          1   0%   0%   0s
                       Facebot      219K          1   0%   0%   0s
```
Remember, that already takes into account all the bots that don’t crawl my sites because of `robots.txt`.
“Actions” counts hits on URLs containing a query parameter called “action”, which is an indication of a misbehaving bot following links it shouldn’t follow.
“Delay” is the average time between two hits by that bot; it shows whether the bot observes the crawl delay I specify in my `robots.txt`.
If you look at the source code, you’ll see that my log files no longer contain any IP numbers. That’s how I try to protect the privacy of my visitors, even against myself. 🙄
Source code:
```perl
#!/usr/bin/perl
use strict;
use Time::ParseDate;

my $all = grep /--all/, @ARGV;
my %agent;
my %action;
my %bandwidth;
my $actions;
my $hits;
my $bandwidth;
my $bot_bandwidth;
my $bot_hits;
my %first;
my %last;

while (<STDIN>) {
  # example line from my log file
  # www.emacswiki.org:443 - [25/Jun/2019:14:01:04 +0200] "GET /images/logo218x38.png HTTP/1.1" 200 3919 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/75.0.3770.90 Chrome/75.0.3770.90 Safari/537.36"
  # parse line
  m/^(\S+:\d+) (-|admin) \[(.*?)\] "(.*?)" (\d+) (\d+|-) "(.*?)" "(.*?)"/
      or warn "Cannot parse:\n$_" and next;
  my ($host, $user, $time, $request, $code, $bytes, $referrer, $agent) =
      ($1, $2, $3, $4, $5, $6, $7, $8);
  # determine the value of $uri
  my ($method, $uri, $junk) = split(' ', $request, 3);
  warn "Cannot parse: $_\n" unless $host;
  $hits++;
  $bandwidth += $bytes;
  my $domain;
  next unless $all or ($domain) = $agent =~ /([a-z0-9@+.]*bot[a-z0-9@+.]*)/i;
  my $key = $domain;
  $key = $1 if not $key and $agent =~ /https?:\/\/([^ \/()]+)/; # prefer just the domain of the bot
  $key ||= $agent; # fallback: everything
  $agent{$key}++;
  $bandwidth{$key} += $bytes;
  $bot_hits++;
  $bot_bandwidth += $bytes;
  my $date = parsedate($time);
  $first{$key} = $date unless $first{$key};
  $last{$key} = $date;
  if ($uri =~ /action=/i) {
    $actions++;
    $action{$key}++;
  }
}

my @result = sort {$agent{$b} <=> $agent{$a}} keys %agent;

print " ----------------------------Bandwidth-------Hits-------Actions--Delay\n";
printf "%30s %9dM %10d\n", 'Everybody', $bandwidth / 1024 / 1024, $hits;
printf "%30s %9dM %10d %3d%% %3d%%\n", 'All Bots', $bot_bandwidth / 1024 / 1024,
    $bot_hits, 100, 100 * $actions / $bot_hits;
print " ---------------------------------------------------------------------\n";

foreach my $key (@result) {
  my $avg = "";
  if ($first{$key} and $last{$key} and $agent{$key} > 1) {
    $avg = ($last{$key} - $first{$key}) / ($agent{$key} - 1);
  }
  printf "%30s %9dK %10d %3d%% %3d%% %3ds\n",
      $key, $bandwidth{$key} / 1024, $agent{$key},
      100 * $agent{$key} / $bot_hits,
      100 * $action{$key} / $agent{$key}, $avg;
}
```
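In case it isn’t obvious from the code: the script reads the access log on standard input, and passing `--all` makes it count every user agent instead of just the ones matching the bot expression. Something like:

```
bot-detector < /var/log/apache2/access.log
bot-detector --all < /var/log/apache2/access.log
```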
#Bots #Administration #Butlerian Jihad
Blocking robots on your web page – the list of 1800 bad bots
On the one hand, wow! So much blocking! On the other hand, reading the comments it is obvious that there will be the occasional false positive in that list, and that makes me wary, and weary.
Apache config file to block user agents
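I haven’t adopted that config, but for reference, a minimal sketch of the idea on Apache 2.4 might look like this, using `mod_setenvif` to tag the user agents and `mod_authz_core` to refuse them; the bot names here are just examples taken from my own table above:

```apache
# tag requests whose user agent matches a bot I want to refuse
SetEnvIfNoCase User-Agent "serpstatbot" bad_bot
SetEnvIfNoCase User-Agent "MJ12bot"     bad_bot

# and deny those requests everywhere
<Location "/">
  <RequireAll>
    Require all granted
    Require not env bad_bot
  </RequireAll>
</Location>
```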
49.6% of all internet traffic came from bots in 2023, a 2% increase over the previous year, and the highest level Imperva has reported since it began monitoring automated traffic in 2013. – Bots dominate internet activity, account for nearly half of all traffic