First day back home after a nice two-week holiday in the mountains. I’m looking at my web server logs. 😠
I’m thinking about blocking all bots from my website. But where to start? How about this: check the access.log file (I use Apache as my web server). If the User Agent field contains the word “bot”, that sounds like a candidate? Let’s see!
First, let’s pick 24h worth of log files.
grep ^alexschroeder /var/log/apache2/access.log.1 | wc -l
15875
This is yesterday’s log file and it has about 16k hits for my site.
perl -ne 'print "$1\n" if /"([^"]*bot[^"]*)"$/i' \
  < /var/log/apache2/access.log.1 \
  | sort | uniq -c | sort -n | tail
This takes a line from the access.log like this: `www.emacswiki.org:443 66.249.64.38 - - [16/Jul/2023:00:00:44 +0200] "GET /emacs/SmtpAuth HTTP/1.1" 200 11056 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"` and checks if the last thing in double quotes contains the word “bot”.
Manually massaging the last ten results:
 26 Pleroma
 36 DataForSeoBot 😒
 47 Gwene
 50 DotBot → Moz → SEO 😒
 84 Googlebot-Image 😒
139 YandexBot 😒
189 Googlebot 😒
254 bingbot 😒
333 Googlebot 😒
Gwene is an RSS-to-News thing, but it also has Googlebot in its name? Weird!
I’m already blocking all Pleroma, Mastodon and Friendica because of the useless previews they try to generate, but the hits still count, of course.
RewriteEngine on
# Fediverse instances asking for previews: protect the expensive endpoints
RewriteCond %{REQUEST_URI} /(wiki|cgit|download|food|paper|hug|helmut|input|korero|check|radicale|say|mojo|software)
RewriteCond %{HTTP_USER_AGENT} Mastodon|Friendica|Pleroma [nocase]
# then it's forbidden
RewriteRule ^(.*)$ - [forbidden,last]
What’s weird is what happens when I group by all my websites. The log file I am looking at has about 163k hits in total. If I count and group per site:
perl -ne 'print "$1 $2\n" if /^([^:]*).*"([^"]*bot[^"]*)"$/i' \
  < /var/log/apache2/access.log.1 \
  | sort | uniq -c | sort -n | tail
The last ten results, massaged:
  592 campaignwiki.org Googlebot 😒
  694 www.emacswiki.org YandexBot 😒
  729 flying-carpet.ch MJ12bot 😒
  959 www.emacswiki.org Gwene 😒
 1737 www.emacswiki.org magpie-crawler 😒
 1951 www.emacswiki.org bingbot 😒
 2198 www.emacswiki.org Googlebot 😒
 2807 www.emacswiki.org Googlebot 😒
 2878 www.emacswiki.org EyeMonIT Uptime Bot 😒
13728 www.emacswiki.org SeekportBot 😒
So much shit that needs blocking!
OK, I don’t feel like manually editing the result list any more. And I don’t feel like looking at bots that I’m already blocking. The following checks that the log line contains a 200 surrounded by spaces (meaning an “OK” status response instead of some sort of error code), extracts the word containing “bot”, and prints it.
perl -ne 'print "$2\n" if / 200 .*"([^"]*?([a-z]*bot[a-z]*)[^"]*)"$/i' \
  < /var/log/apache2/access.log.1 \
  | sort | uniq -c | sort -n | tail
  99 DotBot
 123 MojeekBot
 157 AwarioBot
 283 SemanticScholarBot
 430 Applebot
 838 bot
1349 Bot
2603 bingbot
4524 Googlebot
8349 SeekportBot
A more interesting list! Let’s see.
An interesting question: What do you know about the Mojeek search engine? I like independent search engines!
What about this one? “Brand management made simple. Track the conversations about your business across social media, news, blogs, videos, forums, and reviews.” The Awario bot gets blocked for sure!
“Scholar” sounds great but this sounds like a scam: “The Semantic Scholar bot crawls certain domains to find academic PDFs. These PDFs are served on semanticscholar.org (opens in a new tab) so researchers can discover and understand other academic accomplishments.” The PDFs are served on a different domain? Sounds like copyright violation to me. Or perhaps the scientific journals are looking for pirate copies? In any case, since I don’t write academic PDFs, this bot gets kicked in the butt.
OK, so what about Seekport? It seems to be a German search engine. I like independent search engines. But if you click around a bit, you find this self-description (in German): “Since 2003 the central hub for current SEO news, in-depth data analyses & opinions on current trends in the platform economy.” The source of current SEO news since 2003… ?? Ugh! Bloooooock.
Right now, personal sites like my diary get this piece of code in their top-level “.htaccess” file since Google and company aren’t going to return my pages as part of their results, anyway. At least they won’t be used for AI training!
RewriteEngine on
# Deny all bots
RewriteCond %{HTTP_USER_AGENT} "bot" [nocase]
RewriteRule ^ nobots.html [last]
The nobots page simply says to contact me if there is a problem.
The default for all my sites goes into “/etc/apache2/conf-enabled/blocklist.conf” and says:
RewriteEngine on
# SEO bots and other shit (for Emacs Wiki)
RewriteCond "%{HTTP_USER_AGENT}" "pcore|megaindex|semrushbot|wiederfrei|eyemonit|yandexbot|magpie-crawler|mj12bot|seekportbot|dotbot|awariobot|semanticscholarbot|seokicks-robot|ahrefsbot|trendictionbot|linkfluence|startmebot|dataforseobot" [nocase]
RewriteRule ^(.*)$ - [forbidden,last]
This is what the non-personal sites like Emacs Wiki get.
Oh, and let’s not forget the people looking for misconfigured admin consoles written in PHP:
RewriteEngine on
# Deny all idiots that are looking for borked PHP applications
# Status Code 402 is "Payment Required".
RewriteRule \.php$ - [redirect=402]
I really need to work on the code that blocks entire ASNs.
#Web #Administration #Bots #Butlerian Jihad
(Please contact me if you want to remove your comment.)
⁂
Is any bot respecting robots.txt? Half serious question 🙂
– jjm 2023-07-16 19:33 UTC
---
I know that my blocklist.conf file started because of bots that disregarded robots.txt files. These days I often wonder: where’s the easiest place to update a list just once and never have to deal with it again? And instant blocking seems like the easier solution since I have Apache server config access on my server.
Also, some bots are listed here as not checking robots.txt at all, e.g. facebot:
Is the robots.txt read? No. – robots db
– Alex 2023-07-16 20:24 UTC
---
OK, what about the long tail, though?
perl -ne 'print "$2\n" if / 200 .*"([^"]*?([a-z]*bot[a-z]*)[^"]*)"$/i' < /var/log/apache2/access.log.1 | sort | uniq -c | sort -n
    1 BaudBot
    1 bobbinsrobots
    1 bottle
    1 cabotcove
    1 FullStoryBot
    1 GoogleBot
    1 LivelapBot
    1 PodheroBot
    1 redditbot
    1 robot
    1 robotics
    1 TelegramBot
    1 URLSuMaBot
    1 WebwikiBot
    1 YandexBot
    1 YandexRenderResourcesBot
    2 googlebot
    2 jaddjabot
    2 Pinterestbot
    2 PodBotLP
    2 Robot
    2 Semanticbot
    2 Wibybot
    2 ZumBot
    3 DuckDuckBot
    3 PixelFedBot
    3 SeznamBot
    3 SurdotlyBot
    3 WellKnownBot
    4 ArchiveBot
    4 AwarioSmartBot
    4 Newslitbot
    4 SerendeputyBot
    5 AcademicBotRTU
    5 BitSightBot
    5 Mediatoolkitbot
    7 tapbots
    7 TheFeedReaderBot
    7 trendictionbot
    8 Discordbot
    8 SiteAuditBot
   12 AhrefsBot
   14 PetalBot
   15 feedbot
   16 BLEXBot
   19 coccocbot
   19 PaperLiBot
   19 yacybot
   22 botsin
   24 startmebot
   24 SummalyBot
   26 Bingbot
   29 ZoominfoBot
   35 bots
   36 Twitterbot
   38 Elisabot
   43 FeedlyBot
   44 MojeekBot
   50 DomainStatsBot
   64 DataForSeoBot
   77 Facebot
   85 DotBot
  429 SemanticScholarBot
  467 bot
  540 Applebot
  932 Bot
 2587 bingbot
 5032 Googlebot
18369 SeekportBot
Some of them need investigation!
First, the bots that might be OK for “public service” sites like Emacs Wiki, Community Wiki, Campaign Wiki, Oddmuse.
Feed readers are OK as long as they don’t use AI and as long as they just get the feed.
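One way to express that in Apache would be to let everything fetch the feed URLs before the bot rules kick in. This is only a sketch, and the /feed and /rss paths are made-up examples, not my actual setup:

RewriteEngine on
# Hypothetical feed paths: let anybody fetch these, even bots ...
RewriteCond %{REQUEST_URI} ^/(feed|rss)
RewriteRule ^ - [last]
# ... and only then apply the bot rule
RewriteCond %{HTTP_USER_AGENT} bot [nocase]
RewriteRule ^ nobots.html [last]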
Bots that get blocked:
– Alex 2023-07-16 20:59 UTC
---
I’m torn about search engines. On the one hand, they provide a service by allowing end users to search; on the other hand, they are part of surveillance capitalism, tracking visitors and selling information to companies. As time passes, the deal appears to be ever more in my disfavour: my sites cannot be found, and yet they are used to train the large language models of our inane enshittified future, where the world is full of garbage produced using them.
So, ideally, I only want to serve information to services that are entirely focused on the public service.
– Alex 2023-07-17 07:22 UTC
---
If you visit one of these search engines with Firefox, note how the Search bar changes to add a little green plus sign next to the magnifying glass. That’s how you add the search engine you’re looking at to your collection!
– Alex 2023-07-17 14:31 UTC
---
Years ago I wrote a tool to parse the Apache log files and print out fields requested (if you want, I can mail you a copy of the program). I just ran the tool over the log file from my blog for June:
[spc]brevard:~/web/logs.archive/2023/06>escanlog -agent boston.conman.org | sort | uniq -c | sort -rn | more
24915 Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 Firefox/78.0
21404 Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)
18740 Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
12324 Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)
 9500 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
 9380 Mozilla/5.0 (compatible; SeekportBot; +https://bot.seekport.com)
 8806 WF search/Nutch-1.12
 8053 CommaFeed/2.6.0 (https://github.com/Athou/commafeed)
 6657 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
 6558 Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0
 6120 Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)
 5663 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/103.0.5060.134 Safari/537.36
 5464 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36
 5078 Newsboat/2.31.0 (Linux x86_64)
 4775 CCBot/2.0 (https://commoncrawl.org/faq/)
 4450 Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com)
 4196 Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/100.0.4889.0 Safari/537.36
 3545 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot)
 3170 Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36
 2914 Mozilla/5.0 (compatible; Miniflux/2.0.44; +https://miniflux.app)
 2766 Tiny Tiny RSS/23.05-a4543de (https://tt-rss.org/)
 2324 Mozilla/5.0 (compatible; Barkrowler/0.9; +https://babbar.tech/crawler)
 2126 Tiny Tiny RSS/UNKNOWN (Unsupported, Git error) (https://tt-rss.org/)
 2107 Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)
 2044 Mozilla/5.0 (compatible; Miniflux/2.0.43; +https://miniflux.app)
 1985 facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
Not all of them use “bot” in their name, some use “crawler”, and at least one uses “spider”. I feel like banning bots is a lot like “whack-a-mole”—for every one you knock down, another pops up. For me, as long as they make valid requests, it’s fine, but I did get one of the worst offenders to stop crawling me.
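For what it’s worth, the Perl one-liner from the post could be widened to catch those as well; a rough sketch, assuming the same Apache log format:

perl -ne 'print "$2\n" if / 200 .*"([^"]*?([a-z]*(?:bot|crawler|spider)[a-z]*)[^"]*)"$/i' \
  < /var/log/apache2/access.log.1 \
  | sort | uniq -c | sort -n | tail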
– Sean Conner 2023-07-17 18:21 UTC
---
Oh yes, MJ12Bot has been on the banned list for ages!
– Alex 2023-07-17 20:24 UTC
---
First: I’m enjoying reading your comments, which read like some sort of internal dialogue 🙂
My approach is more relaxed, and I don’t take action unless there’s something affecting the service, which is very rare. I tend to throttle those via iptables (no matter whether it is a bot or not), which is reactive and not very effective.
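For the record, one way to do that kind of reactive throttling with plain iptables is the “recent” module; just a sketch with made-up numbers, not the exact rules I use:

# drop an address that opens too many new connections to the web ports ...
iptables -A INPUT -p tcp -m multiport --dports 80,443 -m state --state NEW \
  -m recent --name WEB --update --seconds 60 --hitcount 20 -j DROP
# ... and record every new connection
iptables -A INPUT -p tcp -m multiport --dports 80,443 -m state --state NEW \
  -m recent --name WEB --set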
Are you planning to collect this somewhere for people to find? I have found some blocklists for Apache and nginx, but they all seem too old to be useful, and then there’s https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker
– jjm 2023-07-18 07:21 UTC
---
A long time ago I had copied a similar list: 2020-12-22 Apache config file to block user agents. But I think it started breaking around the time browser versions reached 100, and I decided to abolish it. So right now, all I have is this:
RewriteEngine on

# Fediverse instances asking for previews: protect the expensive endpoints
RewriteCond %{REQUEST_URI} /(wiki|cgit|download|food|paper|hug|helmut|input|korero|check|radicale|say|mojo|software)
RewriteCond %{HTTP_USER_AGENT} Mastodon|Friendica|Pleroma [nocase]
# then it's forbidden
RewriteRule ^(.*)$ - [forbidden,last]

# SEO bots and other shit (for Emacs Wiki)
RewriteCond "%{HTTP_USER_AGENT}" "academicbotrtu|ahrefsbot|awariobot|bitsightbot|blexbot|dataforseobot|discordbot|domainstatsbot|dotbot|elisabot|eyemonit|facebot|linkfluence|magpie-crawler|megaindex|mediatoolkitbot|mj12bot|newslitbot|paperlibot|pcore|petalbot|pinterestbot|seekportbot|semanticscholarbot|semrushbot|semanticbot|seokicks-robot|siteauditbot|startmebot|summalybot|synapse|trendictionbot|twitterbot|wiederfrei|yandexbot|zoominfobot" [nocase]
RewriteRule ^(.*)$ - [forbidden,last]

# Deny all idiots that are looking for borked PHP applications
# Status Code 402 is "Payment Required".
RewriteRule \.php$ - [redirect=402]

# Private sites block all bots and crawlers. This list does not include
# social.alexschroeder.ch, communitywiki.org, www.emacswiki.org,
# oddmuse.org, orientalisch.info, korero.org.
RewriteCond "%{HTTP_HOST}" "^(alexschroeder\.ch|flying-carpet\.ch|next\.oddmuse\.org|((chat|talk)\.)?campaignwiki\.org|((archive|vault|toki|xn--vxagggm5c)\.)?transjovian\.org)$" [nocase]
RewriteCond "%{HTTP_USER_AGENT}" "!archivebot|^gwene" [nocase]
RewriteCond "%{HTTP_USER_AGENT}" "bot|crawler" [nocase]
RewriteRule ^ https://alexschroeder.ch/nobots [redirect,last]
The last section can be part of a top-level “.htaccess” file if your site is static, but in my case the site config file has Apache act as a reverse proxy for a bunch of URLs such as “/wiki”. Thus, the request is handed off to the wiki server before the “.htaccess” file in the document directory is read. To prevent that, the rules need to be in the global web server configuration. And I think I like it better this way: all the blocking stuff is in this one “blocklist.conf” file.
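For context, the reverse-proxy lines in the site config look something like this (a sketch only; the backend port here is made up):

# mod_proxy hands /wiki to the wiki backend before any .htaccess is consulted
ProxyPass /wiki http://localhost:8080/wiki
ProxyPassReverse /wiki http://localhost:8080/wiki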
– Alex 2023-07-18 09:22 UTC
---
*Gwene is an RSS-to-News thing, but it also has Googlebot in its name? Weird!*
I think that was added because some servers allow requests if that specific string is in the User-Agent, but not otherwise (probably also allowing “real browser”-User-Agents, but they are harder to guess/emulate).
Regarding whack-a-mole: I use fail2ban to block obvious offenders automatically - I match on requests for e.g. anything with .php in the URL, WordPress admin URLs, failed blog comment attempts, script-kiddie requests, etc.
In the big picture it makes no difference, but it runs automatically and it feels nice every time I see a bot “banned”.
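A filter for that could look roughly like this; just a sketch for a standard Apache access log, the actual patterns will differ:

[Definition]
# ban anyone requesting PHP files or WordPress admin URLs
failregex = ^<HOST> .*"(GET|POST) [^"]*(\.php|wp-login|wp-admin|xmlrpc)[^"]*"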
– Adam Sjøgren 2023-07-18 10:26 UTC
---
Yeah! I also use fail2ban to ban IP numbers that request too many pages for a certain time window: 2019-01-20 fail2ban to watch over my sites.
In “/etc/fail2ban/filter.d/alex-apache.conf”:
[Definition]
# ANY match in the logfile counts!
failregex = ^[^:]+:[0-9]+ <HOST>
And “/etc/fail2ban/jail.d/alex.conf”:
[alex-apache]
enabled = true
port = http,https
logpath = %(apache_access_log)s
findtime = 40
maxretry = 20

[recidive]
enabled = true
The effect is that anybody clicking “too fast” (more than 20 pages in 40 seconds) gets banned for 10 minutes; people who get banned this way three times in one day get banned for one week. (These are the default “recidive” settings.)
The only thing I’d like to add is that if three IP numbers from the same ASN are banned, the entire ASN should be banned for a week.
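I haven’t written that part, but the building blocks would probably look something like this: map a banned IP number to its ASN, pull the prefixes registered for that ASN, and drop them all. A sketch only, with a made-up address, relying on the Team Cymru and RADb whois services and on ipset:

ip=198.51.100.7
# which ASN announces this address?
asn=$(whois -h whois.cymru.com " -v $ip" | awk -F'|' 'NR==2 { gsub(/ /,"",$1); print $1 }')
# collect the prefixes registered for that ASN and block them all
ipset create banned-asn hash:net -exist
whois -h whois.radb.net -- "-i origin AS$asn" \
  | awk '/^route:/ { print $2 }' \
  | while read net; do ipset add banned-asn "$net" -exist; done
iptables -I INPUT -m set --match-set banned-asn src -j DROP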
– Alex 2023-07-18 10:39 UTC
---
Ooooh, this is terrible, so many bots after bots after bots... I see Applebot, what’s that?
As for Petal, I am surprised that it does not work right now; it is the default search engine on Huawei phones - not sure if evil, but Chinese, that’s for sure 😀
– Peter Kotrčka 2023-07-18 21:38 UTC
---
Applebot is the web crawler for Apple. Products like Siri and Spotlight Suggestions use Applebot. – About Applebot
If I click on petalsearch.com, nothing happens – pings are returned but there is no website coming up. I don’t know. Perhaps they’re checking the user agent string, too, answering only for Huawei phones? 😀
– Alex 2023-07-19 06:21 UTC
---
@sam posts:
User-agent: GPTBot
Disallow: /
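For sites where robots.txt gets ignored anyway, the same user agent could simply go onto the Apache blocklist as well, for example:

RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} GPTBot [nocase]
RewriteRule ^(.*)$ - [forbidden,last]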
– Alex 2023-08-12 12:38 UTC