2025-03-21 A summary of my bot defence systems

If you've followed my Butlerian Jihad pages, you know that I'm constantly fiddling with the setup. Each page got written in the middle of an attack, while I was trying to save my sites, documenting as I went along. But if you're looking for an overview, there's nothing to see: it's all over the place. Since the topic has gained some traction in recent days, I'm going to assemble all the things I do on this page.

Here's Drew DeVault complaining about the problem that system administrators have been facing for a while now:

If you think these crawlers respect `robots.txt` then you are several assumptions of good faith removed from reality. These bots crawl everything they can find, `robots.txt` be damned, including expensive endpoints like git blame, every page of every git log, and every commit in every repo, and they do so using random User-Agents that overlap with end-users and come from tens of thousands of IP addresses – mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure – actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic. -- Please stop externalizing your costs directly into my face, by Drew DeVault, for SourceHut

I had read some similar reports before, on fedi, but this one links to quite a few of them: FOSS infrastructure is under attack by AI companies, by Niccolò Venerandi, for LibreNews.

I'm going to skip the defences against spam as spam hasn't been a problem in recent months, surprisingly.

The first defence against bots is `robots.txt`. All well-behaved bots are supposed to fetch it every now and then and either stop crawling the site or slow down.

Let's look at the file for Emacs Wiki.

If I find a lot of requests from a particular user agent that looks like a bot, and the user agent string contains a URL where I can find instructions for how to address it in `robots.txt`, then that's what I do: I tell it to stop crawling the entire site. Most of these are search engine optimizers, brand awareness monitors and other such creeps.

The file also tells all well-behaved crawlers to slow down to a glacial tempo, and it lists all the expensive endpoints that they should not be crawling at all. Conversely, any bot that still crawls those URLs is a misbehaving bot and deserves to be blocked.
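
To give a rough idea without reproducing the whole file, the structure is something like the following sketch. The bot names come from my block lists, but the crawl delay and the disallowed paths here are just placeholders, not the actual Emacs Wiki endpoints:

# Named bots that I want gone entirely
User-agent: SemrushBot
Disallow: /

User-agent: AhrefsBot
Disallow: /

# Everybody else: glacial tempo, and keep away from the expensive endpoints
User-agent: *
Crawl-delay: 20
Disallow: /wiki?action=
Disallow: /wiki?search=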

Worth noting, perhaps, that an "expensive endpoint" is a URL that runs some executable to do something complicated, resulting in an answer that's always different. If the URL causes the web server to run a CGI script, for example, the request loads Perl, loads the script, loads all its libraries, compiles it all, runs it once, and answers the request with the output. And since the answer is dynamic, it can't very well be cached; either that, or additional complexity needs to be introduced and even more resources need to be allocated and paid for. In short, an expensive endpoint is like loading an app: slow but useful, if done rarely. Doing this for a human is fine. It's a disaster if bots swarm all over the site, clicking on every link.

It's also worth noting that not all my sites have the same expensive endpoints, so the second half of `robots.txt` varies from site to site, which makes maintaining the first half a chore. I have a little script that allows me to add one bot to "all" the files, but it's annoying to have to do that. And recently I just copied a list from the AI / LLM User-Agents: Blocking Guide.
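
That script isn't reproduced here, but a minimal sketch of the idea, assuming all the robots.txt files can be found with a single glob (the path below is made up), might look like this:

#!/bin/sh
# add-bot: prepend a "stop crawling" entry for one bot to every robots.txt
# Usage: add-bot SomeNewBot
# The glob is an assumption; adjust it to wherever the files actually live.
bot="$1"
for f in /home/alex/*/robots.txt; do
    { printf 'User-agent: %s\nDisallow: /\n\n' "$bot"; cat "$f"; } > "$f.tmp" \
        && mv "$f.tmp" "$f"
done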

I use Apache as my web server, and I have a bunch of global configuration files to handle misbehaving bots and crawlers.

This example blocks fediverse user agents from accessing the expensive endpoints on my sites. That's because whenever anybody posts a URL to one of my sites, within the next 60 seconds every server whose users get a copy of that post fetches a preview. That means hundreds of hits, which is particularly obnoxious for expensive endpoints. The response here tells them that they are forbidden from accessing the page (403).

# Fediverse instances asking for previews: protect the expensive endpoints
RewriteCond %{REQUEST_URI} /(wiki|download|food|paper|hug|helmut|input|korero|check|radicale|say|mojo|software)
RewriteCond %{HTTP_USER_AGENT} Mastodon|Friendica|Pleroma [nocase]
# then it's forbidden
RewriteRule ^(.*)$ - [forbidden,last]

Then there are the evil bots that self-identify as bots but don't seem to heed the `robots.txt` files. These are all told that whatever page they were looking for, it's now gone (410). And if there's a human looking at the output, it even links to an explanation. Adding new user agents to this list is annoying because I need to connect as root and restart the web server after making any changes.

# SEO bots, borked feed services and other shit
RewriteCond "%{HTTP_USER_AGENT}" "academicbotrtu|ahrefsbot|amazonbot|awariobot|bitsightbot|blexbot|bytespider|dataforseobot|discordbot|domainstatsbot|dotbot|elisabot|eyemonit|facebot|linkfluence|magpie-crawler|megaindex|mediatoolkitbot|mj12bot|newslitbot|paperlibot|pcore|petalbot|pinterestbot|seekportbot|semanticscholarbot|semrushbot|semanticbot|seokicks-robot|siteauditbot|startmebot|summalybot|synapse|trendictionbot|twitterbot|wiederfrei|yandexbot|zoominfobot|velenpublicwebcrawler|gpt|\bads|feedburner|brandwatch|openai|facebookexternalhit|yisou|docspider" [nocase]
RewriteRule ^ https://alexschroeder.ch/nobots [redirect=410,last]

For some of my sites, I disallow all user agents containing the words "bot", "crawler", "spider", "ggpht" or "gpt", with the exception of "archivebot", "wibybot" and user agents starting with "gwene", because I want to give those access. Again, these bots are all told that whatever page they were looking for, it's now gone (410).

# Private sites block all bots and crawlers. This list does not include
# social.alexschroeder.ch, communitywiki.org, www.emacswiki.org,
# oddmuse.org, orientalisch.info, korero.org.
RewriteCond "%{HTTP_HOST}" "^((src\.)?alexschroeder\.ch|flying-carpet\.ch|next\.oddmuse\.org|((chat|talk)\.)?campaignwiki\.org|((archive|vault|toki|xn--vxagggm5c)\.)?transjovian\.org)$" [nocase]
RewriteCond "%{HTTP_USER_AGENT}" "!archivebot|^gwene|wibybot" [nocase]
RewriteCond "%{HTTP_USER_AGENT}" "bot|crawler|spider|ggpht|gpt" [nocase]
RewriteRule ^ https://alexschroeder.ch/nobots [redirect=410,last]

I also eliminate a lot of bots looking for PHP endpoints. I can do this because I know that I don't have any PHP application installed.

# Deny all idiots that are looking for borked PHP applications
RewriteRule \.php$ https://alexschroeder.ch/nobots [redirect=410,last]

There's also one particular image scraper that's using a unique string in its user agent.

# Deny the image scraper
# https://imho.alex-kunz.com/2024/02/25/block-this-shit/
RewriteCond "%{HTTP_USER_AGENT}" "Firefox/72.0" [nocase]
RewriteRule ^ https://alexschroeder.ch/nobots [redirect=410,last]

Next, all requests get logged by Apache in the `access.log` file, and I use `fail2ban` to check this logfile. This is somewhat interesting because `fail2ban` is usually used to check for failed ssh login attempts: IP numbers that fail to log in a few times get banned. What I did is write a filter that treats every hit on the web server as a "failed login attempt".

This is the filter:

[Definition]
# Most sites in the logfile count! What doesn't count is fedi.alexschroeder.ch, or chat.campaignwiki.org.
failregex = ^(www\.)?(alexschroeder\.ch|campaignwiki\.org|communitywiki\.org|emacswiki\.org|flying-carpet\.ch|korero\.org|oddmuse\.org|orientalisch\.info):[0-9]+ <HOST> 

# Except css files, images...
ignoreregex = ^[^"]*"(GET /(robots\.txt |favicon\.ico |[^/ \"]+\.(css|js) |[^\"]*\.(jpg|JPG|png|PNG) |css/|fonts/|pdfs/|txt/|pics/|export/|podcast/|1pdc/|static/|munin/|osr/|indie/|rpg/|face/|traveller/|hex-describe/|text-mapper/|contrib/pics/|roll/|alrik/|wiki/download/)|(OPTIONS|PROPFIND|REPORT) /radicale)

And this is the jail, saying that any IP number may make at most 30 hits in 60 seconds. If an IP number exceeds this (that's one hit every two seconds!), it gets blocked at the firewall for 10 minutes.

[alex-apache]
enabled = true
port    = http,https
logpath = %(apache_access_log)s
findtime = 60
maxretry = 30

I also have another filter for a particular substring in URLs that I found bots requesting all the time:

[Definition]
failregex = ^(www\.emacswiki\.org|communitywiki\.org|campaignwiki\.org):[0-9]+ <HOST> .*rcidonly=

The corresponding jail says that when you request such a URL for the third time in an hour, you're blocked at the firewall for 10 minutes.

[alex-bots]
enabled = true
port    = http,https
logpath = %(apache_access_log)s
findtime = 3600
maxretry = 2

(At the same time, these URLs redirect to a warning so that humans know that this is a trap.)

Furthermore, `fail2ban` also comes with a `recidive` filter that watches its own logs. If an IP has been banned five times in a day, it gets banned for a week.

[recidive]
enabled = true
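
That single line is enough because the stock settings shipped with `fail2ban` already do what I just described. Spelled out, the effective jail looks roughly like this; the values are the distribution defaults from `jail.conf`, quoted from memory, not my own configuration:

[recidive]
enabled   = true
logpath   = /var/log/fail2ban.log
banaction = %(banaction_allports)s
bantime   = 1w
findtime  = 1d
maxretry  = 5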

To add to the `alex-bots` jail, my Apache configuration also says that RSS feeds for single pages are errors:

RewriteCond %{QUERY_STRING} action=rss
RewriteCond %{QUERY_STRING} rcidonly=.*
RewriteRule .* /error.rss [last]

Note that all my sites also use the following headers, so anybody ignoring these is also a prime candidate for blocking.

# https://github.com/rom1504/img2dataset#opt-out-directives
# use "add" instead of "set" so that both values actually get sent
Header always add X-Robots-Tag "noai"
Header always add X-Robots-Tag "noimageai"

All of the above still doesn't handle extremely distributed attacks, where almost all IP numbers are unique. What I try to do in that situation is block the entire IP ranges the requests come from. I scan the `access.log` for IP numbers that requested a URL which `robots.txt` tells bots to stay away from, namely URLs containing `rcidonly`: humans very rarely click those, and they are expensive to serve. For each such IP number, I determine the IP range it comes from, and then I block the whole range.

Basically, this is what I keep repeating:

# prefix with a timestamp
date
# log some candidates without whois information, skipping my fedi instance
tail -n 2000 /var/log/apache2/access.log \
 | grep -v ^social \
 | grep "rcidonly" \
 | bin/admin/network-lookup-lean > result.log
# count the generated ipset commands
grep ipset result.log | wc -l
# run them, blocking the ranges at the firewall
grep ipset result.log | sh
# document the new bans
grep ipset result.log >> bin/admin/ban-cidr
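
For reference, the lines that `grep ipset` picks out of `result.log` are plain shell commands that add whole ranges to an `ipset` set checked by the firewall. The set name and the range here are made up, just to show the shape of the output:

# hypothetical example; the real set name depends on my firewall setup
# and the range comes from the whois lookup
ipset add banlist 198.51.100.0/24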

You can find the scripts in my admin collection.

​#Administration ​#Butlerian Jihad

By my count, I already had to unblock three networks on that list. It's not a great solution, to be honest. And the blocks don't expire, either. The list still contains 47021 IP ranges.