2023-12-10 Bots, again

I come home from a friendly meetup and notice that my tiny web server has a load of 80 instead of the usual 0.5. What the hell is going on? I look at the logs and see a single IP number with more than 100 000 hits in the last 24 hours. What are they doing?

Whois tells me it is from the "Alibaba Cloud". Oh yeah? What are the Chinese trying to do on my site?

I start poking around. More and more IP numbers from all over the net show up. Alibaba Cloud, Tencent Cloud.

All right, so I'm blocking some of them individually as I go, but after a while I realize that I probably have to block them at the network level.

For the moment I'm also taking down one of the wikis that's overloading my server.

Just looking at the top 10 offenders for two of my domains, running whois on them to find the entire network they belong to, and checking that it's Alibaba or Tencent:

# Alibaba Cloud 2023-12-10
RewriteCond "%{REMOTE_ADDR}" "-R '47.76.0.0/14'" [or]
RewriteCond "%{REMOTE_ADDR}" "-R '47.80.0.0/13'" [or]
RewriteCond "%{REMOTE_ADDR}" "-R '47.74.0.0/14'" [or]
RewriteCond "%{REMOTE_ADDR}" "-R '47.235.0.0/16'" [or]
RewriteCond "%{REMOTE_ADDR}" "-R '47.246.0.0/16'" [or]
RewriteCond "%{REMOTE_ADDR}" "-R '47.244.0.0/15'" [or]
RewriteCond "%{REMOTE_ADDR}" "-R '47.240.0.0/14'" [or]
RewriteCond "%{REMOTE_ADDR}" "-R '47.236.0.0/14'" [or]
# Tencent Cloud
RewriteCond "%{REMOTE_ADDR}" "-R '42.192.0.0/15'" [or]
RewriteCond "%{REMOTE_ADDR}" "-R '49.232.0.0/14'" [or]
RewriteCond "%{REMOTE_ADDR}" "-R '101.34.0.0/15'" [or]
RewriteCond "%{REMOTE_ADDR}" "-R '43.142.0.0/16'" [or]
RewriteCond "%{REMOTE_ADDR}" "-R '124.220.0.0/14'"
RewriteRule ^ https://alexschroeder.ch/nobots [redirect=410,last]
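
Finding the top offenders in the first place goes something like this; a rough sketch, assuming the usual Apache access log location and format, which may differ on your system:

#! /usr/bin/fish
# A sketch: list the top 10 client IPs in the access log and ask whois
# which network each one belongs to. Log path and format are assumptions.
for ip in (awk '{print $1}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -n 10 | awk '{print $2}')
    echo "== $ip =="
    whois $ip | grep -i -E 'inetnum|netname|cidr|route|org'
end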

If I want this to apply to the entire server and not repeat it for each location, I guess I'll have to move these Apache rewrite rules up into the main server configuration.
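
Perhaps something like this, once, in the main server configuration; just a sketch, assuming Apache 2.4.8 or later for RewriteOptions InheritDownBefore (the virtual hosts probably still need their own RewriteEngine on):

# In the main server config, outside all <VirtualHost> sections.
RewriteEngine on
# Push these rules down into every virtual host, ahead of their own rules.
RewriteOptions InheritDownBefore
# The same conditions as above, abbreviated here:
RewriteCond "%{REMOTE_ADDR}" "-R '47.76.0.0/14'" [or]
RewriteCond "%{REMOTE_ADDR}" "-R '124.220.0.0/14'"
RewriteRule ^ - [redirect=410,last]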

While I'm doing this, I notice a new pattern… My wiki software allows you to fetch a feed for every page. Either it contains the updates to the page (Oddmuse) or the pages it links to (Oddmu). It's for humans.

Of course some shit engineer decided that it was a good idea to scan the web for all the feeds that are out there (so rare! so precious!) and to download them all, forever (uncover the darknet! serve our customers!) and now I have to block IP number ranges, add robot agents to robots.txt files (not all of them provide one), or block user agents (not all of them provide a useful one) and I block and block and block (for the environment! to avoid +2.0°C and the end of human civilization!) and all this while I know that all these shit requests exist out there, for all the sites, everywhere – a hundred thousand requests or more per day, per site, wasting CO₂ – and what am I going to do, kill the feeds for humans because some shit engineer decided to feed a machine?

I'm on the Butlerian Jihad again.

Oh, and Virgin Media is downloading tons of PDFs I'm hosting? Are they looking for copyright violations? On the blocklist they go.

And what's this, Feedly is also downloading feeds like crazy, every few minutes? Slow down, idiots. My news is not important. On the blocklist they go. Or are you trying to train your stupid intelligence? Fuck this AI training stuff. I already use "X-Robots-Tag: noimageai" but I guess I should add even more HTTP headers to block even more engineers overstepping boundaries?
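
For the record, setting that header is a one-liner with mod_headers; noai and noimageai are informal opt-out signals, not a standard, so this only does anything against crawlers that choose to honour them:

# Informal AI opt-out signals; only as good as the crawler's manners.
Header always set X-Robots-Tag "noai, noimageai"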

Ah, and MonitoRSS going into overdrive, from the Amazon Cloud. Really, I don't think there are humans in the Amazon Cloud. Onto the blocklist they go. Well, at least this IP range.

And who's that VelenPublicWebCrawler, zealously collecting pages? Onto the blocklists they go.
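
For the bots that at least send a recognizable user agent, the same kind of rewrite rule works on the agent string; a sketch, with two names from this page standing in for a real list:

# Block by user agent; the pattern is just an example, not a complete list.
RewriteCond "%{HTTP_USER_AGENT}" "VelenPublicWebCrawler|MonitoRSS" [nocase]
RewriteRule ^ - [redirect=403,last]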

Following a lead from StackExchange and looking at the `ipset` manual I see that the type `hash:net` supports banning entire networks!

Here's a `fish` script:

#! /usr/bin/fish

# Use hash:net because of the CIDR stuff
ipset create banlist hash:net
iptables -I INPUT -m set --match-set banlist src -j DROP
iptables -I FORWARD -m set --match-set banlist src -j DROP

# Alibaba 2023-12-10
set -l networks \
    47.76.0.0/14 \
    47.80.0.0/13 \
    47.74.0.0/14 \
    47.235.0.0/16 \
    47.246.0.0/16 \
    47.244.0.0/15 \
    47.240.0.0/14 \
    47.236.0.0/14

# Tencent 2023-12-10
set -a networks \
    42.192.0.0/15 \
    49.232.0.0/14 \
    101.34.0.0/15 \
    43.142.0.0/16 \
    124.220.0.0/14

# OVH 2023-12-10 (Borked Feed Reader)
set -a networks 141.94.0.0/16

# Bell Canada 2023-12-10 (Borked Feed Reader)
set -a networks 142.177.0.0/16
# Amazon 2023-12-10 (MonitoRSS?)
set -a networks 44.192.0.0/11
# Virgin Media 2023-12-10
set -a networks 81.108.0.0/15
# BSO 2023-12-10 (maybe Feedly?)
set -a networks 8.29.192.0/21
# ASAHI 2023-12-10 (idiot bot getting 1PDC PDF repeatedly)
set -a networks 220.146.0.0/16

for network in $networks
    ipset add banlist $network
end
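
A quick check that the set contains what I think it contains, using one of the Alibaba addresses as a test case:

ipset list banlist | head
ipset test banlist 47.76.1.1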

#Bots #Butlerian Jihad #Administration

See ban-cidr for the latest version of that script.

I don’t like the idea of my work being hoovered up to train “AI” data models; I don’t like that these companies assume my content’s available to them by default, and that I have to opt out of their scraping; I really don’t want anything I write to support these platforms, which I find unethical, extractive, deeply amoral, and profoundly anti-human. … I polled a few different sources to build a list of currently-known crawler names. … If the crawler’s name appears anywhere in that gnarly-looking list of user agents, my site should block it. – Blockin’ bots.

And Huawei Cloud gets added to ban-cidr as well:

# Huawei Cloud
set -a networks 101.44.248.0/22 \
    124.243.128.0/18 \
    49.0.200.0/21 \
    159.138.0.0/16 \
    94.74.80.0/20 \
    190.92.192.0/19 \
    114.119.172.0/22