At midnight, there was a surge in activity. CPU usage went up.
Load went up, too. But it stayed within reasonable bounds -- less than 4 instead of the more than 80 I have seen in the past.
And the number of IP addresses blocked by `fail2ban` went from 40 to 50.
I'm usually sceptical that these numbers mean much, because the big attacks come from a far wider variety of IP numbers. In this case, however, maybe there was some probing that resulted in blocks? I don't know. Lucky, I guess?
In any case, the site is still up. Yay for small wins.
Also, I cannot overstate how good it feels to have some Munin graphs available.
`alex-bots` is a setup I described in 2025-02-19 Bots again, cursed. Basically, a request to one of my Oddmuse wikis containing the parameter `rcidonly` hits an expensive endpoint: "all changes for this single page" or "a feed for this single page". This is something a human would rarely access, and yet somehow these URLs landed in some dataset for AI training, I suspect. So what I do is redirect any request containing "rcidonly" in the query string to `/nobots`, warning humans not to click on these links.
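For reference, the redirect itself is just a rewrite rule. A minimal sketch of the kind of rule I mean, assuming Apache with mod_rewrite enabled (the actual rule in my site config may differ):

```
# Anything with rcidonly in the query string gets sent to the warning page.
# QSD (drop the query string from the redirect target) needs Apache 2.4.
RewriteEngine on
RewriteCond %{QUERY_STRING} rcidonly
RewriteRule .* /nobots [R,L,QSD]
```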
In addition to that, the filter `/etc/fail2ban/filter.d/alex-bots.conf` contains this:
```
[Definition]
failregex = ^(www\.emacswiki\.org|communitywiki\.org|campaignwiki\.org):[0-9]+ <HOST> .*rcidonly=
```
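The leading hostname and port are there because the access log uses a vhost-style format where the virtual host and port come first, followed by the client IP that `<HOST>` captures. A made-up example of the kind of line this is meant to match (IP address and page name are placeholders):

```
www.emacswiki.org:443 203.0.113.7 - - [22/Mar/2025:00:05:12 +0100] "GET /wiki?action=rss;rcidonly=SomePage HTTP/1.1" 200 4321 "-" "Mozilla/5.0"
```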
And I added a section using this filter to my jail `/etc/fail2ban/jail.d/alex.conf`:
```
[alex-bots]
enabled = true
port = http,https
logpath = %(apache_access_log)s
findtime = 3600
maxretry = 2
```
So if an IP number visits two URLs containing "rcidonly" within an hour, it gets banned for ten minutes (the default ban time, since the jail doesn't set its own).
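To check that the filter and jail actually do what I think they do, the standard tools are enough; something along these lines, with the log path adjusted to wherever the access logs live:

```
# Dry-run the filter against an access log and count the matches.
fail2ban-regex /var/log/apache2/access.log /etc/fail2ban/filter.d/alex-bots.conf

# Once the jail is running, see how many IP numbers are currently banned.
fail2ban-client status alex-bots
```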
The `recidive` filter (a standard filter you just need to activate) then makes sure that any IP number that got blocked three times gets blocked for a week.
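I haven't shown that section, but a sketch of such a jail entry, assuming the numbers above (three bans within a day gets you banned for a week; the stock defaults in `jail.conf` differ slightly):

```
[recidive]
enabled = true
logpath = /var/log/fail2ban.log
findtime = 1d
maxretry = 3
bantime = 1w
```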
#Administration #Butlerian Jihad
If you think these crawlers respect `robots.txt` then you are several assumptions of good faith removed from reality. These bots crawl everything they can find, `robots.txt` be damned, including expensive endpoints like git blame, every page of every git log, and every commit in every repo, and they do so using random User-Agents that overlap with end-users and come from tens of thousands of IP addresses – mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure – actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic. -- Please stop externalizing your costs directly into my face, by Drew DeVault, for SourceHut
Then, yesterday morning, KDE GitLab infrastructure was overwhelmed by another AI crawler, with IPs from an Alibaba range; this caused GitLab to be temporarily inaccessible by KDE developers. I then discovered that, one week ago, an Anime girl started appearing on the GNOME GitLab instance, as the page was loaded. It turns out that it's the default loading page for Anubis, a proof-of-work challenger that blocks AI scrapers that are causing outages. -- FOSS infrastructure is under attack by AI companies, by Niccolò Venerandi, for LibreNews
What do SourceHut, GNOME’s GitLab, and KDE’s GitLab have in common, other than all three of them being forges? Well, it turns out all three of them have been dealing with immense amounts of traffic from “AI” scrapers, who are effectively performing DDoS attacks with such ferocity it’s bringing down the infrastructures of these major open source projects. Being open source, and thus publicly accessible, means these scrapers have unlimited access, unlike with proprietary projects. … Everything about this “AI” bubble is gross, and I can’t wait for this bubble to pop so a semblance of sanity can return to the technology world. Until the next hype train rolls into the station, of course. -- FOSS infrastructure is under attack by AI companies, by Thom Holwerda, for OSnews
He links to the IP to ASN Mapping Service by Team Cymru. I started switching my network-lookup script to use it because it also supports IPv6. Something I haven't done yet is to find the ASN and then block all the netblocks belonging to that ASN. That's where I want to be, actually.
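For quick manual lookups, the plain `whois` interface is enough; something like the following, assuming the standard `whois` client is installed (the RADB query at the end is one way to list an ASN's registered netblocks, which would be the missing piece for blocking them wholesale):

```
# Which ASN announces this address? IPv6 addresses work, too.
# (The IP here is a placeholder.)
whois -h whois.cymru.com " -v 203.0.113.7"

# List the netblocks registered to an ASN in the RADB routing registry.
# (The ASN here is a placeholder.)
whois -h whois.radb.net -- '-i origin AS64496'
```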
As it currently stands, both the rapid growth of AI-generated content overwhelming online spaces and aggressive web-crawling practices by AI firms threaten the sustainability of essential online resources. The current approach taken by some large AI companies—extracting vast amounts of data from open-source projects without clear consent or compensation—risks severely damaging the very digital ecosystem on which these AI models depend. -- Open Source devs say AI crawlers dominate traffic, forcing blocks on entire countries, by Benj Edwards, for Ars Technica
@bagder@mastodon.social recently had some numbers:
The AI bots that desperately need OSS for code training, are now slowly killing OSS by overloading every site.
The curl website is now at 77TB/month, or 8GB every five minutes.
@gluejar@tilde.zone writes:
There's a war going on on the Internet. AI companies with billions to burn are hard at work destroying the websites of libraries, archives, non-profit organizations, and scholarly publishers, anyone who is working to make quality information universally available on the internet. -- AI bots are destroying Open Access, by Eric Hellman
Since January 2024, we have seen the bandwidth used for downloading multimedia content grow by 50%. This increase is not coming from human readers, but largely from automated programs that scrape the Wikimedia Commons image catalog of openly licensed images to feed images to AI models. Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs. -- How crawlers impact the operations of the Wikimedia projects, Birgit Mueller, Chris Danis and Giuseppe Lavagetto, all for the Wikimedia Foundation