At midnight, there was a surge in activity. CPU usage went up.
Load went up, too. But it stayed within reasonable bounds -- less than 4 instead of the more than 80 I have seen in the past.
And the number of IP addresses blocked by `fail2ban` went from 40 to 50.
I'm usually sceptical that these numbers mean much, because the big attacks come from a far wider variety of IP numbers. In this case, however, maybe there was some probing that resulted in blocks? I don't know. Lucky, I guess?
In any case, the site is still up. Yay for small wins.
Also, I cannot overstate how good it feels to have some Munin graphs available.
`alex-bots` is a setup I described in 2025-02-19 Bots again, cursed. Basically, a request to one of my Oddmuse wikis containing the parameter `rcidonly` hits an expensive endpoint: "all changes for this single page" or "a feed for this single page". This is something a human would rarely access, and yet somehow these URLs landed in some dataset for AI training, I suspect. So what I do is redirect any request containing "rcidonly" in the query string to `/nobots`, warning humans not to click on these links.
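For reference, the redirect itself is just a rewrite rule. A minimal sketch of the kind of rule I mean, assuming Apache with mod_rewrite enabled (the actual rule in my site config may differ):

```
# Anything with rcidonly in the query string gets sent to the warning page.
# QSD (drop the query string from the redirect target) needs Apache 2.4.
RewriteEngine on
RewriteCond %{QUERY_STRING} rcidonly
RewriteRule .* /nobots [R,L,QSD]
```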
In addition to that, the filter `/etc/fail2ban/filter.d/alex-bots.conf` contains this:
```
[Definition]
failregex = ^(www\.emacswiki\.org|communitywiki\.org|campaignwiki\.org):[0-9]+ <HOST> .*rcidonly=
```
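The leading hostname and port are there because the access log uses a vhost-style format where the virtual host and port come first, followed by the client IP that `<HOST>` captures. A made-up example of the kind of line this is meant to match (IP address and page name are placeholders):

```
www.emacswiki.org:443 203.0.113.7 - - [22/Mar/2025:00:05:12 +0100] "GET /wiki?action=rss;rcidonly=SomePage HTTP/1.1" 200 4321 "-" "Mozilla/5.0"
```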
And I added a section using this filter to my jail `/etc/fail2ban/jail.d/alex.conf`:
```
[alex-bots]
enabled = true
port = http,https
logpath = %(apache_access_log)s
findtime = 3600
maxretry = 2
```
So if an IP number visits two URLs containing "rcidonly" within an hour, it gets banned for ten minutes (the default ban time, since the jail doesn't set its own).
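To check that the filter and jail actually do what I think they do, the standard tools are enough; something along these lines, with the log path adjusted to wherever the access logs live:

```
# Dry-run the filter against an access log and count the matches.
fail2ban-regex /var/log/apache2/access.log /etc/fail2ban/filter.d/alex-bots.conf

# Once the jail is running, see how many IP numbers are currently banned.
fail2ban-client status alex-bots
```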
The `recidive` filter (a standard filter you just need to activate) then makes sure that any IP number that got blocked three times gets blocked for a week.
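I haven't shown that section, but a sketch of such a jail entry, assuming the numbers above (three bans within a day gets you banned for a week; the stock defaults in `jail.conf` differ slightly):

```
[recidive]
enabled = true
logpath = /var/log/fail2ban.log
findtime = 1d
maxretry = 3
bantime = 1w
```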
#Administration #Butlerian Jihad
If you think these crawlers respect `robots.txt` then you are several assumptions of good faith removed from reality. These bots crawl everything they can find, `robots.txt` be damned, including expensive endpoints like git blame, every page of every git log, and every commit in every repo, and they do so using random User-Agents that overlap with end-users and come from tens of thousands of IP addresses – mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure – actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic. -- Please stop externalizing your costs directly into my face, by Drew DeVault, for SourceHut
Then, yesterday morning, KDE GitLab infrastructure was overwhelmed by another AI crawler, with IPs from an Alibaba range; this caused GitLab to be temporarily inaccessible by KDE developers. I then discovered that, one week ago, an Anime girl started appearing on the GNOME GitLab instance, as the page was loaded. It turns out that it's the default loading page for Anubis, a proof-of-work challenger that blocks AI scrapers that are causing outages. -- FOSS infrastructure is under attack by AI companies, by Niccolò Venerandi, for LibreNews
What do SourceHut, GNOME’s GitLab, and KDE’s GitLab have in common, other than all three of them being forges? Well, it turns out all three of them have been dealing with immense amounts of traffic from “AI” scrapers, who are effectively performing DDoS attacks with such ferocity it’s bringing down the infrastructures of these major open source projects. Being open source, and thus publicly accessible, means these scrapers have unlimited access, unlike with proprietary projects. … Everything about this “AI” bubble is gross, and I can’t wait for this bubble to pop so a semblance of sanity can return to the technology world. Until the next hype train rolls into the station, of course. -- FOSS infrastructure is under attack by AI companies, by Thom Holwerda, for OSnews
He links to the IP to ASN Mapping Service by Team Cymru. I started switching my network-lookup script to use it because it also supports IPv6. Something I haven't done yet is to find the ASN and then block all the netblocks belonging to that ASN. That's where I want to be, actually.
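For quick manual lookups, the plain `whois` interface is enough; something like the following, assuming the standard `whois` client is installed (the RADB query at the end is one way to list an ASN's registered netblocks, which would be the missing piece for blocking them wholesale):

```
# Which ASN announces this address? IPv6 addresses work, too.
# (The IP here is a placeholder.)
whois -h whois.cymru.com " -v 203.0.113.7"

# List the netblocks registered to an ASN in the RADB routing registry.
# (The ASN here is a placeholder.)
whois -h whois.radb.net -- '-i origin AS64496'
```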
As it currently stands, both the rapid growth of AI-generated content overwhelming online spaces and aggressive web-crawling practices by AI firms threaten the sustainability of essential online resources. The current approach taken by some large AI companies—extracting vast amounts of data from open-source projects without clear consent or compensation—risks severely damaging the very digital ecosystem on which these AI models depend. -- Open Source devs say AI crawlers dominate traffic, forcing blocks on entire countries, by Benj Edwards, for Ars Technica
@bagder@mastodon.social recently had some numbers:
The AI bots that desperately need OSS for code training, are now slowly killing OSS by overloading every site.
The curl website is now at 77TB/month, or 8GB every five minutes.
@gluejar@tilde.zone writes:
There's a war going on on the Internet. AI companies with billions to burn are hard at work destroying the websites of libraries, archives, non-profit organizations, and scholarly publishers, anyone who is working to make quality information universally available on the internet. -- AI bots are destroying Open Access, by Eric Hellman
Since January 2024, we have seen the bandwidth used for downloading multimedia content grow by 50%. This increase is not coming from human readers, but largely from automated programs that scrape the Wikimedia Commons image catalog of openly licensed images to feed images to AI models. Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs. -- How crawlers impact the operations of the Wikimedia projects, Birgit Mueller, Chris Danis and Giuseppe Lavagetto, all for the Wikimedia Foundation