Recently I was writing about my dislike of crawlers. They turned into a kind of necessary evil on the web – but it’s not too late to choose a different future for Gemini. I want to encourage all server authors and crawler authors to think long and hard about alternatives.
2020-12-22 Apache config file to block user agents
One feature I dislike about crawlers is that they follow all the links. Sure, we have a semi-useful “robots.txt” specification but it’s easy to get wrong on both sides. I’ve had bugs in my “robots.txt” file for a long time without noticing them.
Now, if the argument is that I cannot prevent crawlers from leeching my site, then my reply is that I will try to defend myself anyway, even if it is impossible to get it 100% right. The first line of defence is going to be my “robots.txt” file. It’s not perfect, and that’s fine. I know it’s not perfect because I only need to look at the Apache config file I use to block all the misbehaving bots and user agents.
Ugh, look at the bots hitting my websites:
$ /home/alex/bin/bot-detector < /var/log/apache2/access.log.1
              --------------Bandwidth-------Hits-------Actions--Delay
    Everybody      2416M    102520
     All Bots       473M     23063      100%      19%
-------------------------------------------------------
      bingbot     240836K     8157       35%      31%     10s
    YandexBot      36279K     3905       16%       3%     22s
    Googlebot      65808K     3679       15%      34%     23s
       Adsbot      20187K     3115       13%       0%     27s
     Applebot      66607K      908        3%       0%     95s
      Facebot       1611K      390        1%       0%    220s
     PetalBot       1548K      329        1%      12%    257s
          Bot       2101K      308        1%       0%    280s
       robots        525K      231        1%       0%    374s
     Slackbot       1339K      224        0%      96%    382s
   SemrushBot        572K      194        0%       0%    438s
A full 22% of all user agents have something like “bot” in their name. Just look at them! Let’s take the last one, SemrushBot. The user agent also has a link, and if you want, you can take a look. All the goals it lists are disgusting, or benefit corporations rather than me or other humans. Barf with me as you read statements such as “the Brand Monitoring tool to index and search for articles” or “the On Page SEO Checker and SEO Content template tools reports”. 🤮
Have a look at your own webserver logs. 22% of my CPU resources, of the CO₂ my server produces, of the electricity it eats, for machines that do not have my best interest in mind. I don’t want a web that’s 20% bots crawling all over my site. I don’t want a Gemini space that’s 20% bots crawling all over my capsules.
OK, so let’s talk about defence.
When I look at my Gemini logs, I see that plenty of requests come from Amazon hosts. I take that as a sign of autonomous agents. I might sound like a fool on the Butlerian Jihad, but if I need to block entire networks, then I will. Looking up WHOIS data also costs resources. It would be better if we could identify these bots by looking at their behaviour.
As explained in *Dune*, the Butlerian Jihad is a conflict taking place over 11,000 years in the future (and over 10,000 years before the events of *Dune*) which results in the total destruction of virtually all forms of “computers, thinking machines, and conscious robots”. – The Butlerian Jihad
The first mistake crawlers make is that they are too fast. So here’s what I’m currently doing: for every IP, I’m keeping track of the last 30 requests in the last 60s. If there are more requests, the IP number is blocked. Thus, if your average clicking rate is more than 1 click per 2s over a 1min window, you’re probably a bot and you get blocked. I might have to turn this up. Perhaps 1 click per 5s makes more sense for a human.
But there’s more. I see the crawlers clicking on all the links. All the HTML renderings of the pages are already available via Gemini. It makes no sense to request all of these. All the raw wiki text of the pages are available as well. It makes no sense to request all of these, either. All the links to leave a comment are also on every page. It makes no sense to request all of these either.
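To make this concrete, here is a minimal sketch of the kind of “robots.txt” rules I mean. The path prefixes are inferred from the alternate views just mentioned and from the logs below; my actual file differs in the details:

User-agent: *
Disallow: /raw
Disallow: /html
Disallow: /history
Disallow: /diff
Disallow: /do/comment

A well-behaved crawler would then stick to the plain page views and leave all the duplicates alone.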
Here’s the kind of behaviour I’m talking about. I picked an IP number from the logs and checked what it has been requesting:
2020-12-25 08:32:37 gemini://transjovian.org:1965/page/Linking/2
2020-12-25 08:32:45 gemini://alexschroeder.ch:1965/history/Perl
2020-12-25 08:32:59 gemini://communitywiki.org:1965/page/CategoryWikiProcess
2020-12-25 08:33:18 gemini://transjovian.org:1965/page/Titan/5
2020-12-25 08:33:30 gemini://communitywiki.org/page/CultureOrganis%C3%A9e
2020-12-25 08:33:57 gemini://transjovian.org:1965/history/Spaces
2020-12-25 08:34:23 gemini://transjovian.org:1965/gemini/page/common%20wiki%20structure/TimurIsmagilov
2020-12-25 08:34:30 gemini://alexschroeder.ch:1965/tag/Hex%20Describe
2020-12-25 08:34:56 gemini://communitywiki.org:1965/page/SoftwareBazaar
2020-12-25 08:35:02 gemini://communitywiki.org:1965/page/DoTank
2020-12-25 08:35:22 gemini://transjovian.org:1965/test/history/Welcome
2020-12-25 08:36:56 gemini://alexschroeder.ch:1965/tag/Gadgets
2020-12-25 08:38:20 gemini://alexschroeder.ch:1965/tag/Games
2020-12-25 08:45:58 gemini://alexschroeder.ch:1965/do/comment/GitHub
2020-12-25 08:46:05 gemini://alexschroeder.ch:1965/html/GitHub
2020-12-25 08:46:12 gemini://alexschroeder.ch:1965/raw/Comments_on_GitHub
2020-12-25 08:46:19 gemini://alexschroeder.ch:1965/raw/GitHub
2020-12-25 08:47:45 gemini://alexschroeder.ch:1965/page/2018-08-24_GitHub
2020-12-25 08:47:51 gemini://alexschroeder.ch:1965/do/comment/Comments_on_2018-08-24_GitHub
2020-12-25 08:47:57 gemini://alexschroeder.ch:1965/html/Comments_on_2018-08-24_GitHub
2020-12-25 09:21:26 gemini://alexschroeder.ch:1965/do/more
See what I mean? This is not a human. This is an unsupervised bot, otherwise the operator would have discovered that this makes no sense.
The solution I’m using for my websites is logging IP numbers and using fail2ban to ban IP numbers that request too many pages. The ban is for 10min, and if you’re a “recidive”, meaning you got banned three times for 10min, then you’re going to be banned for a week. The problem I have is that I would prefer a solution that doesn’t log IP numbers. It’s good for privacy and we should write our software such that privacy comes first.
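For reference, here is a sketch of what that looks like in fail2ban terms. The jail name and the filter are hypothetical, and the thresholds are only meant to match the description above; the “recidive” jail is fail2ban’s stock escalation mechanism:

# /etc/fail2ban/jail.local – illustrative values only
[too-many-requests]
enabled  = true
port     = http,https
# hypothetical filter matching one access log line per request
filter   = too-many-requests
logpath  = /var/log/apache2/access.log
# too many requests within findtime seconds earns a 10 minute ban
findtime = 60
maxretry = 30
bantime  = 600

# banned three times within a day, and the ban becomes a week
[recidive]
enabled  = true
logpath  = /var/log/fail2ban.log
findtime = 86400
maxretry = 3
bantime  = 604800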
Since I want something that doesn’t rely on logging IP numbers, I wrote a Phoebe extension called “speed bump”. Here’s what it currently does.
For every IP number, Phoebe records the last 30 requests in the last 60 seconds. If there are more than 30 requests in the last 60 seconds, the IP number is blocked. If somebody is faster on average than two seconds per request, I assume it’s a bot, not a human.
For every IP number, Phoebe records whether the last 30 requests were suspicious or not. A suspicious request is a request that is “disallowed” for bots according to “robots.txt” (more or less). If 10 requests or more of the last 30 requests in the last 60 seconds are suspicious, the IP number is also blocked. That is, even if somebody is as slow as three seconds per request, if they’re all suspicious, I assume it’s a bot, not a human.
When an IP number is blocked, it is blocked for 60s, and there’s a 120s probation time. When you’re blocked, Phoebe responds with a “44” response. This means: slow down!
If the IP number sends another request while it is blocked, or if it gives cause for another block during the probation time, it is blocked again and the blocking time is doubled: the IP is blocked for 120s and probation is extended by 240s. And if it happens again, it is doubled again: blocked for 240s and probation extended by 480s.
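A minimal sketch of that bookkeeping, in plain Perl rather than the actual Phoebe extension (all the names and data structures here are made up for illustration):

#!/usr/bin/perl
# Sketch of the “speed bump” logic described above; not the real extension.
use Modern::Perl;

my %requests;   # IP → list of [timestamp, suspicious] pairs
my %block;      # IP → { seconds => ..., until => ..., probation => ... }

# Returns undef if the request may be served, or the number of seconds
# the client should wait (sent back as a “44” response).
sub speed_bump {
  my ($ip, $suspicious, $now) = @_;
  $now //= time;
  my $b = $block{$ip};
  if ($b and $now < $b->{until}) {
    double_block($ip, $now);            # a request during a block doubles it
    return $block{$ip}->{until} - $now;
  }
  # keep only the requests of the last 60 seconds
  my @log = grep { $_->[0] > $now - 60 } @{ $requests{$ip} // [] };
  push @log, [$now, $suspicious ? 1 : 0];
  $requests{$ip} = \@log;
  my $warnings = grep { $_->[1] } @log;
  if (@log > 30 or $warnings >= 10) {
    if ($b and $now < $b->{probation}) {
      double_block($ip, $now);          # relapse during probation
    } else {
      $block{$ip} = { seconds => 60, until => $now + 60,
                      probation => $now + 60 + 120 };
    }
    return $block{$ip}->{until} - $now;
  }
  return;                               # not blocked
}

sub double_block {
  my ($ip, $now) = @_;
  my $b = $block{$ip};
  $b->{seconds} *= 2;
  $b->{until} = $now + $b->{seconds};
  $b->{probation} = $b->{until} + 2 * $b->{seconds};
}

# In the request handler, roughly:
# if (my $delay = speed_bump($ip, is_suspicious($url))) {
#   print "44 $delay\r\n";              # Gemini: slow down
# } else {
#   serve($url);
# }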
The “/do/speed-bump/debug” URL (which requires a known client certificate) shows you the raw data, and the “/do/speed-bump/status” URL (which also requires a known client certificate) shows you a human readable summary of what’s going on.
Here’s an example:
Speed Bump Status
 From    To Warns Block Until Probation IP
  n/a   n/a  0/ 0   60s   n/a      100m 3.8.145.31
  n/a   n/a  0/ 0   60s    4h       14h 35.176.162.140
  n/a   n/a  0/ 0   60s   n/a        9h 18.134.198.207
-280s   -1s  7/30   n/a   n/a       n/a 3.10.221.60
All four of these numbers belong to “Amazon Data Services UK”.
If there are numbers in the “From” and “To” columns, that means the IP made a request in the last 60s. The “Warns” column says how many of the requests were considered “suspicious”. “Block” is the block time. As you can see, none of the bots managed to increase the block time. Why is that? The “Probation” column offers a glimpse into what happened: as the bots kept making requests while they were blocked, they kept adding to their own block.
A bit later:
Speed Bump Status
 From    To Warns Block Until Probation IP
  n/a   n/a  0/ 0   60s   n/a       83m 3.8.145.31
  n/a   n/a  0/ 0   60s    4h       13h 35.176.162.140
  n/a   n/a  0/ 0   60s   n/a        9h 18.134.198.207
-219s   -7s  3/30   n/a   n/a       n/a 3.10.221.60
It seems that the last IP number is managing to thread the needle.
Clearly, this is all very much in flux. I’m still working on it – and finding bugs in my “robots.txt”, unfortunately. I’ll keep this page updated as I learn more. One idea I’ve been thinking about is the time window: how many pages would an enthusiastic human read on a new site? Sixty pages in an hour, one minute per page? Or maybe twice that? That would point towards keeping a counter for a long-term average: if you’re requesting more than 60 pages in 30min, perhaps a timeout of 30min is appropriate.
The smol net is also a slow net. There’s no need for almost all the activity to be crawlers. If anything, crawlers should be the minority! So, if my sites had 95% human activity and 5% robot activity, I’d be more understanding. But right now, it’s crazy. All the CO₂ wasted, for bots.
I’m on The Butlerian Jihad!
#Gemini #Phoebe #Bots #Butlerian Jihad
(Please contact me if you want to remove your comment.)
⁂
Wouldn’t you get most of them by just blocking everything with “[Bb]ot” in the User-Agent?
– Adam 2020-12-25 16:15 UTC
---
It depends on what your goal is, and on the protocol you’re talking about. In the second half of my post I was talking about Gemini. That is a very simple protocol: establish a TCP/IP connection, with TLS, send a URI, get back a status header line + content. That is, the request does not contain any header lines, unlike HTTP.
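For illustration, a complete Gemini exchange looks roughly like this (everything over a single TLS connection to port 1965; the exact MIME parameters are just an example):

C: gemini://alexschroeder.ch/<CR><LF>
S: 20 text/gemini; charset=UTF-8<CR><LF>
S: … the page content follows, then the server closes the connection …

A blocked client would instead get a bare status line such as “44 60”, meaning: slow down, try again in 60 seconds. There is no User-Agent header anywhere in this exchange, so blocking by agent name, as one would on the web, is simply not an option.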
As for HTTP, which I mention in the first half: if a search engine were to crawl the new pages on my sites, slowly, then I wouldn’t mind so much, as long as the search engine is one intended for humans (these days that would be Google and Bing, I guess). I’d like to block those that misbehave, or that have goals I disagree with, and I’d like not to block the future search engine that is going to dethrone Google and Bing. I need to keep that hope alive, in any case. So if I want a nuanced result, I need a nuanced response. Slow down bots that can take a hint. Block bots that don’t. Block bots from dubious companies. And so on.
– Alex 2020-12-25 21:47 UTC
---
Here’s the current status of my “speed bump” extension to Phoebe:
Speed Bump Status
 From    To Warns Block Until Probation IP
 -10m   -9m 11/11  365d  364d      729d 3.11.81.100
 -12h  -12h 11/11  365d  364d      729d 18.130.221.176
 -12h  -12h 11/13  365d  364d      729d 3.9.134.250
 -14h  -14h 11/15  365d  364d      729d 3.8.127.24
 -14h  -14h 11/13  365d  364d      729d 167.114.7.65
 -10h  -10h 11/12  365d  364d      729d 18.134.146.76
 -16m  -14m 11/12  365d  364d      729d 3.10.232.193
All of these IP numbers have blocked themselves for over a year (or until I restart the server). Using “whois” to identify the organisation (and verifying my guess for tilde.team using “dig”) we get the following:
3.11.81.100    Amazon Data Services UK
18.130.221.176 Amazon Data Services UK
3.9.134.250    Amazon Data Services UK
3.8.127.24     Amazon Data Services UK
167.114.7.65   Tilde Team
18.134.146.76  Amazon Data Services UK
3.10.232.193   Amazon Data Services UK
Oh well. Every new IP number is going to make 10–20 requests and add another line. We could improve upon the model: once an IP is blocked for a year (the maximum), use WHOIS to look up the IP number’s range. Taking the first number as an example, we find that the “NetRange” is 3.8.0.0 - 3.11.255.255 and the “CIDR” is 3.8.0.0/14. Keep watching: once three IP numbers from that range are blocked, there’s no need to block them all individually; we can just block the whole range. In our example, we would have reacted once we had blocked 3.11.81.100, 3.9.134.250, and 3.8.127.24. At that point, 3.10.232.193 would have been blocked preemptively.
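Here is a sketch of that escalation. The names are made up and this is not actual Phoebe code; it assumes the Net::Whois::IP module from CPAN for the lookup and does the CIDR arithmetic by hand, IPv4 only:

#!/usr/bin/perl
# Sketch of the range-blocking idea above.
use Modern::Perl;
use Net::Whois::IP qw(whoisip_query);

my %blocked_ips;      # individually blocked IP numbers
my %blocked_ranges;   # CIDR ranges blocked as a whole

sub ip_to_int { unpack "N", pack "C4", split /\./, shift }

sub in_cidr {
  my ($ip, $cidr) = @_;
  my ($net, $bits) = split m{/}, $cidr;
  my $mask = $bits ? 0xFFFFFFFF << (32 - $bits) & 0xFFFFFFFF : 0;
  return (ip_to_int($ip) & $mask) == (ip_to_int($net) & $mask);
}

# Call this whenever an IP number reaches the maximum block time.
sub maybe_block_range {
  my $ip = shift;
  $blocked_ips{$ip} = 1;
  my $response = whoisip_query($ip);
  my $cidr = $response->{CIDR} or return;   # e.g. "3.8.0.0/14"
  $cidr =~ s/[,\s].*//;                     # some registries list several
  return if $blocked_ranges{$cidr};
  my @offenders = grep { in_cidr($_, $cidr) } keys %blocked_ips;
  if (@offenders >= 3) {
    $blocked_ranges{$cidr} = 1;             # block the whole range
    say "Blocking $cidr because of @offenders";
  }
}

# Requests from a blocked range are then refused up front:
sub range_blocked {
  my $ip = shift;
  return grep { in_cidr($ip, $_) } keys %blocked_ranges;
}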
Compare this to how GUS works. Indexing runs are made a few times a month. The IP numbers the requests come from are documented. They don’t change like the crawler (or crawlers?) running on Amazon. I’m tempted to say the bot operators hosting their bots on Amazon look like they are actively trying to evade the block. It feels like trespassing and it makes me angry.
– Alex 2020-12-26
---
Tilde Team is probably people, not a crawler. I gave more details in a reply to your toot.
– petard 2020-12-26 19:21 UTC
---
For those who don’t follow us on Mastodon… 😁 I replied with a screenshot of more or less the following, saying that the requests made from Tilde Team seem to indicate that this is an unsupervised crawler, not humans. The vast majority of requests is from a bot.
2020-12-27 01:20:31 gemini://alexschroeder.ch:1965/2008-05-09_Ontology_of_Twitter
2020-12-27 01:20:40 gemini://alexschroeder.ch:1965/2011-02-14_The_Value_of_a_Web_Site
2020-12-27 01:20:48 gemini://alexschroeder.ch:1965/2013-01-23_Security_of_Code_Downloaded_from_Online_Sources
2020-12-27 01:20:54 gemini://alexschroeder.ch:1965/2016-05-28_nginx_as_a_caching_proxy
2020-12-27 01:21:01 gemini://alexschroeder.ch:1965/Comments_on_2011-02-14_The_Value_of_a_Web_Site
2020-12-27 01:24:54 gemini://transjovian.org:1965/gemini/diff/common%20wiki%20structure/1
2020-12-27 01:25:01 gemini://transjovian.org:1965/gemini/diff/common%20wiki%20structure/2
2020-12-27 01:25:08 gemini://transjovian.org:1965/gemini/diff/common%20wiki%20structure/3
2020-12-27 01:25:15 gemini://transjovian.org:1965/gemini/do/atom
2020-12-27 01:25:23 gemini://transjovian.org:1965/gemini/do/rss
2020-12-27 01:25:29 gemini://transjovian.org:1965/gemini/page/common%20wiki%20structure/1
2020-12-27 01:25:37 gemini://transjovian.org:1965/gemini/page/common%20wiki%20structure/2
2020-12-27 01:25:43 gemini://transjovian.org:1965/gemini/page/common%20wiki%20structure/3
2020-12-27 01:46:49 gemini://communitywiki.org:1965/do/comment/BestPracticesForWikiTheoryBuilding
2020-12-27 01:46:58 gemini://communitywiki.org:1965/html/BestPracticesForWikiTheoryBuilding
2020-12-27 01:47:04 gemini://communitywiki.org:1965/page/PromptingStatement
2020-12-27 01:47:11 gemini://communitywiki.org:1965/page/WeLoveVolunteers
2020-12-27 01:47:18 gemini://communitywiki.org:1965/raw/BestPracticesForWikiTheoryBuilding
2020-12-27 01:47:26 gemini://communitywiki.org:1965/raw/Comments_on_BestPracticesForWikiTheoryBuilding
2020-12-27 01:47:33 gemini://communitywiki.org:1965/tag/inprogress
2020-12-27 01:47:41 gemini://communitywiki.org:1965/tag/practice
2020-12-27 01:47:48 gemini://communitywiki.org:1965/tag/practices
2020-12-27 01:47:56 gemini://communitywiki.org:1965/tag/prescription
2020-12-27 01:48:02 gemini://communitywiki.org:1965/tag/prescriptions
2020-12-27 01:48:11 gemini://communitywiki.org:1965/tag/recommendation
2020-12-27 01:48:16 gemini://communitywiki.org:1965/tag/recommendations
2020-12-27 01:48:23 gemini://communitywiki.org:1965/tag/theorybuilding
2020-12-27 01:51:05 gemini://communitywiki.org:1965/do/comment/HansWobbe
2020-12-27 01:51:08 gemini://communitywiki.org:1965/html/HansWobbe
2020-12-27 01:57:51 gemini://communitywiki.org:1965/page/BlikiNet
2020-12-27 02:17:04 gemini://communitywiki.org:1965/page/ChainVideo
2020-12-27 02:28:46 gemini://communitywiki.org:1965/page/CwbHwoAg
2020-12-27 02:58:36 gemini://communitywiki.org:1965/page/DfxMapping
Suspicious signs: the requests come at regular, machine-like intervals, and they cover the page diffs, the raw text, the HTML copies, the comment prompts and the tag pages, which are exactly the views my robots.txt asks crawlers to leave alone. These are not people. This is a crawler verifying its database, and ignoring robots.txt.
I think the main problem is that I run multiple sites served via Gemini with thousands of pages, and all the pages have links to alternate views (page history, page diff, HTML copy, raw copy, comments prompt), so perhaps mine are the only sites where crawlers might actually run into these limits. If somebody new sets up a Gemini server and serves two score static gemtext files, then these crawlers do little harm. But as it stands, there’s a constant barrage on my servers that bears no relation to the amount of human activity.
Some of these URIs are violating robots.txt. But it’s not just that. I also feel a moral revulsion: all the CO₂ wasted shows a disregard for resources these people are not paying for. This is exactly the problem our civilisation faces, on a small scale.
Thus, whereas GoogleBot and BingBot might be nominally useful (the wealth concentration we’ve seen as a consequence of their data gathering notwithstanding), the ratio of change to crawl is and remains important. Once a site is crawled, how often and which URLs should you crawl again? The current system is so wasteful.
Anyway, I have a lot of anger in me.
– Alex 2020-12-27
---
That’s a good summary of our conversation. My suggestion that requests from Tilde Team were probably people was based on the fact that it’s a public shell host that people use to browse gemini. (I have an account there and use it happily. It’s mostly a nice place with people I like to talk to. I am not otherwise affiliated.)
Seeing that log dump makes it clear that someone on that system is behaving badly.
– petard 2020-12-27 14:32 UTC
---
Current status:
Speed Bump Status
 From    To Warns Block Until Probation IP             CIDR
 -33m  -33m 30/30   28d   27d       55d 78.47.222.156  78.46.0.0/15
 -17h  -17h 11/11   28d   27d       55d 3.9.165.84     3.8.0.0/14
 -46h  -46h 17/17   28d   26d       54d 18.130.170.163 18.130.0.0/16
  -2d   -2d 11/11   28d   26d       54d 18.134.12.41   18.132.0.0/14
 -44h  -44h 11/11   28d   26d       54d 18.132.209.113 18.132.0.0/14
 -22h  -22h 13/13   28d   27d       55d 35.178.128.94  35.178.0.0/15
 -38h  -38h 12/12   28d   26d       54d 3.8.185.90     3.8.0.0/14
 -17h  -17h 12/12   28d   27d       55d 35.177.73.123  35.176.0.0/15
 -42h  -42h 11/11   28d   26d       54d 18.130.151.101 18.130.0.0/16
  -5h   -5h 13/13   28d   27d       55d 167.114.7.65   167.114.0.0/17
 -17h  -17h 14/14   28d   27d       55d 52.56.225.165  52.56.0.0/16
 -42h  -42h 12/12   28d   26d       54d 18.135.104.61  18.132.0.0/14
  -8h   -8h 12/12   28d   27d       55d 35.179.91.110  35.178.0.0/15
  -4h   -4h 11/11   28d   27d       55d 18.130.166.9   18.130.0.0/16
 -20h  -20h 11/11   28d   27d       55d 52.56.232.202  52.56.0.0/16
 -36h  -36h 13/13   28d   26d       54d 35.178.91.123  35.178.0.0/15
 -36h  -36h 11/11   28d   26d       54d 3.8.195.248    3.8.0.0/14

Until CIDR
  27d 18.130.0.0/16
  27d 3.8.0.0/14
  27d 35.178.0.0/15
  26d 18.132.0.0/14
Almost all of them are Amazon Data Services UK, with a few from Hetzner and some from OVH Hosting.
Seeing whole net ranges being blocked makes me happy. The code seems to work as expected.
– Alex 2020-12-29 16:35 UTC
---
Let’s check the number of requests blocked, relying on the Phoebe log files. “Looking at ” is an info log message it prints for every request. Let’s count them:
# journalctl --unit phoebe --since 2020-12-29|grep "Looking at"|wc -l
11700
Let’s see how many are caught by network range blocks:
# journalctl --unit phoebe --since 2020-12-29|grep "Net range is blocked"|wc -l
1812
Let’s see how many of them are just lone IP numbers being blocked:
# journalctl --unit phoebe --since 2020-12-29|grep "IP is blocked"|wc -l
2862
And first time offenders:
# journalctl --unit phoebe --since 2020-12-29|grep "Blocked for"|wc -l
8
I guess that makes 4682 blocked bot requests out of 11700 requests, or 40% of all requests.
The good news is that more than half seem to be legit? Or are they? I’m growing more suspicious all the time.
Let’s check HTTP access!
# journalctl --unit phoebe --since 2020-12-29|grep "HTTP headers"|wc -l
320
# journalctl --unit phoebe --since 2020-12-29|grep "HTTP headers"|perl -e 'while(<STDIN>){m/(\w*bot\w*)/i; print "$1\n"}'|sort|uniq --count
      1 
     22 bingbot
      2 Bot
     80 googlebot
     34 Googlebot
     88 MJ12bot
     32 SeznamBot
     61 YandexBot
That is, of the 11700 requests I’m looking at, I’ve had 320 web requests, of which 319 (!) were bots.
I think the next step will be to change the robots.txt served via the web to disallow them all.
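That change is a one-liner; the web copy of “robots.txt” would then be something like:

User-agent: *
Disallow: /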
– Alex 2020-12-30 11:40 UTC
---
Hm, but blocking IP addresses the way you mention would e.g. block my hacker space, where I’ve told a bunch of nerds that Gemini is cool and that they should have a look at … your site. And if it isn’t a hacker space, it’s a student dorm, or something similar, behind NAT.
I understand your anger, but blocking IP addresses in the end isn’t better than Hotmail & Google not accepting mail from my host – they think it’s suspicious because it’s small (it has proper DNS, it’s on no blacklist and so on); they just ASSUME it could be wrong. The Internet is “everyone can talk to everyone”, and my approach is to make that happen. Every counter-approach is breaking the Internet, IMHO. YMMV.
– Götz 2021-01-05 23:40 UTC
---
How would you defend against bad actors, then? Simply accept it as a fact of life and add better infrastructure, or put the “smol net” behind a login? If all I have is an IP number of a peer connecting to my server, then all the consequences must relate to the IP number, or there must be no consequences. That’s how I understand the situation.
– Alex Schroeder 2021-01-06 11:09 UTC
---
Here’s an update. Filtering the log, I see about 8000 requests:
# journalctl --unit phoebe | grep "\[info\] Looking at" | wc -l
8161
A full three quarters of them are currently blocked!
# journalctl --unit phoebe | grep "\[info\] .* is blocked" | wc -l
6197
The list keeps growing. I decided to write a script that would retrieve this page for me, and call WHOIS for all the networks identified.
#!/usr/bin/perl
use Modern::Perl;
use Net::Whois::IP qw(whoisip_query);

say "Requesting data";
my $data = qx(gemini --cert_file=/home/alex/.emacs.d/elpher-certificates/alex.crt --key_file=/home/alex/.emacs.d/elpher-certificates/alex.key gemini://transjovian.org/do/speed-bump/status);

say "Reading blocked networks";
my %seen;
while ($data =~ /(\d+\.\d+\.\d+\.\d+|[0-9a-f]+:[0-9a-f]+:[0-9a-f:]+)\/\d+/g) {
  my $ip = $1;
  next if $seen{$ip};
  $seen{$ip} = 1;
  my $response = whoisip_query($ip);
  my $name = $response->{OrgName}
      || $response->{netname}
      || $response->{Organization}
      || $response->{owner};
  my $country = $response->{country}
      || $response->{Country}
      || $response->{'Country-Code'};   # quoted: the key contains a hyphen
  $name .= " ($country)" if $name and $country;
  if ($name) {
    say "$ip $name";
  } else {
    say "$ip";
    for (keys %$response) {
      say "  $_: $response->{$_}";
    }
  }
}
Let’s see:
Reading blocked networks
146.185.64.0  SAK-FTTH-Pool1 (CH)
35.176.0.0    Amazon Data Services UK
52.56.0.0     Amazon Data Services UK
52.88.0.0     Amazon Technologies Inc.
201.159.61.0  Grupo Servicios Junin S.A. (AR)
201.159.60.0  Grupo Servicios Junin S.A. (AR)
35.178.0.0    Amazon Data Services UK
18.130.0.0    Amazon Data Services UK
81.170.128.0  GENERAL-PRIVATE-NET-A258-4 (SE)
3.8.0.0       Amazon Data Services UK
116.203.0.0   STUB-116-202SLASH15 (ZZ)
186.0.160.0   Grupo Servicios Junin S.A. (AR)
135.181.0.0   DE-HETZNER-19931109 (DE)
18.132.0.0    Amazon Data Services UK
18.168.0.0    Amazon Data Services UK
130.211.0.0   Google LLC
193.70.0.0    FR-OVH-930901 (FR)
67.60.37.0    CABLE ONE, INC.
140.82.24.0   Vultr Holdings, LLC
83.248.0.0    SE-TELE2-BROADBAND-CUSTOMER (SE)
195.138.64.0  TENET (UA)
185.87.121.0  NETUCE-BILISIM (TR)
2.80.0.0      MEO-BROADBAND (PT)
67.205.144.0  DigitalOcean, LLC
68.183.128.0  DigitalOcean, LLC
173.230.145.0 Linode
Some thoughts:
The STUB result stands out. If you run whois yourself:
person:   STUB PERSON
address:  N/A
country:  ZZ
phone:    +00 0000 0000
e-mail:   no-email@apnic.net
OK…
Some of them are residential networks, i.e. people operating from home, and by blocking them I’m also blocking everybody else behind the same residential network. That hurts a lot more than blocking cloud service providers.
– Alex 2021-07-26 08:50 UTC
---
@bortzmeyer commented and said that 6000 of these requests per day is one request every 14s, that is: a minuscule load. And that is true. But it still angers me because of the slippery slope. Where do you draw the line? I have to block Fediverse user agents from my web pages because when I share a link to my site, all the instances fetch a preview of the link. I get hundreds of requests in a few minutes. That means I can no longer serve my site from a Perl CGI script on a 2G virtual machine. Is this my problem, or are the Fediverse developers to blame? Or perhaps it is the mindset that aggravates me.
To me, this is the attitude with which we destroy so many things: we can’t be frugal with computing cycles, memory requirements, road capacity, electricity consumption unless there is a price to be paid, so we carelessly claim it all, waste it all, and then we can’t back down from it all when we’ve reached the limits. How much better to only take what you need.
If you are interested:
Mastodon can be used as a DDOS tool #4486
– Alex 2021-07-26 11:53 UTC
---
Looking at my Gemini logs…
Total requests:
# journalctl --unit phoebe | grep "Looking at" | wc -l
22647
IP numbers and networks blocked:
# journalctl --unit phoebe | grep "IP is blocked" | wc -l
19329
# journalctl --unit phoebe | grep "Net range .* is blocked" | wc -l
141
That leaves 3177 requests that were actually served. Or 86% of all requests were bots.
The time period covered is about 2¼ days.
# journalctl --unit phoebe | head -n 1
-- Logs begin at Fri 2021-08-20 06:48:21 CEST, end at Sun 2021-08-22 12:36:04 CEST. --
– Alex
---
But hey! Gemini hit a milestone! Script kiddies have hit the scene and now we have to contend with their crap! Woot! – The script kiddies have come to Gemini
– Alex 2021-08-29 16:01 UTC