As a server author, you need to expect bots to visit your site. Some of them get trapped in endless loops trying to index dynamic sites, ancient page histories, or alternate presentations, ignoring your /robots.txt file. Some of them relentlessly hammer your site, ignoring status 44 "Slow Down".

Therefore, either make sure your server only serves a limited number of static files so that it can handle the bots, or hide the dynamic parts of your site behind a client certificate requirement, or block the bots.

What is the problem?

A traditional server (for the web or Gopher) serves files. This means every URL maps to a file, there is a limited number of these files, and the server can serve them quickly and in parallel. There is enough bandwidth to serve all these files, and there are enough resources to open and read all these files, possibly at the same time. Since many bots are parallelised, servers need to be able to serve in parallel, too.

All of these assumptions can break down, however. If you're serving a wiki, with page histories, differences between the various revisions, change logs going back in time, and so on, then that means a lot of "useless" hits that a bot spidering the site for a search engine doesn't need.

If you're translating some sort of internal format to Gemtext when a resource is requested, then this takes extra processing power. The benefit is that this reduces disk space requirements since you don't keep a static copy. This is a useful trade-off if you have many thousands of files but only a small number of them are being read by humans at any point in time. It also allows you to serve changed resources without regenerating the whole site, or without having to figure out which files to regenerate. That can be difficult to do: remember that the Portland Pattern Repository (the first wiki) appends a question mark to links that point to missing pages, and that Wikipedia uses colour to indicate links to missing pages. Upon page creation, you would have to find and regenerate all the pages that already link to the page that is no longer missing.

If you're hosting a site on a constrained system, there are limits on bandwidth, memory and the number of open files to consider.

In any case, server authors have a way to communicate to clients that they should slow down: status 44. And they have a way to communicate to bots which URLs not to request: /robots.txt.
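
For example, a /robots.txt like the following tells well-behaved bots to stay away from page histories, diffs and searches. The paths are made up, of course; they would have to match the dynamic parts of your own site.

User-agent: *
Disallow: /history
Disallow: /diff
Disallow: /search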

The problem is bots that don't care because their programmers don't care. These programmers, through ignorance, carelessness or malice, do not monitor their bots and don't realize what is happening. The rest of this page introduces some techniques to ward off their bots.

Blocking bots by logging IP numbers and using fail2ban

One way to deal with bots is to log the IP numbers of visitors to a file and let a separate process like fail2ban handle the blocking: write your fail2ban rules such that every access counts as a failed login attempt, and then pick a reasonable limit like 30 requests in 60 seconds (that averages out to one request every two seconds, but allows for peaks). fail2ban then uses the firewall to block any visitor exceeding that rate. By default, the block is for 10 minutes; if you activate the “recidive” setup, then such blocks themselves trigger another rule, where you can define yet another number of hits, like 3 blocks in 1 hour. This second block might extend for a week, for example.
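
As a rough sketch, assume your Gemini server logs one line per request, starting with a timestamp followed by the client's IP number. The jail name, log path and regular expression below are all assumptions that you would adapt to your own setup:

# /etc/fail2ban/filter.d/gemini.conf: every request counts as a "failure";
# adapt the regex to your server's log format (fail2ban strips the
# timestamp before matching)
[Definition]
failregex = ^\s*<HOST>\s+.*$
ignoreregex =

# /etc/fail2ban/jail.local: 30 requests in 60 seconds gets you banned
# for 10 minutes
[gemini]
enabled  = true
port     = 1965
filter   = gemini
logpath  = /var/log/gemini.log
findtime = 60
maxretry = 30
bantime  = 600

# the "recidive" jail that ships with fail2ban watches fail2ban's own log:
# 3 bans within 1 hour extend the ban to a week
[recidive]
enabled  = true
findtime = 3600
maxretry = 3
bantime  = 604800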

For visitors, getting blocked by the firewall means that their connections are simply refused. There is no explanation shown on their end. It’s definitely going to save you resources since your server doesn’t have to serve these requests, but occasionally innocent people are blocked without understanding what is happening to them.

Logging IP numbers is a bit of an issue since under the General Data Protection Regulation (GDPR) they count as personal information. You can of course argue that you really need them to keep the site up, but perhaps there are other options.

More information about setting up fail2ban

Blocking bots without logging IP numbers

A solution without logging IP numbers means that the server needs to keep a list of the most recent visitors in memory. Pick a reasonable limit like 30 requests in 60 seconds. In this case, the server keeps up to 30 timestamps for every visitor. Once there are 30 timestamps, it compares the oldest one to the current time, and if they are less than 60 seconds apart, the server replies with status 44 “Slow Down!” and a number of seconds that seems reasonable to you.
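
A minimal sketch of that bookkeeping, written here in Python with made-up names, might look as follows. A real server would also want to drop visitors it hasn't seen in a while, to keep memory use bounded.

import time
from collections import defaultdict, deque

LIMIT = 30       # requests allowed ...
WINDOW = 60      # ... within this many seconds
SLOW_DOWN = 60   # seconds to send along with status 44

# for every visitor, remember the timestamps of their last LIMIT requests
timestamps = defaultdict(lambda: deque(maxlen=LIMIT))

def check(ip):
    """Return None if the request is fine, or the number of seconds
    to send back with a status 44 "Slow Down!" response."""
    now = time.time()
    seen = timestamps[ip]
    if len(seen) == LIMIT and now - seen[0] < WINDOW:
        return SLOW_DOWN
    seen.append(now)
    return None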

One way of handling bots that are forever patient and ruthless is to increase the ban every time it is triggered. That is, for every ban, remember for how many seconds you banned the IP number and, when the visitor violates the ban (ignoring the status 44), double it. So if you ban a visitor for 60 seconds and they make another request within those 60 seconds, ban them for 120 seconds, and so on. Maybe have an upper limit, say four weeks, just in case somebody else ends up with that IP number.
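
Continuing the sketch above, and again using made-up names, the doubling could be handled like this:

import time

MAX_BAN = 4 * 7 * 24 * 60 * 60   # upper limit: four weeks, in seconds

bans = {}   # ip -> (banned until, ban duration in seconds)

def ban(ip, duration=60):
    """Ban a visitor for the first time."""
    bans[ip] = (time.time() + duration, duration)

def banned(ip):
    """Return 0 if the visitor is not banned. Otherwise they are
    violating the ban by making this request: double it and return
    the new duration in seconds."""
    now = time.time()
    until, duration = bans.get(ip, (0, 0))
    if now >= until:
        bans.pop(ip, None)
        return 0
    duration = min(2 * duration, MAX_BAN)
    bans[ip] = (now + duration, duration)
    return duration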

Banning IP numbers is problematic

It’s true. Perhaps there’s a shared server at that IP number. One of the users on that server writes a misbehaving bot and all of them are punished. If you are concerned about that, you need to move the dynamic content behind a client certificate requirement. There is no other way to identify particular users using Gemini.

What about dynamic IP numbers?

Some bot authors have rented cloud servers using IP numbers from Amazon, OVH, Hetzner, and others. If your server blocks one of the IP numbers, they’ll switch to another one. How do we handle that?

If you can determine that a bunch of them belong to the same network, you can block the entire network. Switching IP numbers is then much less effective. This works quite well because there are “residential” networks for ordinary people and “cloud server” networks for hosting providers, so chances are you’ll only be blocking cloud servers, not ordinary people like you and me.

One way of determining the network is to make a DNS lookup using the reversed IP number. This only works for IPv4, for now. Assume that the offending IP number is 178.209.50.237, and that you have other IP numbers currently blocked as well. You need to reverse the octets of that IP number, append “asn.routeviews.org”, and make a DNS query for a TXT record. The answer tells you which network to ban.

dig -4 -t TXT 237.50.209.178.asn.routeviews.org

The answer section:

237.50.209.178.asn.routeviews.org. 86400 IN TXT	"29691" "178.209.32.0" "19"

The first string is the autonomous system number (AS 29691), and the other two give you the network and the prefix length: you can ban the entire network by banning the IP range 178.209.32.0/19.
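
If you want to automate this, a small sketch using the dnspython library (that choice is an assumption; any DNS library that can query TXT records will do) could look like this:

import ipaddress
import dns.resolver   # the dnspython package, version 2.x

def network_for(ip):
    """Return the network an IPv4 address belongs to,
    according to asn.routeviews.org."""
    reversed_ip = ".".join(reversed(ip.split(".")))
    name = reversed_ip + ".asn.routeviews.org"
    for rdata in dns.resolver.resolve(name, "TXT"):
        # three strings: AS number, network, prefix length
        asn, network, length = (s.decode() for s in rdata.strings)
        return ipaddress.ip_network(network + "/" + length)

print(network_for("178.209.50.237"))   # 178.209.32.0/19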

Needless to say, if banning IP numbers is problematic, then banning entire IP ranges is even more problematic. Then again, an abundance of unsupervised bots is just as problematic.