Today I added some IP numbers to my firewall block lists again. I feel somewhat bad about it, because I guess my robots.txt was not set up correctly. At the same time, I feel like I don’t owe anything to unwatched crawlers.
So for the moment I banned:
What’s the country for global companies that don’t pay taxes? “From the Internet‽”
I’m still not quite sure what to do now. I guess I just don’t know how I feel about crawling in general. What would a network look like that doesn’t crawl? Crawling means that somebody is accumulating data. Valuable data. Toxic data. Haven’t we been through all this? I got along well with the operator of GUS when we exchanged a few emails. And yet, the crawling makes me uneasy.
Data parsimony demands that we don’t collect the data we don’t need; that we don’t store the data we collect; that we don’t keep the data we store. Delete that shit! One day somebody inherits, steals, leaks, or buys that data store and does things with it that we don’t want. I hate that defending against leeches (eager crawlers I feel are misbehaving) means I need to start tracking visitors. Logging IP numbers. Seeing what pages the active IP numbers are looking at. Are they too fast for a human? Is the sequence of links they are following a natural reading sequence? I hate that I’m being forced to do this every now and then. And what if I don’t? Perhaps somebody is going to use Soweli Lukin to index Gopherspace? Perhaps somebody is going to use The Transjovian Vault to index Wikipedia via Gemini? Unsupervised crawlers will do anything.
There’s something about the whole situation that’s struggling to come out. I’m having trouble putting it into words.
Like… There’s a certain lack of imagination out there.
People say: that’s the only way a search engine can work. Maybe? Maybe not? What if sites sent updates, compiled databases? A bit like the Sitemap format? A sort of compiled and compressed word/URI index? And if then very few people actually sent in those indexes, would that not be a statement in itself? Now people don’t object because it takes effort. But perhaps they wouldn’t opt-in either!
People say: anything you published is there for the taking. Well, maybe if you’re a machine. But if there is a group of people sitting around a cookie jar, you wouldn’t say “nobody is stopping me from taking them all.” Human behaviour can be nuanced, and if we cannot imagine technical solutions that are nuanced, then I don’t feel like it’s on me to reduce my expectations. Perhaps it’s on implementors to design more nuanced solutions! And yes, those solutions are going to be more complicated. Obviously so! We’ll have to design ways to negotiate consent, privacy, data ownership.
It’s a failure of design if “anything you publish is there for the taking” is the only option. Since I don’t want this, I think it’s on me and others who dislike this attitude to confidently set boundaries. I use fail2ban to ban user agents who make too many requests, for example. Somebody might say: “why don’t you use a caching proxy?” The answer is that I don’t feel like it is on me to build a technical solution that scales to the corpocaca net; I should be free to run a site built for the smol net. If you don’t behave like a human on the smol net, I feel free to defend my vision of the net as I see fit – and I encourage you to do the same.
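For illustration, such a fail2ban setup could look like the sketch below. The jail name, log path, and the numbers are placeholders, not my actual configuration; any regular expression that matches one request per log line will do, because then maxretry within findtime acts as a rate limit.

# /etc/fail2ban/jail.d/leeches.local (illustrative numbers: more than
# 20 requests in 40 seconds gets you banned for 10 minutes)
[apache-leech]
enabled  = true
port     = http,https
filter   = apache-leech
logpath  = /var/log/apache2/access.log
findtime = 40
maxretry = 20
bantime  = 600

# /etc/fail2ban/filter.d/apache-leech.conf
# Every request counts as a “failure” for fail2ban’s bookkeeping.
[Definition]
failregex = ^<HOST> -.*"(GET|POST|HEAD)
ignoreregex =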
People say: ah, I understand – you’re using a tiny computer. I like tiny computers. That’s why you want us to treat your server like it was smol. No. I want you to treat my server like it was smol because we’re on the smol net.
For my websites, I took a look at my log files and saw that at the very least (!) 21% of my hits are bots (18253 / 88862). Of these, 20% are by the Google bot, 19% are by the Bing bot, 10% are by the Yandex bot, 5% are by the Apple bot, and so on. And that is considering a long robots.txt, and a huge Apache config file to block a gazillion more user agents! Is this what you want for Gemini? The corpocaca Gemini? Not me!
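If you want a similar tally for your own logs, a rough sketch like the following would do. It assumes the standard “combined” Apache log format, where the user agent is the last quoted field, and it only knows the handful of bot names mentioned above, so it undercounts.

#!/usr/bin/env perl
# Rough bot tally for an Apache "combined" access log. Only the polite,
# self-identifying crawlers are counted, so the real share is higher.
use Modern::Perl;
my $total = 0;
my %bots;
while (<>) {
  $total++;
  my ($agent) = /"([^"]*)"\s*$/;   # last quoted field = user agent
  next unless $agent;
  $bots{lc $1}++ if $agent =~ /(Googlebot|bingbot|YandexBot|Applebot)/i;
}
my $bot_hits = 0;
$bot_hits += $_ for values %bots;
printf "%d of %d hits (%.0f%%) are known bots\n",
    $bot_hits, $total, 100 * $bot_hits / $total if $total;
printf "%6d %s\n", $bots{$_}, $_
    for sort { $bots{$b} <=> $bots{$a} } keys %bots;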
The robots.txt my websites all share, more or less
The user agent block list my web server uses
The Transjovian Vault, a Gemini proxy for Wikipedia
Soweli Lukin, a web proxy for Gopher and Gemini
#Gemini #Web
(Please contact me if you want to remove your comment.)
⁂
Some more data, now that I’m looking at my logs. These are the top hits on my sites via Phoebe:
 1  Amazon       1062
 2  OVH Hosting   929
 3  Amazon        912
 4  Amazon        730
 5  Amazon        653
 6  Amazon        482
 7  Amazon        284
 8  Amazon        188
 9  Hetzner       171
10  Amazon        129
11  OVH Hosting    55
Not a single human in sight, as far as I can tell. Crawlers crawling everywhere.
– Alex 2020-12-23 00:19 UTC
---
I installed the “surge protection” I’ve been using for Oddmuse, too: If you make more than 20 requests in 20s, you get banned for ever-increasing periods. Hey, I’m using Gemini status 44 at long last!
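The actual code lives in Phoebe; as a sketch, with made-up names, the bookkeeping looks roughly like this:

use Modern::Perl;

my %requests;   # IP → list of recent request timestamps
my %banned;     # IP → time the ban expires
my %ban_count;  # IP → how many times this IP has been banned before

# Returns the number of seconds to wait (for the status 44 meta),
# or 0 if the request may go ahead.
sub surge_protection {
  my ($ip, $now) = @_;
  if ($banned{$ip} and $now < $banned{$ip}) {
    return int($banned{$ip} - $now + 1);
  }
  push @{$requests{$ip}}, $now;
  @{$requests{$ip}} = grep { $_ > $now - 20 } @{$requests{$ip}};
  if (@{$requests{$ip}} > 20) {
    my $period = 60 * 2 ** $ban_count{$ip}++;   # ever-increasing periods
    $banned{$ip} = $now + $period;
    return $period;
  }
  return 0;
}

# In the request handler: if surge_protection($ip, time) returns a
# number of seconds $wait > 0, respond with "44 $wait\r\n" and close
# the connection.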
I’m thinking about checking whether the last twenty URIs requested are “plausible” – if somebody is requesting a lot of HTML pages, or raw pages, then that’s a sign of a crawler just following all the links, and perhaps that deserves a ban even if it’s slow enough to stay under the rate limit.
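Roughly like this, perhaps – the path patterns are just examples of what a link-following crawler would request on my sites, not a definitive list:

# Sketch: given the last twenty URIs requested by one IP, guess whether
# this looks like a human reading pages or a crawler following links.
sub plausible_reader {
  my @uris = @_;                  # the last twenty requests of one IP
  return 1 if @uris < 20;         # not enough data to judge
  my $mechanical = grep { m!/(raw|html|tag|diff|history)/! } @uris;
  return $mechanical <= 10;       # mostly page views → probably a human
}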
– Alex 2020-12-23 00:25 UTC
---
I don’t want it for Gemini, but Gemini is part of the greater Internet, so I have to deal with autonomous agents. If I didn’t, I wouldn’t have a Gemini server (or a gopher server, or a web server, or ...). Are you familiar with King Canute?
– Sean Conner 2020-12-23 06:22 UTC
---
Yeah, it’s true: we’re out in the open Internet and therefore we always have to defend against bots and crawlers, and I hate it. As for Cnut, he knew of the incoming tide and knew that he was powerless to command it. Yet he didn’t drown, he didn’t build his house where the tide would wash it away, nor plant his fields where they would drown, and neither do I feel obligated to welcome the crawling tide, or to accommodate the creators of the crawling tide, or bow respectfully as the crawlers eat my CPU and produce more CO₂. Instead, I will build fences to hold back the crawlers, and rebuke their creators, and tell anybody who thinks that autonomous agents crawling the net are the solution to their problem that either the problem does not need solving, or that their solution is lazy and they should try harder.
I liked it better when I wrote emails back and forth to the creator of the only crawler.
Perhaps I should write up a different proposal.
To add your site to this new search engine, you provide the URL of your own index. The index is a gzipped Berkeley DB where the keys are words (stemming and all that is optional on the search engine side; the index does not have to do this) and the values are URIs; furthermore, the URIs themselves are also keys, with the values being the ISO language code. I’d have to check how well that works, since I know nothing of search engines.
Even if the search engine wants to do trigram search, they can still do it, I think.
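To make that concrete, the writing side could look something like this sketch using DB_File, the Berkeley DB binding for Perl. How several URIs per word would be stored isn’t specified above, so the space-separated values are just an assumption.

#!/usr/bin/env perl
# Sketch of the proposed index: words → URIs, URIs → language code,
# all in one Berkeley DB file that gets gzipped for submission.
use Modern::Perl;
use DB_File;
use Fcntl;

tie my %index, 'DB_File', 'site-index.db', O_RDWR|O_CREAT, 0644, $DB_HASH
    or die "Cannot open site-index.db: $!";

sub add_page {
  my ($uri, $lang, $text) = @_;
  $index{$uri} = $lang;                          # URI → ISO language code
  my %words;
  $words{$_}++ for $text =~ /\w+/g;
  for my $word (keys %words) {                   # word → space-separated URIs
    my %uris = map { $_ => 1 } split(" ", $index{$word} // ""), $uri;
    $index{$word} = join " ", sort keys %uris;
  }
}

add_page("gemini://alexschroeder.ch/page/Test", "en", "just a test page");
untie %index;
# afterwards: gzip site-index.db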
If we don’t want to tie ourselves down, we could use a simple gemtext format:
=> URI all the unique words separated by spaces in any order
If the language is very important, we could use the language given in the header. I still think compression is probably important, so I’d say we use something like “text/gemini+gzip; lang=de-CH; charset=utf-8”.
Let’s give this a quick try:
#!/usr/bin/env perl
use Modern::Perl;
use File::Slurper qw(read_dir read_text);
use URI::Escape;

binmode STDOUT, ":utf8";

my $dir = shift or die "No directory provided\n";
my @files = read_dir($dir);

for my $file (@files) {
  my $data = read_text("$dir/$file");
  my %result;
  # parsing Oddmuse data files like mail or HTTP headers
  while ($data =~ /(\S+?): (.*?)(?=\n[^ \t]|\Z)/gs) {
    my ($key, $value) = ($1, $2);
    $value =~ s/\n\t/\n/g;
    $result{$key} = $value;
  }
  my $text = $result{text};
  next unless $text;
  # count the unique words on the page
  my %words;
  $words{$_}++ for $text =~ /\w+/g;
  # the page name is the file name without the .pg extension
  my $id = $file;
  $id =~ s/\.pg$//;
  $id = uri_escape($id);
  say "=> gemini://alexschroeder.ch/page/$id " . join(" ", keys %words);
}
Running it on a backup copy of my site:
index ~/Documents/Sibirocobombus/home/alex/alexschroeder/page \
  | gzip > alexschroeder.gmi.gz
“ls -lh alexschroeder.gmi.gz” tells me the resulting file is 149MB in size and “zcat alexschroeder.gmi.gz | wc -l” tells me it has 8441 lines.
I would have to build a proof of concept search engine to check whether this is actually a reasonable format for self-indexing and submitting indexes to search engines.
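A first cut could be as simple as this sketch: read the index lines from standard input, build a word → URI table, and answer a one-word query. The script name and all the details are made up.

#!/usr/bin/env perl
# Sketch of the consuming side: read "=> URI word word word …" lines,
# build word → URIs, answer a single-word query (case-insensitive,
# unlike the raw index).
use Modern::Perl;
binmode STDIN, ":utf8";
binmode STDOUT, ":utf8";

my $query = lc(shift // die "Usage: zcat index.gmi.gz | search WORD\n");
my %pages;                         # word → list of URIs
while (<STDIN>) {
  chomp;
  next unless s/^=>\s+(\S+)\s+//;  # keep only index lines, remember the URI
  my $uri = $1;
  push @{$pages{lc $_}}, $uri for split ' ', $_;
}
say for @{$pages{$query} // []};

Something like “zcat alexschroeder.gmi.gz | search lukin” would then list the matching pages.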
– Alex 2020-12-23 14:44 UTC