Since @alderwick@merveilles.town recently posted bot stats for a git server, I decided to look at the stats of my own. I used some Perl to try and get an idea of who was looking at my git repositories.
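Something along these lines, much simplified: it assumes Apache's combined log format, and it uses only a handful of the user agent patterns, not the full list from the table below.

    use strict;
    use warnings;

    my (%hits, %bandwidth);

    open(my $fh, '<', '/var/log/apache2/access.log')
        or die "Cannot open log: $!";
    while (my $line = <$fh>) {
        # combined log format: host ident user [date] "request" status size "referrer" "agent"
        next unless $line =~ /^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ (\S+) "[^"]*" "([^"]*)"/;
        my ($host, $size, $agent) = ($1, $2, $3);
        # simplified buckets; the real script needs many more patterns
        my $id = $agent =~ /Android/i    ? 'android'
               : $agent =~ m!^git/!      ? 'git'
               : $agent =~ /Googlebot/i  ? 'googlebot'
               : $agent =~ /Bytespider/i ? 'Bytespider'
               :                           'other';
        $hits{$id}++;
        $bandwidth{$id} += $size if $size =~ /^\d+$/;
    }

    my $total = 0;
    $total += $_ for values %hits;
    die "no hits found\n" unless $total;
    for my $id (sort { $hits{$b} <=> $hits{$a} } keys %hits) {
        printf "%-15s %8dK %5d %3d%%\n", $id,
            ($bandwidth{$id} // 0) / 1024, $hits{$id}, 100 * $hits{$id} / $total;
    }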
See the table below. It's a long table. Let's see what it contains.
The Apache log files say that the requests used about 139M of bandwidth in a 24h period with about 11000 hits.
About 73M of the bandwidth and nearly 70% of all hits are due to "android" browsers. They self-identify as something like "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.6422.65 Mobile Safari/537.36 (compatible; GoogleOther)". It probably warrants more probing. GoogleOther? Come on! If it really is true that 70% of the hits are actual people with a browser… nearly 8000 of them per day… wow. That is confusing.
About 1500 hits or 13% are requests made by git, accounting for some 55M of bandwidth. I still think that's a mind-blowing number of hits, but who knows. At least those look like legitimate user agents.
All the other requests are small fry at 3% or less: the usual spiders, like Google, Bytespider (TikTok), MJ12 (spam), localsearch (Switzerland), Facebook (do I read this as 37 people linking to stuff in 24h?), SemrushBot (a bad actor in the search engine optimization business), GPTBot (uuugh), and so on.
    +-----------------+-----------+------+------------+--------+
    | Identifier      | Bandwidth | Hits | Percentage | Delay  |
    +-----------------+-----------+------+------------+--------+
    | android         |    75255K | 7832 |        69% |    11s |
    | git             |    56700K | 1498 |        13% |    57s |
    | some browser    |     6138K |  755 |         6% |   113s |
    | googlebot       |     1203K |  342 |         3% |   251s |
    | Bytespider      |     1239K |  342 |         3% |   252s |
    | mj12bot         |     1069K |  306 |         2% |   205s |
    | localsearch.ch  |      233K |   46 |         0% |  1103s |
    | facebook        |      577K |   37 |         0% |  2365s |
    | SemrushBot      |      117K |   31 |         0% |  2533s |
    | iphone          |       84K |   15 |         0% |  1273s |
    | GPTBot          |       35K |   11 |         0% |  7003s |
    | github          |      158K |    8 |         0% |  7585s |
    | bingbot         |       23K |    6 |         0% | 10756s |
    | code.google.com |       27K |    4 |         0% |     5s |
    | yandex          |        7K |    2 |         0% |     1s |
    | Pleroma         |        3K |    2 |         0% |     0s |
    | openai.com      |        4K |    2 |         0% |     0s |
    | scrapy.org      |       15K |    2 |         0% |     2s |
    | wpbot           |        4K |    2 |         0% |     0s |
    | ipad            |        9K |    1 |         0% |     0s |
    +-----------------+-----------+------+------------+--------+
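The Delay column, by the way, shows the average time between two hits from the same identifier. For the busy agents, that works out to roughly the day divided by the number of hits:

    86400s / 7832 hits ≈ 11s between android hits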
Looking at the IP numbers and manually looking up the top ten results using `whois`, I replaced the IP numbers with the organizations behind them (a sketch of such a lookup follows the table):
    +---------------+------+--------+------+---------+--------------------------------+
    | Organisation  | Hits | Bandw. | Rel. | Interv. | Status                         |
    +---------------+------+--------+------+---------+--------------------------------+
    | Google        | 6485 |     9K |  57% |   13.3s | 200 (96%), 410 (3%), 404 (0%)  |
    | Google        | 1127 |    10K |  10% |   76.6s | 200 (95%), 410 (4%)            |
    | Comcast       |  984 |    39K |   8% |   59.2s | 200 (100%)                     |
    | Google        |  525 |    10K |   4% |  163.9s | 200 (93%), 410 (6%)            |
    | Comcast       |  466 |    36K |   4% |   58.2s | 200 (99%), 502 (0%)            |
    | Hetzner       |  272 |     3K |   2% |   97.2s | 410 (100%)                     |
    | China Telecom |  128 |     6K |   1% |    2.4s | 200 (100%)                     |
    | Google        |   91 |     5K |   0% |    1.9s | 200 (59%), 404 (36%), 500 (4%) |
    +---------------+------+--------+------+---------+--------------------------------+
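The lookup itself could be scripted, too. A rough sketch, assuming the `whois` command line tool is installed; the field names differ between registries, so this pattern only catches the common ones. Call it with the IP numbers as arguments.

    use strict;
    use warnings;

    # print the organization behind each IP number given on the command
    # line; OrgName is what ARIN uses, org-name and netname show up in
    # RIPE and other registries
    for my $ip (@ARGV) {
        my $output = `whois $ip`;
        my ($org) = $output =~ /^(?:OrgName|org-name|netname):\s*(.+)/mi;
        printf "%-15s %s\n", $ip, $org // 'unknown';
    }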
What are they doing in my repos? Out, all of them!
I decided to just add the following `robots.txt` to the `src.alexschroeder.ch` domain:
    User-agent: *
    Disallow: /
The web pages and git repositories are served by `legit` running on port 4027, with Apache acting as a proxy. So I added the `robots.txt` file to the `static` subdirectory of the `legit` installation and, in my Apache config for the site, rewrote requests for it from the top level to that location:
    # Legit
    <VirtualHost *:443>
        ServerAdmin alex@alexschroeder.ch
        ServerName src.alexschroeder.ch
        Include conf-enabled/blocklist.conf
        RewriteEngine On
        RewriteRule ^/robots.txt http://localhost:4027/static/robots.txt [proxy,last]
        ProxyPass / http://localhost:4027/
        SSLEngine on
    </VirtualHost>
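To check that the rewrite works, requesting the file from the outside should now return the two lines from above:

    curl https://src.alexschroeder.ch/robots.txt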
#Bots #Butlerian Jihad