2024-05-31 The bots crawling our repos

Since @alderwick@merveilles.town recently posted bot stats for a git server, I decided to look at the stats of my own. I used some Perl to try and get an idea of who was looking at my git repositories.

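Something like the following is the idea, a minimal sketch assuming Apache's default "combined" log format; the user agent patterns and the output columns are simplified stand-ins for what the real script does:

#!/usr/bin/perl
use strict;
use warnings;

my (%hits, %bytes, $total);
while (<>) {
    # combined log format: host ident user [date] "request" status size "referrer" "agent"
    next unless m/^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d+ (\d+|-) "[^"]*" "([^"]*)"/;
    my ($size, $agent) = ($1, $2);
    # classify the user agent using a few rough patterns (a stand-in for the real list)
    my $id = $agent =~ /Android/i   ? "android"
           : $agent =~ m!^git/!     ? "git"
           : $agent =~ /Googlebot/i ? "googlebot"
           : $agent =~ /GPTBot/     ? "GPTBot"
           : "some browser";
    $hits{$id}++;
    $bytes{$id} += $size unless $size eq "-";
    $total++;
}
for my $id (sort { $hits{$b} <=> $hits{$a} } keys %hits) {
    printf "%-15s %8dK %6d %3d%%\n",
        $id, ($bytes{$id} // 0) / 1024, $hits{$id}, 100 * $hits{$id} / $total;
}

Run it with the access log as argument, e.g. perl bots.pl /var/log/apache2/access.log (the script name and log path are just examples).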

See the table below. It's a long table. Let's see what it contains.

The Apache log files say that the requests used about 139M of bandwidth in a 24h period with about 11000 hits.

About 75M of bandwidth and nearly 70% of all hits are "android" browsers. They self-identify as something like "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.6422.65 Mobile Safari/537.36 (compatible; GoogleOther)". It probably warrants more probing. GoogleOther? Come on! If it really were true that 70% of the hits were actual people with browsers… nearly 8000 of them per day… wow. That is confusing.

About 57M of bandwidth and 1500 hits, or 13% of the total, are requests made by git. I still think it's a mind-blowing number of hits, but who knows. At least those look like legitimate user agents.

All the other requests are small fry at 3% or less: the usual spiders, like Google, Bytespider (TikTok), MJ12 (spam), localsearch (Switzerland), Facebook (do I read this as 37 people linking to stuff in 24h?), SemrushBot (a bad actor in search engine optimization), GPTBot (uuugh), and so on.

+-----------------+-----------+------+------------+--------+
|   Identifier    | Bandwidth | Hits | Percentage | Delay  |
+-----------------+-----------+------+------------+--------+
| android         | 75255K    | 7832 | 69%        | 11s    |
| git             | 56700K    | 1498 | 13%        | 57s    |
| some browser    | 6138K     |  755 | 6%         | 113s   |
| googlebot       | 1203K     |  342 | 3%         | 251s   |
| Bytespider      | 1239K     |  342 | 3%         | 252s   |
| mj12bot         | 1069K     |  306 | 2%         | 205s   |
| localsearch.ch  | 233K      |   46 | 0%         | 1103s  |
| facebook        | 577K      |   37 | 0%         | 2365s  |
| SemrushBot      | 117K      |   31 | 0%         | 2533s  |
| iphone          | 84K       |   15 | 0%         | 1273s  |
| GPTBot          | 35K       |   11 | 0%         | 7003s  |
| github          | 158K      |    8 | 0%         | 7585s  |
| bingbot         | 23K       |    6 | 0%         | 10756s |
| code.google.com | 27K       |    4 | 0%         | 5s     |
| yandex          | 7K        |    2 | 0%         | 1s     |
| Pleroma         | 3K        |    2 | 0%         | 0s     |
| openai.com      | 4K        |    2 | 0%         | 0s     |
| scrapy.org      | 15K       |    2 | 0%         | 2s     |
| wpbot           | 4K        |    2 | 0%         | 0s     |
| ipad            | 9K        |    1 | 0%         | 0s     |
+-----------------+-----------+------+------------+--------+

Looking at the IP numbers, I manually looked up the top results using `whois` and replaced each IP number with the organization behind it:

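Counting hits per IP number is a similar pass over the log file. A minimal sketch follows; the `whois` lookups themselves I did by hand, looking for the organisation name in the output:

#!/usr/bin/perl
use strict;
use warnings;

# count hits per client IP; the IP is the first field of every log line
my %hits;
while (<>) {
    my ($ip) = split;
    $hits{$ip}++ if defined $ip;
}

# print the busiest IP numbers, ready to be fed to whois by hand
my @top = sort { $hits{$b} <=> $hits{$a} } keys %hits;
for my $ip (grep { defined } @top[0 .. 9]) {
    printf "%6d %s\n", $hits{$ip}, $ip;
}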

+---------------+------+-----------+-------+----------+--------------------------------+
| Organisation  | Hits | Bandwidth | Share | Interval |             Status             |
+---------------+------+-----------+-------+----------+--------------------------------+
| Google        | 6485 | 9K        | 57%   | 13.3s    | 200 (96%), 410 (3%), 404 (0%)  |
| Google        | 1127 | 10K       | 10%   | 76.6s    | 200 (95%), 410 (4%)            |
| Comcast       |  984 | 39K       | 8%    | 59.2s    | 200 (100%)                     |
| Google        |  525 | 10K       | 4%    | 163.9s   | 200 (93%), 410 (6%)            |
| Comcast       |  466 | 36K       | 4%    | 58.2s    | 200 (99%), 502 (0%)            |
| Hetzner       |  272 | 3K        | 2%    | 97.2s    | 410 (100%)                     |
| China Telecom |  128 | 6K        | 1%    | 2.4s     | 200 (100%)                     |
| Google        |   91 | 5K        | 0%    | 1.9s     | 200 (59%), 404 (36%), 500 (4%) |
+---------------+------+-----------+-------+----------+--------------------------------+

What are they doing in my repos? Out, all of them!

I decided to just add the following `robots.txt` to the `src.alexschroeder.ch` domain:

User-agent: *
Disallow: /

The web pages and git repositories are served by legit running on port 4027, with Apache acting as a proxy. I did this by adding a robots.txt file to the `static` subdirectory of the `legit` installation and rewriting requests for the top-level robots.txt to that file in my Apache config for the site:


# Legit
<VirtualHost *:443>
    ServerAdmin alex@alexschroeder.ch
    ServerName src.alexschroeder.ch
    Include conf-enabled/blocklist.conf
    RewriteEngine On
    # proxy requests for /robots.txt to the static copy served by legit
    RewriteRule ^/robots.txt http://localhost:4027/static/robots.txt [proxy,last]
    # everything else goes straight to legit
    ProxyPass / http://localhost:4027/
    SSLEngine on
</VirtualHost>
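After reloading Apache, a quick check from the outside should return the two lines of the robots.txt file:

curl https://src.alexschroeder.ch/robots.txt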

#Bots #Butlerian Jihad