2024-10-18 Firefox/72

I’ve now seen this shit running on systems hosted by: Amazon, Google, Carnegie Mellon University, NVidia, Oracle, University of Texas, San Diego Super Computing Center, Alibaba, University of Edinburgh, Huawei, Intel, Coreweave, Samsung, Hong Kong University, University of Washington… and dozens, or perhaps hundreds more. This shit makes requests from the United States, Germany, China, Australia, Hong Kong, Singapore, United Kingdom, Netherlands, Portugal… and so on, and so on. – Block This Shit by @alexskunz@mas.to

I decided to see whether this bot was scraping my pages, too. Here's how to use network-lookup to see who's doing it:

grep "Firefox/72" /var/log/apache2/access.log \
 | tail -n 100 \
 | bin/admin/network-lookup \
 > result.log

What I intend to do is to get all the `ipset add …` instructions from that log file and add them to the ban hammer, ban-cidr.
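Before banning anything, a rough first look needs nothing but standard tools. Here's a minimal sketch, assuming the client IP is the first field of each log line (as in Apache's common and combined log formats). The function name `hits_per_ip` is made up for this example and is not part of network-lookup:

```shell
# Count requests per client IP for lines containing a user-agent substring.
# Assumption: the first whitespace-separated field is the client IP.
hits_per_ip() {
  pattern=$1
  logfile=$2
  grep -F -- "$pattern" "$logfile" \
    | awk '{ print $1 }' \
    | sort \
    | uniq -c \
    | sort -rn
}
```

Something like `hits_per_ip "Firefox/72" /var/log/apache2/access.log | head` would then list the busiest addresses first. It says nothing about who owns the addresses; that's what the lookup is for.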

Let's look at the organisations hosting that bot:

+------------------+------+------------------------------------+
|      Range       | Hits |                Org                 |
+------------------+------+------------------------------------+
| 160.91.0.0/16    |   15 | Oak Ridge National Laboratory      |
|                  |      | / OREN                             |
| 20.64.0.0/10     |   12 | MSFT / Microsoft Corporation       |
| 3.36.0.0/14      |   10 | AMAZON-ICN / AWS Asia Pacific      |
|                  |      | (Seoul) Region                     |
| 35.222.104.0/21  |    7 | Google LLC / GOOGLE-CLOUD          |
| 34.30.0.0/16     |    6 | GOOGL-2 / Google LLC               |
| 35.239.48.0/20   |    6 | Google LLC / GOOGLE-CLOUD          |
| 34.27.0.0/16     |    4 | GOOGL-2 / Google LLC               |
| 31.13.168.0/23   |    3 | NET-DEM-4SITSOLUTIONS /            |
|                  |      | TWK-NET-CUSTOMER1                  |
| 117.161.0.0/16   |    3 | CMNET / China Mobile /             |
|                  |      | ORG-CM1-AP / China Mobile          |
|                  |      | communications corporation         |
| 82.156.0.0/18    |    2 | IPv4 address block not             |
|                  |      | managed by the RIPE NCC /          |
|                  |      | NON-RIPE-NCC-MANAGED-ADDRESS-BLOCK |
| 34.121.48.0/20   |    2 | GOOGL-2 / Google LLC               |
| 35.223.240.0/20  |    2 | GOOGLE-CLOUD / Google LLC          |
| 35.225.64.0/20   |    2 | GOOGLE-CLOUD / Google LLC          |
| 3.34.0.0/15      |    2 | AMAZON-ICN / AWS Asia Pacific      |
|                  |      | (Seoul) Region                     |
| 15.164.0.0/15    |    2 | AT-88-Z / Amazon Technologies      |
|                  |      | Inc.                               |
| 34.123.224.0/20  |    1 | GOOGL-2 / Google LLC               |
| 34.45.0.0/16     |    1 | GOOGL-2 / Google LLC               |
| 47.252.0.0/18    |    1 | ALIBABA CLOUD - US                 |
| 45.38.206.0/24   |    1 | EGN-22 / EGIHosting                |
| 34.72.112.0/20   |    1 | Google LLC / GOOGL-2               |
| 121.30.0.0/16    |    1 | UNICOM-SX / CNC Group CHINA169     |
|                  |      | Shan1xi Province Network           |
| 104.198.48.0/20  |    1 | GOOGLE-CLOUD / Google LLC          |
| 34.68.176.0/20   |    1 | GOOGL-2 / Google LLC               |
| 34.68.96.0/20    |    1 | GOOGL-2 / Google LLC               |
| 34.16.0.0/17     |    1 | GOOGL-2 / Google LLC               |
| 34.122.64.0/20   |    1 | GOOGL-2 / Google LLC               |
| 35.239.16.0/20   |    1 | GOOGLE-CLOUD / Google LLC          |
| 202.120.234.0/24 |    1 | CERNET-CN / Beijing, 100084        |
| 104.154.160.0/20 |    1 | GOOGLE-CLOUD / Google LLC          |
| 34.70.112.0/20   |    1 | Google LLC / GOOGL-2               |
| 35.224.160.0/20  |    1 | GOOGLE-CLOUD / Google LLC          |
| 34.66.128.0/20   |    1 | GOOGL-2 / Google LLC               |
| 34.133.16.0/20   |    1 | Google LLC / GOOGL-2               |
| 109.171.128.0/18 |    1 | KAUST Section 1 / King             |
|                  |      | Abdullah University of             |
|                  |      | Science and Technology             |
|                  |      | / ORG-KAUo2-RIPE /                 |
|                  |      | SA-KAUST-20091118                  |
| 34.68.144.0/20   |    1 | Google LLC / GOOGL-2               |
| 8.130.0.0/16     |    1 | Alibaba.com Singapore              |
|                  |      | E-Commerce Private Limited /       |
|                  |      | ALICLOUD                           |
| 34.122.128.0/20  |    1 | GOOGL-2 / Google LLC               |
+------------------+------+------------------------------------+

To get a feel for the level of incompetence of the engineers behind it, check out this example:

20.64.0.0/10 | 20.112.49.107 | 18/Oct/2024:14:58:37 +0200 | GET /cw?action=download;id=MattisManzelPortrait HTTP/1.1 | Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0
20.64.0.0/10 | 20.112.49.107 | 18/Oct/2024:14:58:37 +0200 | GET /cw?action=download;id=MattisManzelPortrait HTTP/1.1 | Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0
20.64.0.0/10 | 20.112.49.107 | 18/Oct/2024:14:58:38 +0200 | GET /cw?action=download;id=MattisManzelPortrait HTTP/1.1 | Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0
20.64.0.0/10 | 20.112.49.107 | 18/Oct/2024:14:58:39 +0200 | GET /wiki?action=download;id=MattisManzelPortrait HTTP/1.1 | Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0
20.64.0.0/10 | 20.112.49.107 | 18/Oct/2024:14:58:40 +0200 | GET /cw?action=download;id=MattisManzelPortrait HTTP/1.1 | Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0
20.64.0.0/10 | 20.112.49.107 | 18/Oct/2024:14:58:40 +0200 | GET /cw?action=download;id=MattisManzelPortrait HTTP/1.1 | Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0
20.64.0.0/10 | 20.112.49.107 | 18/Oct/2024:14:58:41 +0200 | GET /cw?action=download;id=MattisManzelPortrait HTTP/1.1 | Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0
20.64.0.0/10 | 20.112.49.107 | 18/Oct/2024:14:58:41 +0200 | GET /wiki?action=download;id=MattisManzelPortrait HTTP/1.1 | Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0
20.64.0.0/10 | 20.112.49.107 | 18/Oct/2024:14:58:42 +0200 | GET /cw?action=download;id=MattisManzelPortrait HTTP/1.1 | Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0
20.64.0.0/10 | 20.112.49.107 | 18/Oct/2024:14:58:42 +0200 | GET /cw?action=download;id=MattisManzelPortrait HTTP/1.1 | Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0
20.64.0.0/10 | 20.112.49.107 | 18/Oct/2024:14:58:43 +0200 | GET /cw?action=download;id=MattisManzelPortrait HTTP/1.1 | Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0
20.64.0.0/10 | 20.112.49.107 | 18/Oct/2024:14:58:43 +0200 | GET /wiki?action=download;id=MattisManzelPortrait HTTP/1.1 | Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0

Every fucking second‽
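To put a number on that, here's a minimal sketch that counts requests per IP and second. It assumes input lines in exactly the pipe-separated form shown above (range, IP, timestamp, request, user agent); `requests_per_second` is a made-up name:

```shell
# Count requests per IP address and timestamp (one-second resolution).
# Assumption: fields are separated by " | " in the order
#   range | ip | timestamp | request | user agent
requests_per_second() {
  awk -F ' [|] ' '{ count[$2 " " $3]++ }
       END { for (key in count) print count[key], key }' \
    | sort -k2
}
```

Fed the lines above, each output line starts with the number of requests, followed by the IP and the second they arrived in.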

Now, a decision needs to be made:

Guess which option I'm taking?

Remember: 2023-10-04 Search engines, the deal is off!:

For a while it seemed that we all benefited from search engines – authors and readers both. These days, you’ll find that search results are full of garbage sites. Big sites with the most flatulent of pages explaining in great detail why the thing you’re looking for is important and how to do it, clearly optimized for an ad company and not for a reader. Big sites that have a gazillion answers are preferred over small and individual sites. Perhaps that’s easier. Perhaps it allows them to diffuse responsibility for the garbage, I don’t know. The effect is, in any case, that there is no benefit to search engines for small site authors, either. I was unable to find my own pages on the search engines. If you are a small site owner and you think you can find your own pages on Google and Bing, I suspect that’s because they track you. Try it on a different computer, anonymously. Perhaps you won’t find yourself, either.
In any case, if I can’t get anything in return, both as a reader and as an author, I feel that the deal is off. Why let them feed on my words for free? Nay, at a cost, since they are keeping my website busy, producing CO₂ and heating the planet for no benefit at all.
Better to block them all.

Here we go:

grep ipset result.log | sh

That's because `network-lookup` is kind enough to include the appropriate `ipset` instructions:

ipset add banlist 160.91.0.0/16 # Oak Ridge National Laboratory / OREN
ipset add banlist 20.64.0.0/10 # MSFT / Microsoft Corporation
ipset add banlist 3.36.0.0/14 # AMAZON-ICN / AWS Asia Pacific (Seoul) Region
ipset add banlist 35.222.104.0/21 # Google LLC / GOOGLE-CLOUD
ipset add banlist 34.30.0.0/16 # GOOGL-2 / Google LLC
ipset add banlist 35.239.48.0/20 # Google LLC / GOOGLE-CLOUD
ipset add banlist 34.27.0.0/16 # GOOGL-2 / Google LLC
ipset add banlist 31.13.168.0/23 # NET-DEM-4SITSOLUTIONS / TWK-NET-CUSTOMER1
ipset add banlist 117.161.0.0/16 # CMNET / China Mobile / ORG-CM1-AP / China Mobile communications corporation
ipset add banlist 82.156.0.0/18 # IPv4 address block not managed by the RIPE NCC / NON-RIPE-NCC-MANAGED-ADDRESS-BLOCK
ipset add banlist 34.121.48.0/20 # GOOGL-2 / Google LLC
ipset add banlist 35.223.240.0/20 # GOOGLE-CLOUD / Google LLC
ipset add banlist 35.225.64.0/20 # GOOGLE-CLOUD / Google LLC
ipset add banlist 3.34.0.0/15 # AMAZON-ICN / AWS Asia Pacific (Seoul) Region
ipset add banlist 15.164.0.0/15 # AT-88-Z / Amazon Technologies Inc.
ipset add banlist 34.123.224.0/20 # GOOGL-2 / Google LLC
ipset add banlist 34.45.0.0/16 # GOOGL-2 / Google LLC
ipset add banlist 47.252.0.0/18 # ALIBABA CLOUD - US
ipset add banlist 45.38.206.0/24 # EGN-22 / EGIHosting
ipset add banlist 34.72.112.0/20 # Google LLC / GOOGL-2
ipset add banlist 121.30.0.0/16 # UNICOM-SX / CNC Group CHINA169 Shan1xi Province Network
ipset add banlist 104.198.48.0/20 # GOOGLE-CLOUD / Google LLC
ipset add banlist 34.68.176.0/20 # GOOGL-2 / Google LLC
ipset add banlist 34.68.96.0/20 # GOOGL-2 / Google LLC
ipset add banlist 34.16.0.0/17 # GOOGL-2 / Google LLC
ipset add banlist 34.122.64.0/20 # GOOGL-2 / Google LLC
ipset add banlist 35.239.16.0/20 # GOOGLE-CLOUD / Google LLC
ipset add banlist 202.120.234.0/24 # CERNET-CN / Beijing, 100084
ipset add banlist 104.154.160.0/20 # GOOGLE-CLOUD / Google LLC
ipset add banlist 34.70.112.0/20 # Google LLC / GOOGL-2
ipset add banlist 35.224.160.0/20 # GOOGLE-CLOUD / Google LLC
ipset add banlist 34.66.128.0/20 # GOOGL-2 / Google LLC
ipset add banlist 34.133.16.0/20 # Google LLC / GOOGL-2
ipset add banlist 109.171.128.0/18 # KAUST Section 1 / King Abdullah University of Science and Technology / ORG-KAUo2-RIPE / SA-KAUST-20091118
ipset add banlist 34.68.144.0/20 # Google LLC / GOOGL-2
ipset add banlist 8.130.0.0/16 # Alibaba.com Singapore E-Commerce Private Limited / ALICLOUD
ipset add banlist 34.122.128.0/20 # GOOGL-2 / Google LLC
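Piping grep straight into sh runs any line that happens to mention ipset. A more cautious sketch, assuming the lines look exactly like the ones above, passes on only well-formed `ipset add banlist <CIDR>` commands and strips the trailing comment. The name `ipset_commands` is made up, and the set `banlist` is assumed to exist already with a type that accepts networks:

```shell
# Keep only lines of the exact form
#   ipset add banlist A.B.C.D/N [# comment]
# and drop the comment, so nothing unexpected reaches the shell.
ipset_commands() {
  grep -E '^ipset add banlist ([0-9]{1,3}\.){3}[0-9]{1,3}/[0-9]{1,2}( #.*)?$' \
    | sed 's/ #.*$//'
}
```

Run `ipset_commands < result.log` on its own first to review the output, then append `| sh` once it looks right.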

Thanks for nothing, leeches!

Oh, and of course I have an Apache config file called "blocklist.conf" where I added the following:

# Deny the image scraper
# https://imho.alex-kunz.com/2024/02/25/block-this-shit/
RewriteCond "%{HTTP_USER_AGENT}" "Firefox/72.0" [nocase]
RewriteRule ^ https://alexschroeder.ch/nobots [redirect=410,last]

​#Administration ​#Butlerian Jihad

And more!

ipset add banlist 198.82.0.0/16 # Virginia Polytechnic Institute and State Univ. / VPI-BLK
ipset add banlist 34.31.0.0/16 # Google LLC / GOOGL-2
ipset add banlist 34.72.32.0/20 # GOOGL-2 / Google LLC
ipset add banlist 35.238.176.0/20 # Google LLC / GOOGLE-CLOUD
ipset add banlist 34.173.0.0/17 # Google LLC / GOOGL-2
ipset add banlist 35.223.128.0/20 # GOOGLE-CLOUD / Google LLC
ipset add banlist 34.172.128.0/17 # Google LLC / GOOGL-2
ipset add banlist 35.202.224.0/20 # Google LLC / GOOGLE-CLOUD
ipset add banlist 34.69.112.0/20 # GOOGL-2 / Google LLC
ipset add banlist 34.136.224.0/20 # Google LLC / GOOGL-2
ipset add banlist 72.52.64.0/18 # Hurricane Electric LLC / HURRICANE-8