The MJ12Bot [1] is the first robot listed in Wikipedia's [2] robots.txt [3] file, which I find amusing for obvious reasons [4]. In the Hacker News comments [5] there's a thread [6] specifically about the MJ12Bot, and I replied to a comment about blocking it [7]. That's not as easy as it sounds, because it's a distributed bot that used 136 unique IP (Internet Protocol) addresses in the past month alone. Because of that comment, I decided I should expand on some of those numbers here.
The first table shows the number of addresses from January through June of 2019, to demonstrate they're not all from a single netblock. The address format “A.B.C.D” represents a unique IP address, like 172.16.15.2; “A.B.C” represents the range 172.16.15.0 to 172.16.15.255; “A.B” represents the range 172.16.0.0 to 172.16.255.255; and finally, “A” represents the range 172.0.0.0 to 172.255.255.255.
Table: Number of distinct IP addresses used by MJ12Bot in 2019 when hitting my site

Address format    number
------------------------------
A.B.C.D              312
A.B.C                256
A.B                   86
A                     53
Next are the unique addresses from all of 2018 used by MJ12Bot:
Table: Number of distinct IP addresses used by MJ12Bot in 2018 when hitting my site

Address format    number
------------------------------
A.B.C.D              474
A.B.C                370
A.B                  125
A                     66
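For what it's worth, here's a rough sketch of how counts like these can be generated. It's not the exact script I used, and it assumes an Apache-style access log on standard input with the client address as the first field:

#!/usr/bin/env python3
# Rough sketch: count distinct MJ12Bot addresses at each prefix length.
# Assumes an Apache-style access log on standard input, with the client
# IP address as the first field and the user agent somewhere in the line.

import sys

ips = set()
for line in sys.stdin:
    if "MJ12bot" not in line:
        continue
    ips.add(line.split()[0])    # first field is the client address

# Count unique prefixes at each granularity: A.B.C.D, A.B.C, A.B, A
for parts in (4, 3, 2, 1):
    label    = ".".join("ABCD"[:parts])
    prefixes = {".".join(ip.split(".")[:parts]) for ip in ips}
    print(f"{label:<14} {len(prefixes)}")

Fed an access log, it prints one line per address format, in the same shape as the tables above.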
This wide distribution easily explains why Wikipedia found that it ignored any rate limits that were set. Each individual node of MJ12Bot probably followed the rate limit, but it's a hard problem to coordinate across … what? 500 machines around the world? If each of those 500 nodes independently honored, say, a 10-second crawl delay, the site could still see a request every 20 milliseconds on average.
It seems the best bet is to ban MJ12Bot via robots.txt:
User-agent: MJ12bot
Disallow: /
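If you want to convince yourself the rule does what you expect, Python's standard urllib.robotparser can check it (the URL below is just a made-up example):

from urllib.robotparser import RobotFileParser

# Parse the two-line robots.txt snippet above and see which crawlers it blocks.
rules = [
    "User-agent: MJ12bot",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("MJ12bot",   "https://example.com/some/page"))  # False - blocked
print(rp.can_fetch("Googlebot", "https://example.com/some/page"))  # True - unaffected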
While I haven't added MJ12Bot to my own robots.txt [8] file, it hasn't hit my site since they removed me from their crawl list [9], so it appears it can be tamed.
[2] https://www.wikipedia.org/
[3] https://en.wikipedia.org/robots.txt
[5] https://news.ycombinator.com/item?id=20453189
[6] https://news.ycombinator.com/item?id=20453542
[7] https://news.ycombinator.com/item?id=20455003