2023-07-16 Bots crawling my sites

First day back home after a nice two-week holiday in the mountains. I’m looking at my web server logs. 😠

I’m thinking about blocking all bots from my website. But where to start? How about this: check the access.log file (I use Apache as my web server). If the User-Agent field contains the word “bot”, that sounds like a candidate. Let’s see!

First, let’s pick 24 hours’ worth of logs.

grep ^alexschroeder /var/log/apache2/access.log.1 | wc -l
15875

This is yesterday’s log file and it has about 16k hits for my site.

perl -ne 'print "$1\n" if /"([^"]*bot[^"]*)"$/i' \
  < /var/log/apache2/access.log.1 \
  | sort | uniq -c | sort -n | tail

This takes a line from the access.log like this: `www.emacswiki.org:443 66.249.64.38 - - [16/Jul/2023:00:00:44 +0200] "GET /emacs/SmtpAuth HTTP/1.1" 200 11056 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"` and checks if the last thing in double quotes contains the word “bot”.

Manually massaging the tail end of the results:

26 Pleroma
36 DataForSeoBot 😒
47 Gwene
50 DotBot → Moz → SEO 😒
84 Googlebot-Image 😒
139 YandexBot 😒
189 Googlebot 😒
254 bingbot 😒
333 Googlebot 😒

Gwene is an RSS-to-News thing but also has Googlebot in its name? Weird!

I’m already blocking all Pleroma, Mastodon and Friendica because of the useless previews they try to generate, but the hits still count, of course.

RewriteEngine on
# Fediverse instances asking for previews: protect the expensive endpoints
RewriteCond %{REQUEST_URI} /(wiki|cgit|download|food|paper|hug|helmut|input|korero|check|radicale|say|mojo|software)
RewriteCond %{HTTP_USER_AGENT} Mastodon|Friendica|Pleroma [nocase]
# then it's forbidden
RewriteRule ^(.*)$ - [forbidden,last]
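
A quick way to check a rule like this from the command line is to fake the user agent with curl. A sketch, using one of my own endpoints as an example; the expected answer is a 403:

curl -sI -A "Pleroma" https://alexschroeder.ch/wiki | head -n 1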

It gets weirder when I group by all my websites: the log file I’m looking at has about 163k hits in total.
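
A quick way to confirm that total is to just count the lines (same file as above):

wc -l < /var/log/apache2/access.log.1

Counting and grouping the bot hits per site: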

perl -ne 'print "$1 $2\n" if /^([^:]*).*"([^"]*bot[^"]*)"$/i' \
  < /var/log/apache2/access.log.1 \
  | sort | uniq -c | sort -n | tail

The last ten results, massaged:

592 campaignwiki.org Googlebot 😒
694 www.emacswiki.org YandexBot 😒
729 flying-carpet.ch MJ12bot 😒
959 www.emacswiki.org Gwene 😒
1737 www.emacswiki.org magpie-crawler 😒
1951 www.emacswiki.org bingbot 😒
2198 www.emacswiki.org Googlebot 😒
2807 www.emacswiki.org Googlebot 😒
2878 www.emacswiki.org EyeMonIT Uptime Bot 😒
13728 www.emacswiki.org SeekportBot 😒

So much shit that needs blocking!

OK, I don’t feel like manually editing the result list any more. And I don’t feel like looking at bots that I’m already blocking. The following checks that the log line contains a 200 surrounded by spaces (meaning an “OK” status response instead of some sort of error code), extracts the word containing “bot”, and prints it.

perl -ne 'print "$2\n" if / 200 .*"([^"]*?([a-z]*bot[a-z]*)[^"]*)"$/i' \
  < /var/log/apache2/access.log.1 \
  | sort | uniq -c | sort -n | tail
     99 DotBot
    123 MojeekBot
    157 AwarioBot
    283 SemanticScholarBot
    430 Applebot
    838 bot
   1349 Bot
   2603 bingbot
   4524 Googlebot
   8349 SeekportBot

A more interesting list! Let’s see.

An interesting question: What do you know about the Mojeek search engine? I like independent search engines!

What about this one? “Brand management made simple. Track the conversations about your business across social media, news, blogs, videos, forums, and reviews.” The Awario bot gets blocked for sure!

“Scholar” sounds great but this sounds like a scam: “The Semantic Scholar bot crawls certain domains to find academic PDFs. These PDFs are served on semanticscholar.org (opens in a new tab) so researchers can discover and understand other academic accomplishments.” The PDFs are served on a different domain? Sounds like copyright violation to me. Or perhaps the scientific journals are looking for pirate copies? In any case, since I don’t write academic PDFs, this bot gets kicked in the butt.

OK, so what about Seekport? It seems to be a German search engine. I like independent search engines. But if you click around a bit, you find this self-description: “Seit 2003 die zentrale Anlaufstelle für aktuelle SEO-News, tiefgründige Datenanalysen & Meinungen zu aktuellen Trends der Plattformökonomie.” In other words: the central hub for current SEO news, in-depth data analysis and opinions on current trends in the platform economy, since 2003… ?? Ugh! Bloooooock.

Right now, personal sites like my diary get this piece of code in their top-level “.htaccess” file since Google and company aren’t going to return my pages as part of their results, anyway. At least they won’t be used for AI training!

RewriteEngine on
# Deny all bots
RewriteCond %{HTTP_USER_AGENT} "bot" [nocase]
RewriteRule ^ nobots.html [last]
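
To see the rule in action, fetch any page with a user agent that contains “bot” and the nobots page should come back. A sketch (the user agent string is made up; -L follows a redirect in case the rule redirects instead of rewriting):

curl -sL -A "SomeExampleBot/1.0" https://alexschroeder.ch/ | head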

The nobots page simply says to contact me if there is a problem.

The default for all my sites goes into “/etc/apache2/conf-enabled/blocklist.conf” and says:

RewriteEngine on
# SEO bots and other shit (for Emacs Wiki)
RewriteCond "%{HTTP_USER_AGENT}" "pcore|megaindex|semrushbot|wiederfrei|eyemonit|yandexbot|magpie-crawler|mj12bot|seekportbot|dotbot|awariobot|semanticscholarbot|seokicks-robot|ahrefsbot|trendictionbot|linkfluence|startmebot|dataforseobot" [nocase]
RewriteRule ^(.*)$ - [forbidden,last]

This is what the non-personal sites like Emacs Wiki get.
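
Unlike rules in a “.htaccess” file, anything under “conf-enabled” only takes effect after Apache re-reads its configuration. A minimal sketch, assuming a Debian-style setup:

apache2ctl configtest && systemctl reload apache2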

Oh, and let’s not forget the people looking for misconfigured admin consoles written in PHP:

RewriteEngine on
# Deny all idiots that are looking for borked PHP applications
# Status Code 402 is "Payment Required".
RewriteRule \.php$ - [redirect=402]
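
To get a feel for how many of these probes arrive, a rough count over the same log file (this simply matches “.php” anywhere in the line):

grep -c '\.php' /var/log/apache2/access.log.1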

I really need to work on the code that blocks entire ASNs.
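
The building block for that is mapping an offending IP address to its ASN. One way to do that is Team Cymru’s whois service (a sketch, using a documentation address):

whois -h whois.cymru.com " -v 203.0.113.42"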

​#Web ​#Administration ​#Bots ​#Butlerian Jihad

Comments

(Please contact me if you want to remove your comment.)

Is any bot respecting robots.txt? Half serious question 🙂

– jjm 2023-07-16 19:33 UTC

---

I know that my blocklist.conf file started because of bots that disregarded robots.txt files. These days I often wonder: where’s the easiest place to update a list just once and never have to deal with it again? And instant blocking seems like the easier solution since I have Apache server config access on my server.

Also, some bots are listed here as not checking robots.txt at all, e.g. facebot:

“Wird die robots.txt ausgelesen? Nein.” (Does it read robots.txt? No.) – robots db

– Alex 2023-07-16 20:24 UTC

---

OK, what about the long tail, though?

perl -ne 'print "$2\n" if / 200 .*"([^"]*?([a-z]*bot[a-z]*)[^"]*)"$/i' < /var/log/apache2/access.log.1 | sort | uniq -c | sort -n
      1 BaudBot
      1 bobbinsrobots
      1 bottle
      1 cabotcove
      1 FullStoryBot
      1 GoogleBot
      1 LivelapBot
      1 PodheroBot
      1 redditbot
      1 robot
      1 robotics
      1 TelegramBot
      1 URLSuMaBot
      1 WebwikiBot
      1 YandexBot
      1 YandexRenderResourcesBot
      2 googlebot
      2 jaddjabot
      2 Pinterestbot
      2 PodBotLP
      2 Robot
      2 Semanticbot
      2 Wibybot
      2 ZumBot
      3 DuckDuckBot
      3 PixelFedBot
      3 SeznamBot
      3 SurdotlyBot
      3 WellKnownBot
      4 ArchiveBot
      4 AwarioSmartBot
      4 Newslitbot
      4 SerendeputyBot
      5 AcademicBotRTU
      5 BitSightBot
      5 Mediatoolkitbot
      7 tapbots
      7 TheFeedReaderBot
      7 trendictionbot
      8 Discordbot
      8 SiteAuditBot
     12 AhrefsBot
     14 PetalBot
     15 feedbot
     16 BLEXBot
     19 coccocbot
     19 PaperLiBot
     19 yacybot
     22 botsin
     24 startmebot
     24 SummalyBot
     26 Bingbot
     29 ZoominfoBot
     35 bots
     36 Twitterbot
     38 Elisabot
     43 FeedlyBot
     44 MojeekBot
     50 DomainStatsBot
     64 DataForSeoBot
     77 Facebot
     85 DotBot
    429 SemanticScholarBot
    467 bot
    540 Applebot
    932 Bot
   2587 bingbot
   5032 Googlebot
  18369 SeekportBot

Some of them need investigation!

First, the bots that might be OK for “public service” sites like Emacs Wiki, Community Wiki, Campaign Wiki, Oddmuse.

ArchiveBot

Feed readers are OK as long as they don’t use AI and as long as they just get the feed.

Bots that get blocked:

Right-wing misinformers and bad actors have already earned tens of thousands of dollars under Twitter’s new ad revenue sharing program

– Alex 2023-07-16 20:59 UTC

---

I’m torn about search engines. On the one hand, they provide a service by allowing end users to search, but on the other hand, they are part of the surveillance capitalist societies, tracking visitors and selling information to companies. As time passes, the deal appears to be ever more in my disfavour: My sites cannot be found and yet they are used to train the large language models of our inane enshittified future where the world is full of garbage produced using them.

So, ideally, I only want to serve information to services that are entirely focused on the public service.

– Alex 2023-07-17 07:22 UTC

---

If you visit one of these search engines with Firefox, note how the Search bar changes to add a little green plus sign next to the magnifying glass. That’s how you add the search engine you’re looking at to your collection!

https://lieu.cblgh.org/

https://search.marginalia.nu/

https://wiby.org/

– Alex 2023-07-17 14:31 UTC

---

Years ago I wrote a tool to parse the Apache log files and print out fields requested (if you want, I can mail you a copy of the program). I just ran the tool over the log file from my blog for June:

[spc]brevard:~/web/logs.archive/2023/06>escanlog -agent boston.conman.org | sort | uniq -c | sort -rn | more
  24915 Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 Firefox/78.0
  21404 Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)
  18740 Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
  12324 Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)
   9500 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
   9380 Mozilla/5.0 (compatible; SeekportBot; +https://bot.seekport.com)
   8806 WF search/Nutch-1.12
   8053 CommaFeed/2.6.0 (https://github.com/Athou/commafeed)
   6657 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
   6558 Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0
   6120 Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)
   5663 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/103.0.5060.134 Safari/537.36
   5464 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36
   5078 Newsboat/2.31.0 (Linux x86_64)
   4775 CCBot/2.0 (https://commoncrawl.org/faq/)
   4450 Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com)
   4196 Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/100.0.4889.0 Safari/537.36
   3545 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot)
   3170 Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36
   2914 Mozilla/5.0 (compatible; Miniflux/2.0.44; +https://miniflux.app)
   2766 Tiny Tiny RSS/23.05-a4543de (https://tt-rss.org/)
   2324 Mozilla/5.0 (compatible; Barkrowler/0.9; +https://babbar.tech/crawler)
   2126 Tiny Tiny RSS/UNKNOWN (Unsupported, Git error) (https://tt-rss.org/)
   2107 Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)
   2044 Mozilla/5.0 (compatible; Miniflux/2.0.43; +https://miniflux.app)
   1985 facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)

Not all of them use “bot” in their name; some use “crawler”, and at least one uses “spider”. I feel like banning bots is a lot like “whack-a-mole”: for every one you knock down, another pops up. For me, as long as they make valid requests, it’s fine, but I did get one of the worst offenders to stop crawling me.

– Sean Conner 2023-07-17 18:21 UTC

---

Oh yes, MJ12Bot has been on the banned list for ages!
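
Good point about “crawler” and “spider”, too: the one-liner above only catches “bot”. A widened version might look like this (a sketch, same log file as before):

perl -ne 'print "$2\n" if / 200 .*"([^"]*?([a-z]*(?:bot|crawler|spider)[a-z]*)[^"]*)"$/i' \
  < /var/log/apache2/access.log.1 \
  | sort | uniq -c | sort -n | tail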

– Alex 2023-07-17 20:24 UTC

---

First: I’m enjoying reading your comments and some sort of internal dialog 🙂

My approach is more relaxed, and I don’t take action unless there’s something affecting the service, which is very rare, and then I tend to throttle those via iptables (no matter whether it’s a bot or not), which is reactive and not very effective.

Are you planning to collect this somewhere for people to find? I have found some blocklists for Apache and nginx, but they all seem too old to be useful, and then there’s https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker

– jjm 2023-07-18 07:21 UTC

---

A long time ago I had copied a similar list: 2020-12-22 Apache config file to block user agents. But I think it started breaking when browser versions reached 100, and I decided to abolish it. So right now, all I have is this:

RewriteEngine on

# Fediverse instances asking for previews: protect the expensive endpoints
RewriteCond %{REQUEST_URI} /(wiki|cgit|download|food|paper|hug|helmut|input|korero|check|radicale|say|mojo|software)
RewriteCond %{HTTP_USER_AGENT} Mastodon|Friendica|Pleroma [nocase]
# then it's forbidden
RewriteRule ^(.*)$ - [forbidden,last]

# SEO bots and other shit (for Emacs Wiki)
RewriteCond "%{HTTP_USER_AGENT}" "academicbotrtu|ahrefsbot|awariobot|bitsightbot|blexbot|dataforseobot|discordbot|domainstatsbot|dotbot|elisabot|eyemonit|facebot|linkfluence|magpie-crawler|megaindex|mediatoolkitbot|mj12bot|newslitbot|paperlibot|pcore|petalbot|pinterestbot|seekportbot|semanticscholarbot|semrushbot|semanticbot|seokicks-robot|siteauditbot|startmebot|summalybot|synapse|trendictionbot|twitterbot|wiederfrei|yandexbot|zoominfobot" [nocase]
RewriteRule ^(.*)$ - [forbidden,last]

# Deny all idiots that are looking for borked PHP applications
# Status Code 402 is "Payment Required".
RewriteRule \.php$ - [redirect=402]

# Private sites block all bots and crawlers. This list does not include
# social.alexschroeder.ch, communitywiki.org, www.emacswiki.org,
# oddmuse.org, orientalisch.info, korero.org.
RewriteCond "%{HTTP_HOST}" "^(alexschroeder\.ch|flying-carpet\.ch|next\.oddmuse\.org|((chat|talk)\.)?campaignwiki\.org|((archive|vault|toki|xn--vxagggm5c)\.)?transjovian\.org)$" [nocase]
RewriteCond "%{HTTP_USER_AGENT}" "!archivebot|^gwene" [nocase]
RewriteCond "%{HTTP_USER_AGENT}" "bot|crawler" [nocase]
RewriteRule ^ https://alexschroeder.ch/nobots [redirect,last]

The last section can be part of a top-level “.htaccess” file if your site is static, but in my case, the site config file has Apache act as a reverse proxy for a bunch of URLs such as “/wiki”. Thus, the request is handed off to the wiki server before the “.htaccess” file in the document directory is read. In order to prevent that, the rules need to be in the global web server configuration. And I think I like it better this way: all the blocking stuff is in this one “blocklist.conf” file.

– Alex 2023-07-18 09:22 UTC

---

*Gwene is an RSS-to-News thing but also has Googlebot in its name? Weird!*

I think that was added because some servers allow requests if that specific string is in the User-Agent, but not otherwise (probably also allowing “real browser”-User-Agents, but they are harder to guess/emulate).

Regarding whack-a-mole: I use fail2ban to block obvious offenders automatically. I match on requests for e.g. anything with .php in the URL, WordPress admin URLs, failed blog comment attempts, script-kiddie requests, etc.

In the big picture it makes no difference, but it runs automatically and it feels nice every time I see a bot “banned”.

– Adam Sjøgren 2023-07-18 10:26 UTC

---

Yeah! I also use fail2ban to ban IP numbers that request too many pages within a certain time window: 2019-01-20 fail2ban to watch over my sites.

In “/etc/fail2ban/filter.d/alex-apache.conf”:

[Definition]
# ANY match in the logfile counts!
failregex = ^[^:]+:[0-9]+ <HOST>

And “/etc/fail2ban/jail.d/alex.conf”:

[alex-apache]
enabled = true
port    = http,https
logpath = %(apache_access_log)s
findtime = 40
maxretry = 20

[recidive]
enabled = true

The effect is that anybody clicking “too fast” (more than 20 pages in 40 seconds) gets banned for 10 minutes; people who get banned this way three times in one day get banned for one week. (These are the default “recidive” settings.)
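
To check that the filter actually matches the access log lines, fail2ban ships a test tool. A sketch:

fail2ban-regex /var/log/apache2/access.log /etc/fail2ban/filter.d/alex-apache.conf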

The only thing I’d like to add is that if three IP numbers from the same ASN are banned, the entire ASN should be banned for a week.

– Alex 2023-07-18 10:39 UTC

---

Ooooh, this is terrible, so many bots after bots after bots... I see Applebot, what’s that?

As for Petal, I am surprised that it does not work right now; it is the default search engine on Huawei phones - not sure if evil, but Chinese, that’s for sure 😀

– Peter Kotrčka 2023-07-18 21:38 UTC

---

Applebot is the web crawler for Apple. Products like Siri and Spotlight Suggestions use Applebot. – About Applebot

If I click on petalsearch.com, nothing happens – pings are returned but there is no website coming up. I don’t know. Perhaps they’re checking the user agent string, too, answering only for Huawei phones? 😀

– Alex 2023-07-19 06:21 UTC

---

@sam posts:

User-agent: GPTBot
Disallow: /

– Alex 2023-08-12 12:38 UTC