💾 Archived View for gmi.runtimeterror.dev › blocking-ai-crawlers › index.gmi captured on 2024-05-10 at 10:40:07. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
2024-04-12 ~ 2024-04-14
I've seen some recent posts from folks like Cory Dransfeldt [1] and Ethan Marcotte [2] about how (and *why*) to prevent your personal website from being slurped up by the crawlers that AI companies use to actively enshittify the internet [3]. I figured it was past time for me to hop on board with this, so here we are.
[3] actively enshittify the internet
My initial approach was to use Hugo's robots.txt templating [4] to generate a `robots.txt` file based on a list of bad bots I got from ai.robots.txt on GitHub [5].
[4] Hugo's robots.txt templating
I dumped that list into my `config/params.toml` file, *above* any of the nested elements (since TOML is kind of picky about that: any key that follows a table header like `[author]` becomes part of that table).
robots = [
  "AdsBot-Google",
  "Amazonbot",
  "anthropic-ai",
  "Applebot",
  "AwarioRssBot",
  "AwarioSmartBot",
  "Bytespider",
  "CCBot",
  "ChatGPT",
  "ChatGPT-User",
  "Claude-Web",
  "ClaudeBot",
  "cohere-ai",
  "DataForSeoBot",
  "Diffbot",
  "FacebookBot",
  "Google-Extended",
  "GPTBot",
  "ImagesiftBot",
  "magpie-crawler",
  "omgili",
  "Omgilibot",
  "peer39_crawler",
  "PerplexityBot",
  "YouBot"
]

[author]
name = "John Bowdre"
I then created a new template in `layouts/robots.txt`:
Sitemap: {{ .Site.BaseURL }}/sitemap.xml

User-agent: *
Disallow:
{{ range .Site.Params.robots }}
User-agent: {{ . }}
{{- end }}
Disallow: /
And enabled the template processing for this in my `config/hugo.toml` file:
enableRobotsTXT = true
Now Hugo will generate the following `robots.txt` file for me:
Sitemap: https://runtimeterror.dev//sitemap.xml

User-agent: *
Disallow:

User-agent: AdsBot-Google
User-agent: Amazonbot
User-agent: anthropic-ai
User-agent: Applebot
User-agent: AwarioRssBot
User-agent: AwarioSmartBot
User-agent: Bytespider
User-agent: CCBot
User-agent: ChatGPT
User-agent: ChatGPT-User
User-agent: Claude-Web
User-agent: ClaudeBot
User-agent: cohere-ai
User-agent: DataForSeoBot
User-agent: Diffbot
User-agent: FacebookBot
User-agent: Google-Extended
User-agent: GPTBot
User-agent: ImagesiftBot
User-agent: magpie-crawler
User-agent: omgili
User-agent: Omgilibot
User-agent: peer39_crawler
User-agent: PerplexityBot
User-agent: YouBot
Disallow: /
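If you want to sanity-check a file like this without waiting for a crawler to show up, Python's standard `urllib.robotparser` can evaluate it locally. This is only a sketch: the `ROBOTS_TXT` sample is an abbreviated copy of the output above (most bots omitted), `is_allowed` is a helper name I made up, and a well-behaved parser agreeing with the file says nothing about whether a given bot actually honors it.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated sample of the generated file; the full bot list is left out.
ROBOTS_TXT = """\
Sitemap: https://runtimeterror.dev//sitemap.xml

User-agent: *
Disallow:

User-agent: GPTBot
User-agent: CCBot
User-agent: ClaudeBot
Disallow: /
"""


def is_allowed(robots_txt: str, agent: str, path: str = "/") -> bool:
    """Return True if `agent` may fetch `path` according to `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # "SomeBrowser" is a placeholder for any agent that is not on the list.
    for agent in ("GPTBot", "CCBot", "ClaudeBot", "SomeBrowser"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, agent) else "blocked"
        print(f"{agent}: {verdict}")
```

The consecutive `User-agent` lines form a single group that shares the final `Disallow: /`, which is why one rule at the bottom covers every listed bot.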
Cool!
I also dropped the following into `static/ai.txt` for good measure [6]:
# Spawning AI
# Prevent datasets from using the following file types

User-Agent: *
Disallow: /
Disallow: *
That's all well and good, but these files carry all the weight and authority of a "No Soliciting" sign. Do I *really* trust these bots to honor it?
I'm hosting this site on Neocities [7], and Neocities unfortunately (though perhaps wisely) doesn't give me control of the web server there. But the site is fronted by Cloudflare, and that does give me a lot of options for blocking stuff I don't want.
So I added a WAF Custom Rule [8] to block those unwanted bots. (I could have used their User Agent Blocking [9] to accomplish the same thing, but the free tier only allows 10 of those rules, while a single WAF Custom Rule can hold all the user agents together.)
Here's the expression I'm using:
(http.user_agent contains "AdsBot-Google") or
(http.user_agent contains "Amazonbot") or
(http.user_agent contains "anthropic-ai") or
(http.user_agent contains "Applebot") or
(http.user_agent contains "AwarioRssBot") or
(http.user_agent contains "AwarioSmartBot") or
(http.user_agent contains "Bytespider") or
(http.user_agent contains "CCBot") or
(http.user_agent contains "ChatGPT-User") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "Claude-Web") or
(http.user_agent contains "cohere-ai") or
(http.user_agent contains "DataForSeoBot") or
(http.user_agent contains "FacebookBot") or
(http.user_agent contains "Google-Extended") or
(http.user_agent contains "GoogleOther") or
(http.user_agent contains "GPTBot") or
(http.user_agent contains "ImagesiftBot") or
(http.user_agent contains "magpie-crawler") or
(http.user_agent contains "Meltwater") or
(http.user_agent contains "omgili") or
(http.user_agent contains "omgilibot") or
(http.user_agent contains "peer39_crawler") or
(http.user_agent contains "peer39_crawler/1.0") or
(http.user_agent contains "PerplexityBot") or
(http.user_agent contains "Seekr") or
(http.user_agent contains "YouBot")
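Typing out an expression that long by hand gets tedious, so here is a small Python sketch that builds one from a list of user agent names. The `build_expression` helper is hypothetical (I didn't generate my rule this way), and it assumes none of the names contain double quotes that would need escaping.

```python
def build_expression(user_agents: list[str]) -> str:
    """Join one `contains` clause per user agent with `or`."""
    clauses = [f'(http.user_agent contains "{ua}")' for ua in user_agents]
    return " or ".join(clauses)


if __name__ == "__main__":
    # A short sample list; swap in the full set of bots you want to block.
    print(build_expression(["GPTBot", "CCBot", "ClaudeBot"]))
```

The resulting string can then be pasted into the rule's expression editor in the Cloudflare dashboard.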
Image: Creating a custom WAF rule in Cloudflare's web UI
And checking on that rule ~24 hours later, I can see that it's doing some good:
Image: It's blocked 102 bot hits already
See ya, AI bots!