💾 Archived View for gmi.runtimeterror.dev › blocking-ai-crawlers › index.gmi captured on 2024-08-25 at 00:27:55. Gemini links have been rewritten to link to archived content

💻 [runtimeterror $]

2024-04-12 ~ 2024-06-13

Blocking AI Crawlers

I've seen some recent posts from folks like Cory Dransfeldt [1] and Ethan Marcotte [2] about how (and *why*) to prevent your personal website from being slurped up by the crawlers that AI companies use to actively enshittify the internet [3]. I figured it was past time for me to hop on board with this, so here we are.

[1] Cory Dransfeldt

[2] Ethan Marcotte

[3] actively enshittify the internet

My initial approach was to use Hugo's robots.txt templating [4] to generate a `robots.txt` file based on a list of bad bots I got from ai.robots.txt on GitHub [5].

[4] Hugo's robots.txt templating

[5] ai.robots.txt on GitHub

I dumped that list into my `config/params.toml` file, *above* any of the nested elements (TOML is picky about that: any key defined after a `[table]` header belongs to that table, so top-level keys have to come first).

bad_robots = [
  "AdsBot-Google",
  "Amazonbot",
  "anthropic-ai",
  "Applebot-Extended",
  "AwarioRssBot",
  "AwarioSmartBot",
  "Bytespider",
  "CCBot",
  "ChatGPT",
  "ChatGPT-User",
  "Claude-Web",
  "ClaudeBot",
  "cohere-ai",
  "DataForSeoBot",
  "Diffbot",
  "FacebookBot",
  "Google-Extended",
  "GPTBot",
  "ImagesiftBot",
  "magpie-crawler",
  "omgili",
  "Omgilibot",
  "peer39_crawler",
  "PerplexityBot",
  "YouBot"
]

I then created a new template in `layouts/robots.txt`:

Sitemap: {{ .Site.BaseURL }}/sitemap.xml
# hello robots 
# let's be friends <3
User-agent: *
Disallow:
# except for these bots which are not friends:
{{- range .Site.Params.bad_robots }}
User-agent: {{ . }}
{{- end }}
Disallow: /

And enabled the template processing for this in my `config/hugo.toml` file:

enableRobotsTXT = true

Now Hugo will generate the following `robots.txt` file for me:

Sitemap: https://runtimeterror.dev/sitemap.xml
# hello robots 
# let's be friends <3
User-agent: *
Disallow:
# except for these bots which are not friends:
User-agent: AdsBot-Google
User-agent: Amazonbot
User-agent: anthropic-ai
User-agent: Applebot-Extended
User-agent: AwarioRssBot
User-agent: AwarioSmartBot
User-agent: Bytespider
User-agent: CCBot
User-agent: ChatGPT
User-agent: ChatGPT-User
User-agent: Claude-Web
User-agent: ClaudeBot
User-agent: cohere-ai
User-agent: DataForSeoBot
User-agent: Diffbot
User-agent: FacebookBot
User-agent: Google-Extended
User-agent: GPTBot
User-agent: ImagesiftBot
User-agent: magpie-crawler
User-agent: omgili
User-agent: Omgilibot
User-agent: peer39_crawler
User-agent: PerplexityBot
User-agent: YouBot
Disallow: /

Cool!
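As a quick sanity check after `hugo` regenerates the site, a bit of shell can confirm that each configured bot actually landed in the output. This is just a sketch: the `robots` variable below stands in for the real `public/robots.txt`, and the bot list is abridged.

```shell
#!/bin/sh
# Sketch: confirm each bot from the config shows up in the generated
# robots.txt. The heredoc-style variable stands in for public/robots.txt;
# point grep at the real file after running `hugo`. Bot list abridged.
robots='User-agent: *
Disallow:
User-agent: GPTBot
User-agent: CCBot
User-agent: ClaudeBot
Disallow: /'

for bot in GPTBot CCBot ClaudeBot; do
  if printf '%s\n' "$robots" | grep -qx "User-agent: $bot"; then
    echo "$bot: listed"
  else
    echo "$bot: MISSING"
  fi
done
```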

I also dropped the following into `static/ai.txt` for good measure [6]:

[6] good measure

# Spawning AI
# Prevent datasets from using the following file types
User-Agent: *
Disallow: /
Disallow: *

That's all well and good, but these files carry all the weight and authority of a "No Soliciting" sign. Do I *really* trust these bots to honor it?

I'm hosting this site on Neocities [7], and Neocities unfortunately (though perhaps wisely) doesn't give me control of the web server there. But the site is fronted by Cloudflare, and that does give me a lot of options for blocking stuff I don't want.

[7] on Neocities

So I added a WAF Custom Rule [8] to block those unwanted bots. (I could have used their User Agent Blocking [9] to accomplish the same, but you can only set 10 of those on the free tier. I can put all the user agents together in a single WAF Custom Rule.)

[8] WAF Custom Rule

[9] User Agent Blocking

Here's the expression I'm using:

(http.user_agent contains "AdsBot-Google") or (http.user_agent contains "Amazonbot") or (http.user_agent contains "anthropic-ai") or (http.user_agent contains "Applebot-Extended") or (http.user_agent contains "AwarioRssBot") or (http.user_agent contains "AwarioSmartBot") or (http.user_agent contains "Bytespider") or (http.user_agent contains "CCBot") or (http.user_agent contains "ChatGPT-User") or (http.user_agent contains "ClaudeBot") or (http.user_agent contains "Claude-Web") or (http.user_agent contains "cohere-ai") or (http.user_agent contains "DataForSeoBot") or (http.user_agent contains "FacebookBot") or (http.user_agent contains "Google-Extended") or (http.user_agent contains "GoogleOther") or (http.user_agent contains "GPTBot") or (http.user_agent contains "ImagesiftBot") or (http.user_agent contains "magpie-crawler") or (http.user_agent contains "Meltwater") or (http.user_agent contains "omgili") or (http.user_agent contains "omgilibot") or (http.user_agent contains "peer39_crawler") or (http.user_agent contains "peer39_crawler/1.0") or (http.user_agent contains "PerplexityBot") or (http.user_agent contains "Seekr") or (http.user_agent contains "YouBot")
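Rather than hand-assembling that expression, it can be generated from the same newline-separated bot list with a little shell. A sketch (bot list abridged here; feed in the full list from wherever you keep it):

```shell
#!/bin/sh
# Sketch: build the Cloudflare WAF expression from a newline-separated
# list of bot user agents. List abridged for the example.
bots='GPTBot
CCBot
ClaudeBot'

# Wrap each name in a `contains` clause and join the clauses with " or ".
expr=$(printf '%s\n' "$bots" \
  | awk '{printf "%s(http.user_agent contains \"%s\")", (NR>1 ? " or " : ""), $0}')
echo "$expr"
```

The output can then be pasted straight into the rule's expression editor in the Cloudflare dashboard.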

Image: Creating a custom WAF rule in Cloudflare's web UI

And checking on that rule ~24 hours later, I can see that it's doing some good:

Image: It's blocked 102 bot hits already

See ya, AI bots!

---

📧 Reply by email

Related articles

SilverBullet: Self-Hosted Knowledge Management Web App

Generate a Dynamic robots.txt File in Hugo with External Data Sources

Automate Packer Builds with GitHub Actions

---
