Identifying robots (was Re: Open Source Proxy)

Natalie Pendragon natpen at natpen.net

Thu Jul 23 23:01:03 BST 2020

- - - - - - - - - - - - - - - - - - -

For the GUS crawl at least, the crawler doesn't identify itself _to_ crawled sites, but it does obey blocks of rules in robots.txt files according to user-agent. So it works without needing a user-agent header.

It obeys user-agent of `*`, `indexer`, and `gus` in order of increasing importance.
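
To make the precedence concrete, here is a rough Python sketch of how a crawler could pick which block of rules to obey. It is not the actual GUS code: the names `PRIORITY`, `parse_robots` and `is_allowed` are made up for illustration, only `Disallow` lines are handled, and it assumes the most important block present replaces the less important ones wholesale rather than being merged with them.

```python
# Hypothetical sketch, not GUS's implementation.
# User-agents the crawler answers to, in order of increasing importance.
PRIORITY = ["*", "indexer", "gus"]


def parse_robots(text):
    """Map each lowercased user-agent to its list of Disallow path prefixes."""
    groups = {}
    agents = []        # user-agents named by the block currently being read
    in_rules = False   # True once the current block has seen a rule line
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:                      # a new block starts here
                agents, in_rules = [], False
            agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field == "disallow":
            in_rules = True
            if value:                         # an empty Disallow blocks nothing
                for agent in agents:
                    groups[agent].append(value)
    return groups


def is_allowed(groups, path):
    """Obey the most important block present (assumed to win wholesale)."""
    for agent in reversed(PRIORITY):
        if agent in groups:
            return not any(path.startswith(p) for p in groups[agent])
    return True   # no applicable block, so nothing is disallowed
```

With a robots.txt that disallows `/` for `*` but only `/private/` for `indexer`, this sketch would fetch `/index.gmi` and skip `/private/notes.gmi`, because the `indexer` block outranks the `*` block. No user-agent header is ever sent; the crawler just decides locally which blocks apply to it.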

There's been some talk of the generic sorts of user-agents in the past, which I think is a really nice idea. If `indexer` is a user-agent that both sites and crawlers had some sort of informal consensus on, then sites wouldn't need to worry about keeping up with any new indexers popping up.

Some other generic user-agent ideas, iirc, were `archiver` and `proxy`.

On Thu, Jul 23, 2020 at 04:45:50PM -0400, Sean Conner wrote:

It was thus said that the Great Jason McBrayer once stated:

This is cool, but when you stand it up, don't forget an appropriate robots.txt!

Question---HTTP has the User-Agent: header to help identify webbots, but Gemini doesn't have that. How do I instruct a Gemini bot with robots.txt, when there's no way for a Gemini bot to identify itself?

-spc