Natalie Pendragon natpen at natpen.net
Thu Jul 23 23:01:03 BST 2020
- - - - - - - - - - - - - - - - - - -
For the GUS crawl at least, the crawler doesn't identify itself _to_ crawled sites, but it does obey blocks of rules in robots.txt files according to user-agent. So it works without needing a user-agent header.
It obeys user-agent of `*`, `indexer`, and `gus` in order of increasing importance.
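To make that precedence concrete, here is a rough Python sketch of how a crawler could pick which robots.txt block to obey. It is not the actual GUS code: the names `PRECEDENCE`, `parse_groups` and `is_allowed` are made up for illustration, it only looks at `Disallow` lines, and it assumes the most important block present replaces the less important ones rather than being merged with them.

```python
# Hypothetical sketch, not the real GUS implementation.
# User-agents the crawler answers to, from least to most important.
PRECEDENCE = ["*", "indexer", "gus"]


def parse_groups(robots_txt):
    """Map each lower-cased user-agent to its list of Disallow path prefixes."""
    groups = {}
    current_agents = []
    in_rules = False  # True once the current block has started listing rules
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        field = field.strip().lower()
        value = value.strip()
        if field == "user-agent":
            if in_rules:
                # A user-agent line after rules starts a new block.
                current_agents = []
                in_rules = False
            agent = value.lower()
            current_agents.append(agent)
            groups.setdefault(agent, [])
        elif field == "disallow":
            in_rules = True
            if value:  # an empty Disallow means "nothing is blocked"
                for agent in current_agents:
                    groups[agent].append(value)
    return groups


def is_allowed(robots_txt, path):
    """Apply the rules of the most important block that is present."""
    groups = parse_groups(robots_txt)
    rules = []
    for agent in PRECEDENCE:
        if agent in groups:
            rules = groups[agent]  # assumption: later entries override earlier ones
    return not any(path.startswith(prefix) for prefix in rules)


EXAMPLE = """\
User-agent: *
Disallow: /private/

User-agent: indexer
Disallow: /cgi-bin/
"""

print(is_allowed(EXAMPLE, "/cgi-bin/search"))
print(is_allowed(EXAMPLE, "/private/notes.gmi"))
```

Under the "override" assumption, the `indexer` block wins over `*` here, so `/cgi-bin/search` is refused while `/private/notes.gmi` is fetched. A crawler that merged blocks instead would refuse both.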
There's been some talk of generic sorts of user-agents in the past, which I think is a really nice idea. If `indexer` were a user-agent that both sites and crawlers had some sort of informal consensus on, then sites wouldn't need to worry about keeping up with any new indexers popping up.
Some other generic user-agent ideas, iirc, were `archiver` and `proxy`.
On Thu, Jul 23, 2020 at 04:45:50PM -0400, Sean Conner wrote:
It was thus said that the Great Jason McBrayer once stated:
This is cool, but when you stand it up, don't forget an appropriate
robots.txt!
Question---HTTP has the User-Agent: header to help identify webbots, but
Gemini doesn't have that. How do I instruct a Gemini bot with robots.txt,
when there's no way for a Gemini bot to identify itself?
-spc