It was thus said that the Great Solderpunk once stated:

> Hi folks,
>
> There is now (finally!) an official reference on the use of robots.txt
> files in Geminispace. Please see:
>
> gemini://gemini.circumlunar.space/docs/companion/robots.gmi

Nice.

> I attempted to take into account previous discussions on the mailing
> list and the currently declared practices of various well-known Gemini
> bots (broadly construed).
>
> I don't consider this "companion spec" to necessarily be finalised at
> this point, but I am primarily interested in hearing suggestions for
> change from either authors of software which tries to respect robots.txt
> who are having problems caused by the current specification, or from
> server admins who are having bot problems who feel that the current
> specification is not working for them.

Right now, there are two things I would change.

1. Add "Allow". While the initial spec [1] did not have an allow rule, a
subsequent draft proposal [2] did, which Google is pushing (as of 2019) to
become an RFC [3].

2. I would specify virtual agents as:

	Virtual-agent: archiver
	Virtual-agent: indexer

This makes it easier to add new virtual agents, separates the namespace of
agents from the namespace of virtual agents, and is allowed by all current
and proposed standards [4].

The rule I would follow is:

	Definitions:

	A specific user agent     is one that is not '*'
	A specific virtual agent  is one that is not '*'
	A generic user agent      is one that is specified as '*'
	A generic virtual agent   is one that is '*'

	A crawler should use a block of rules:

		if it finds a specific user agent (most targeted)
		or it finds a specific virtual agent
		or it finds a generic virtual agent
		or it finds a generic user agent (least targeted)

(See the sketch after the references for one way a crawler could implement
this selection.)

I'm wavering on the generic virtual agent bit, so if you think that makes
this too complicated, fine, I think it can go.

> The biggest gap that I can currently see is that there is no advice on
> how often bots should re-query robots.txt to check for policy changes.
> I could find no clear advice on this for the web, either. I would be
> happy to hear from people who've written software that uses robots.txt
> with details on what their current practices are in this respect.

The Wikipedia page [5] lists a non-standard extension, "Crawl-delay", which
informs a crawler how often it should make requests. It might be easy to
add a field saying how often to fetch a resource. A sample file:

# The GUS agent, plus any agent that identifies as an "indexer", is allowed
# one path in an otherwise disallowed place, and may only fetch items at
# 10 second intervals.
User-agent: GUS
Virtual-agent: indexer
Allow: /private/butpublic
Disallow: /private
Crawl-delay: 10

# Agents that fetch feeds should only grab every 6 hours. "Check" is
# allowed, as agents should ignore fields they don't understand.
Virtual-agent: feed
Disallow: /private
Check: 21600

# And a fallback. Here we don't allow any old crawler into the private
# space, and we force them to wait 20 seconds between fetches.
User-agent: *
Disallow: /private
Crawl-delay: 20

-spc

[1] gemini://gemini.circumlunar.space/docs/companion/robots.gmi
[2] http://www.robotstxt.org/norobots-rfc.txt
[3] https://developers.google.com/search/reference/robots_txt
[4] Any field not understood by a crawler should be ignored.
[5] https://en.wikipedia.org/wiki/Robots_exclusion_standard
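
As a rough illustration of the block-selection rule above, here is a minimal
Python sketch. The block layout (a set of user agents, a set of virtual
agents, and the rules per block), the function name select_block, and the
example agent name "CAPCOM" are assumptions made for this example; none of
them are defined by the companion spec or the message above.

	# Hypothetical sketch of the precedence described above: pick the
	# single most targeted block that applies to this crawler.
	def select_block(blocks, my_user_agent, my_virtual_agents):
	    """Precedence, most to least targeted:
	       1. a block naming our specific user agent
	       2. a block naming one of our specific virtual agents
	       3. a block with a generic virtual agent ('*' in Virtual-agent)
	       4. a block with a generic user agent ('*' in User-agent)
	    """
	    specific_ua = specific_va = generic_va = generic_ua = None

	    for block in blocks:
	        if my_user_agent in block["user_agents"]:
	            specific_ua = specific_ua or block
	        if any(va in block["virtual_agents"] for va in my_virtual_agents):
	            specific_va = specific_va or block
	        if "*" in block["virtual_agents"]:
	            generic_va = generic_va or block
	        if "*" in block["user_agents"]:
	            generic_ua = generic_ua or block

	    return specific_ua or specific_va or generic_va or generic_ua

	# Example: a feed reader called "CAPCOM" that identifies as the
	# "feed" virtual agent, run against the sample file above.
	blocks = [
	    {"user_agents": {"GUS"}, "virtual_agents": {"indexer"},
	     "rules": {"Allow": ["/private/butpublic"],
	               "Disallow": ["/private"], "Crawl-delay": 10}},
	    {"user_agents": set(), "virtual_agents": {"feed"},
	     "rules": {"Disallow": ["/private"], "Check": 21600}},
	    {"user_agents": {"*"}, "virtual_agents": set(),
	     "rules": {"Disallow": ["/private"], "Crawl-delay": 20}},
	]

	chosen = select_block(blocks, "CAPCOM", {"feed"})
	print(chosen["rules"])  # -> {'Disallow': ['/private'], 'Check': 21600}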