robots.txt for Gemini formalised

It was thus said that the Great Solderpunk once stated:
> Hi folks,
> 
> There is now (finally!) an official reference on the use of robots.txt
> files in Geminispace.  Please see:
> 
> gemini://gemini.circumlunar.space/docs/companion/robots.gmi

  Nice.

> I attempted to take into account previous discussions on the mailing
> list and the currently declared practices of various well-known Gemini
> bots (broadly construed).
> 
> I don't consider this "companion spec" to necessarily be finalised at
> this point, but I am primarily interested in hearing suggestions for
> change from either authors of software which tries to respect robots.txt
> who are having problems caused by the current specification, or from
> server admins who are having bot problems who feel that the current
> specification is not working for them.

  Right now, there are two things I would change.

	1. Add "allow".  While the initial spec [1] did not have an allow
	   rule, a subsequent draft proposal [2] did, which Google is
	   pushing (as of 2019) to become an RFC [3].

	2. I would specify virtual agents as:

		Virtual-agent: archiver
		Virtual-agent: indexer

	   This makes it easier to add new virtual agents, separates the
	   namespace of agents from the namespace of virtual agents, and is
	   allowed by all current and proposed standards [4].

	   The rule I would follow is:

		Definitions:  
			specific user agent is one that is not '*'
			specific virtual agent is one that is not '*'
			generic user agent is one that is specified as '*'
			generic virtual agent is one that is '*'

		A crawler should use a block of rules:

			if it finds a specific user agent (most targeted)
			or it finds a specific virtual agent
			or it finds a generic virtual agent
			or it finds a generic user agent (least targeted)

	   I'm wavering on the generic virtual agent bit, so if you think
	   that makes this too complicated, I think it can go.
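
  To make the selection order concrete, here is a minimal sketch in Python
(the Block type and select_block() are my own names, not anything from a
spec) of how a crawler might pick the block of rules that applies to it:

import collections

# One parsed robots.txt block: the User-agent and Virtual-agent names it
# applies to, plus its rules (Allow/Disallow/Crawl-delay/...).
Block = collections.namedtuple('Block', 'user_agents virtual_agents rules')

def select_block(blocks, agent, virtuals):
    # Rank a block by how targeted it is for this crawler; lower wins.
    def rank(block):
        if agent in block.user_agents:
            return 0                       # specific user agent
        if any(v in block.virtual_agents for v in virtuals):
            return 1                       # specific virtual agent
        if '*' in block.virtual_agents:
            return 2                       # generic virtual agent
        if '*' in block.user_agents:
            return 3                       # generic user agent
        return None                        # block does not apply
    applicable = [b for b in blocks if rank(b) is not None]
    return min(applicable, key=rank) if applicable else None

  Against the sample file further down, GUS would match the first block on
its name, while an anonymous feed fetcher identifying as "feed" would match
the second block on its virtual agent.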

> The biggest gap that I can currently see is that there is no advice on
> how often bots should re-query robots.txt to check for policy changes.
> I could find no clear advice on this for the web, either.  I would be
> happy to hear from people who've written software that uses robots.txt
> with details on what their current practices are in this respect.

  The Wikipedia page [5] lists a non-standard extension "Crawl-delay" which
informs a crawler how often it should make requests.  It would be easy to
add a similar field saying how often to fetch a resource.  A sample file:

# The GUS agent, plus any agent that identifies as an "indexer", is allowed
# one path in an otherwise disallowed place, and should fetch items no more
# often than once every 10 seconds.

User-agent: GUS
Virtual-agent: indexer
Allow: /private/butpublic
Disallow: /private
Crawl-delay: 10

# Agents that fetch feeds should only grab content every 6 hours.  "Check" is
# allowed here because agents should ignore fields they don't understand.

Virtual-agent: feed
Disallow: /private
Check: 21600

# And a fallback.  Here we don't allow any old crawler into the private
# space, and we force them to wait 20 seconds between fetches.

User-agent: *
Disallow: /private
Crawl-delay: 20
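
  And honouring Crawl-delay (or a "Check" style field) on the crawler side is
not much work.  A rough sketch, again in Python, where fetch() just stands in
for however the crawler actually issues a Gemini request:

import time

def crawl(urls, fetch, crawl_delay):
    # Never issue two requests closer together than crawl_delay seconds.
    last = None
    for url in urls:
        if last is not None:
            wait = crawl_delay - (time.monotonic() - last)
            if wait > 0:
                time.sleep(wait)
        last = time.monotonic()
        fetch(url)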

  -spc

[1]	gemini://gemini.circumlunar.space/docs/companion/robots.gmi

[2]	http://www.robotstxt.org/norobots-rfc.txt

[3]	https://developers.google.com/search/reference/robots_txt

[4]	Any field not understood by a crawler should be ignored.

[5]	https://en.wikipedia.org/wiki/Robots_exclusion_standard
