robots.txt for Gemini

🗣️ From: Sean Conner (sean (a) conman.org)
📅 Sent: 2020-03-26 21:07
📧 Message 8 of 8
It was thus said that the Great solderpunk once stated:
> On Tue, Mar 24, 2020 at 05:35:08PM -0400, Sean Conner wrote:
>  
> >   Two possible solutions for robot identification:
> > 
> > 1) Allow IP addresses to be used where a user-agent would be specified.
> > Some examples:
> > 
> > 	User-agent: 172.16.89.3
> > 	User-agent: 172.17.24.0/27
> > 	User-agent: fde7:a680:47d3::/48
> > 
> > Yes, I'm including CIDR (Classless Inter-Domain Routing) notation to specify
> > a range of IP addresses.  And for a robot, if your IP address matches an IP
> > address (or range), then you need to obey the rules that follow.
> 
> Hmm, I'm not a huge fan of this idea (although I recognise it as a valid
> technical solution to the problem at hand, which is perhaps all you
> meant it to be). 

  Pretty much.  

> Mostly because I don't like to encourage people to
> think of IP addresses as permanently mapping to, well, just anything.
> The address of a VPN running an abusive bot today might be handed out to
> a different customer running a well-behaved bot next year.

  Fair enough.  I'm just throwing out ideas here.
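
  For illustration only, here is a minimal sketch of how a robot might test
option 1 against its own address.  It assumes Python and its standard
ipaddress module; the function name agent_matches_ip is hypothetical and not
part of any proposal.

```python
import ipaddress

def agent_matches_ip(agent_value, client_ip):
    """Return True if a User-agent value written as an IP address or
    CIDR range covers client_ip (hypothetical helper)."""
    try:
        # strict=False accepts a bare address as a single-host network
        network = ipaddress.ip_network(agent_value, strict=False)
    except ValueError:
        # Not an address or range, so treat it as an ordinary robot name
        return False
    addr = ipaddress.ip_address(client_ip)
    # An IPv4 address never matches an IPv6 range, and vice versa
    return addr.version == network.version and addr in network
```

  A robot running at 172.17.24.17 would call
agent_matches_ip("172.17.24.0/27", "172.17.24.17") for each User-agent: line
and obey the rules under any line that matches.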

> > 2) Use the fragment portion of a URL to designate a robot.  The fragment
> > portion of a URL has no meaning for a server (it does for a client).  A
> > robot could use this fact to slip in its identifier when making a request. 
> > The server MUST NOT use this information, but the logs could show it.  For
> > example, a robot could request:
> > 
> > 	gemini://example.com/robots.txt#GUS
> > 
> > A review of the logs would reveal that GUS is a robot, and the text "GUS"
> > could be placed in the User-agent: field to control it.  It SHOULD be the
> > text the robot would recognize in robots.txt.
> 
> Hmm, nice out-of-the-box thinking.  Since the suggestion has come from
> you I will assume it does not violate the letter of any RFCs, even
> though I can't shake a strange feeling that this is "abusing" the
> fragment concept a little...

  Well ... it's skating right up to the line, and may be going over it a
bit.  RFC-3986 says this about fragments:

	The fragment identifier component of a URI allows indirect
	identification of a secondary resource by reference to a primary
	resource and additional identifying information.  The identified
	secondary resource may be some portion or subset of the primary
	resource, some view on representations of the primary resource, or
	some other resource defined or described by those representations.

... and so on.  An argument could be made that a request like:

	gemini://example.com/robots.txt#Foobot

could apply, as it is "referencing" the "Foobot" section of robots.txt, but
such a claim would only be applicable to /robots.txt and not other resources
on the server.  Perhaps this could be just limited to references to
/robots.txt?
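
  A rough sketch of what that restriction could look like on the server side,
assuming Python and urllib.parse.  The function name split_robot_tag is
hypothetical.  The fragment is only returned for logging and is never used to
choose the response.

```python
from urllib.parse import urlsplit, urlunsplit

def split_robot_tag(request_url):
    """Return (resource_url, tag).  tag is the fragment when the request
    is for /robots.txt, otherwise None (hypothetical helper)."""
    parts = urlsplit(request_url)
    tag = None
    if parts.path == "/robots.txt" and parts.fragment:
        tag = parts.fragment  # for the log only
    # Rebuild the URL without the fragment; this is what gets served
    resource = urlunsplit((parts.scheme, parts.netloc, parts.path,
                           parts.query, ""))
    return resource, tag
```

  With this, a request for gemini://example.com/robots.txt#GUS is served as
plain /robots.txt while "GUS" goes to the log, and a fragment on any other
resource is simply dropped.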

  So yes, on the line here.  And yes, it's "abusing" the fragment concept a
little ... but other than these two methods, how else would one identify a
robot on Gemini?

  -spc