
robots.txt for Gemini

Sean Conner sean at conman.org

Thu Mar 26 21:07:28 GMT 2020

- - - - - - - - - - - - - - - - - - -

It was thus said that the Great solderpunk once stated:
> On Tue, Mar 24, 2020 at 05:35:08PM -0400, Sean Conner wrote:
> > 
> >   Two possible solutions for robot identification:
> > 
> > 1) Allow IP addresses to be used where a user-agent would be specified.
> > Some examples:
> > 
> > 	User-agent: 172.16.89.3
> > 	User-agent: 172.17.24.0/27
> > 	User-agent: fde7:a680:47d3::/48
> > 
> > Yes, I'm including CIDR (Classless Inter-Domain Routing) notation to
> > specify a range of IP addresses.  And for a robot, if your IP address
> > matches an IP address (or range), then you need to follow the rules
> > given below.
> 
> Hmm, I'm not a huge fan of this idea (although I recognise it as a valid
> technical solution to the problem at hand, which is perhaps all you
> meant it to be). 

  Pretty much.  

> Mostly because I don't like to encourage people to
> think of IP addresses as permanently mapping to, well, just anything.
> The address of a VPN running an abusive bot today might be handed out to
> a different customer running a well-behaved bot next year.

  Fair enough.  I'm just throwing out ideas here.
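
  That said, the matching itself is cheap for a robot to do.  Here's a
rough sketch in Python using the standard ipaddress module (the IP-based
User-agent: lines and the simplified parsing are hypothetical, per the
proposal above—real robots.txt grouping and wildcards are glossed over):

	import ipaddress
	
	def rules_for_my_ip(robots_txt, my_addr):
	    # Hypothetical sketch: any User-agent: value that parses as an
	    # address or CIDR network is treated as an address-based match;
	    # anything else is assumed to be a named user-agent.
	    me = ipaddress.ip_address(my_addr)
	    applies = False
	    matched = []
	    for line in robots_txt.splitlines():
	        line = line.strip()
	        if line.lower().startswith("user-agent:"):
	            value = line.split(":", 1)[1].strip()
	            try:
	                net = ipaddress.ip_network(value, strict=False)
	                applies = me in net
	            except ValueError:
	                applies = False  # a named user-agent, not an address
	        elif applies and line:
	            matched.append(line)  # e.g. "Disallow: /private/"
	    return matched

A robot running at, say, 172.17.24.9 would pick up whatever rules sit
under the 172.17.24.0/27 entry above.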

> > 2) Use the fragment portion of a URL to designate a robot.  The fragment
> > portion of a URL has no meaning for a server (it does for a client).  A
> > robot could use this fact to slip in its identifier when making a
> > request.  The server MUST NOT use this information, but the logs could
> > show it.  For example, a robot could request:
> > 
> > 	gemini://example.com/robots.txt#GUS
> > 
> > A review of the logs would reveal that GUS is a robot, and the text "GUS"
> > could be placed in the User-agent: field to control it.  It SHOULD be the
> > text the robot would recognize in robots.txt.
> 
> Hmm, nice out-of-the-box thinking.  Since the suggestion has come from
> you I will assume it does not violate the letter of any RFCs, even
> though I can't shake a strange feeling that this is "abusing" the
> fragment concept a little...

  Well ... it's skating right up to the line, and may be going over it a bit.  RFC-3986 says this about fragments:

	The fragment identifier component of a URI allows indirect
	identification of a secondary resource by reference to a primary
	resource and additional identifying information.  The identified
	secondary resource may be some portion or subset of the primary
	resource, some view on representations of the primary resource, or
	some other resource defined or described by those representations.

... and so on.  An argument could be made that a request like:

	gemini://example.com/robots.txt#Foobot

could apply, as it is "referencing" the "Foobot" section of robots.txt, but
such a claim would only be applicable to /robots.txt and not other resources
on the server.  Perhaps this could be just limited to references to
/robots.txt?

  So yes, on the line here.  And yes, it's "abusing" the fragment concept a
little ... but other than these two methods, how else would one identify a
robot on Gemini?
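
  It would be trivial for a robot to do, at least.  A rough sketch, again
in Python (the "Foobot" identifier is made up, certificate checking is
disabled as many Gemini hosts self-sign, and error handling is skipped):

	import socket, ssl
	
	ROBOT_ID = "Foobot"  # hypothetical; SHOULD match a robots.txt User-agent
	
	def gemini_fetch(host, path, port=1965):
	    # Tack our identifier onto the request as a fragment.  The server
	    # MUST NOT act on it, but it will show up in the server's logs.
	    url = "gemini://{}{}#{}".format(host, path, ROBOT_ID)
	    ctx = ssl.create_default_context()
	    ctx.check_hostname = False
	    ctx.verify_mode = ssl.CERT_NONE
	    with socket.create_connection((host, port)) as sock:
	        with ctx.wrap_socket(sock, server_hostname=host) as tls:
	            tls.sendall((url + "\r\n").encode("utf-8"))
	            return tls.makefile("rb").read()

On the server side, a simple grep of the logs for '#' in the request line
would then turn up the robots visiting the site.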

  -spc