💾 Archived View for rawtext.club › ~sloum › geminilist › 000519.gmi captured on 2020-09-24 at 02:30:50. Gemini links have been rewritten to link to archived content


robots.txt for Gemini

Sean Conner sean at conman.org

Tue Mar 24 21:35:08 GMT 2020

- - - - - - - - - - - - - - - - - - -

It was thus said that the Great solderpunk once stated:
> The biggest question, in my mind, is what to do about user-agents, which
> Gemini lacks (by design, as they are a component of the browser
> fingerprinting problem, and because they encourage content developers to
> serve browser-specific content which is a bad thing IMHO).  The 2019 RFC
> says "The product token SHOULD be part of the identification string that
> the crawler sends to the service" (where "product token" is bizarre and
> disappointingly commercial alternative terminology for "user-agent" in
> this document), so the fact that Gemini doesn't send one is not
> technically a violation.

  Two possible solutions for robot identification:

1) Allow IP addresses to be used where a user-agent would be specified. Some examples:

	User-agent: 172.16.89.3
	User-agent: 172.17.24.0/27
	User-agent: fde7:a680:47d3::/48

Yes, I'm including CIDR (Classless Inter-Domain Routing) notation to specify a range of IP addresses.  And for a robot, if your IP address matches an IP address (or range), then you need to follow the rules given for it.
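A minimal sketch of how a crawler might check itself against such entries, using Python's standard ipaddress module (the function name and the IP-in-User-agent convention are the proposal above, not anything existing robots.txt parsers implement):

```python
import ipaddress

def matches_agent(robot_ip: str, agent_value: str) -> bool:
    """True if robot_ip falls inside the User-agent value, which under
    this proposal may be a single address or a CIDR range."""
    try:
        network = ipaddress.ip_network(agent_value, strict=False)
    except ValueError:
        return False  # not an IP-style User-agent line
    return ipaddress.ip_address(robot_ip) in network

print(matches_agent("172.16.89.3", "172.16.89.3"))      # True
print(matches_agent("172.17.24.5", "172.17.24.0/27"))   # True
print(matches_agent("172.17.24.40", "172.17.24.0/27"))  # False: /27 ends at .31
```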

2) Use the fragment portion of a URL to designate a robot.  The fragment portion of a URL has no meaning for a server (it does for a client).  A robot could use this fact to slip in its identifier when making a request.  The server MUST NOT use this information, but the logs could show it.  For example, a robot could request:

	gemini://example.com/robots.txt#GUS

A review of the logs would reveal that GUS is a robot, and the text "GUS" could be placed in the User-agent: field to control it.  It SHOULD be the text the robot would recognize in robots.txt.  One clarification, this:

	gemini://example.com/robots.txt#foo%20bot

would be 

	User-agent: foo bot

but a robot ID SHOULD NOT contain spaces---it SHOULD be one word.
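The fragment scheme above could be sketched like this (urlsplit, quote, and unquote are standard Python; the tagging convention itself is the proposal, not anything servers do today):

```python
from urllib.parse import quote, unquote, urlsplit

# Robot side: tag the request URL with the robot's identifier.
def tagged_url(base: str, robot_id: str) -> str:
    return f"{base}#{quote(robot_id)}"

print(tagged_url("gemini://example.com/robots.txt", "GUS"))
# gemini://example.com/robots.txt#GUS

# Log-review side: recover the identifier from a logged request URL.
def robot_id_from_log(url: str) -> str:
    return unquote(urlsplit(url).fragment)

print(robot_id_from_log("gemini://example.com/robots.txt#foo%20bot"))
# foo bot
```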

  Anyway, those are my ideas.

  -spc