
Crawlers on Gemini and best practices

Sudipto Mallick smallick.dev at gmail.com

Thu Dec 10 18:07:34 GMT 2020

- - - - - - - - - - - - - - - - - - - 

On 12/10/20, Stephane Bortzmeyer <stephane at sources.org> wrote:

> Opinion: maybe we should specify a syntax for Gemini's robots.txt,
> not relying on the broken Web one?

Here it is:

'bots.txt' for Gemini bots and crawlers.

- know who you are: archiver, indexer, feed-reader, researcher etc.
- ask for /bots.txt
- if 20 text/plain then
-- allowed = set()
-- denied = set()
-- split response by newlines, for each line
--- split by spaces and tabs into fields
---- paths = fields[0] split by ','
---- if fields[2] is "allowed" and you in fields[1] split by ',' then allowed = allowed union paths
----- if fields[3] is "but" and fields[5] is "denied" and you in fields[4] split by ',' then denied = denied union paths
---- if fields[2] is "denied" and you in fields[1] split by ',' then denied = denied union paths

you always match all, never match none

union of paths is special: { "/a/b" } union { "/a/b/c" } == { "/a/b" }

when you request a path, find the longest match from allowed and denied; if it is in allowed you're allowed, otherwise not; when a tie: undefined behaviour, do what you want.
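Read literally, the longest-match rule could look like the sketch below. Treating "no rule matches at all" as a denial follows the strict "otherwise not" wording, and denying on a tie is just one permitted choice, since the proposal leaves ties undefined.

def is_allowed(path, allowed, denied):
    candidates = [p for p in (allowed | denied) if path.startswith(p)]
    if not candidates:
        return False                      # "otherwise not": nothing matched
    longest = max(len(p) for p in candidates)
    best = {p for p in candidates if len(p) == longest}
    in_allowed = any(p in allowed for p in best)
    in_denied = any(p in denied for p in best)
    if in_allowed and in_denied:
        return False                      # tie: undefined behaviour, this sketch denies
    return in_allowed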

examples:

default, effectively:

/ all allowed

or

/ none denied

complex example:

/priv1,/priv2,/login all denied
/cgi-bin indexer allowed but archiver denied
/priv1/pub researcher allowed but blabla,meh,heh,duh denied
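For instance, feeding the complex example through the sketches above for a bot that identifies only as an indexer (the expected outcomes are my interpretation of the rules, not something stated in the proposal):

rules = """\
/priv1,/priv2,/login all denied
/cgi-bin indexer allowed but archiver denied
/priv1/pub researcher allowed but blabla,meh,heh,duh denied
"""

allowed, denied = parse_bots_txt(rules, roles={"indexer"})
print(is_allowed("/cgi-bin/search", allowed, denied))  # True: longest match /cgi-bin is allowed
print(is_allowed("/priv1/anything", allowed, denied))  # False: /priv1 is denied for everyone
print(is_allowed("/about", allowed, denied))           # False: no rule matches, "otherwise not"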

what do you think?