Crawlers on Gemini and best practices

On Thu, Dec 10, 2020 at 11:37:34PM +0530,
 Sudipto Mallick <smallick.dev at gmail.com> wrote 
 a message of 40 lines which said:

> 'bots.txt' for gemini bots and crawlers.

Interesting. The good thing is that it moves away from robots.txt
(underspecified, full of variants, impossible to know what a good bot
should do).

> - know who you are: archiver, indexer, feed-reader, researcher etc.
> - ask for /bots.txt
> - if 20 text/plain then
> -- allowed = set()
> -- denied = set()
> -- split response by newlines, for each line
> --- split by spaces and tabs into fields
> ---- paths = fields[0] split by ','
> ---- if fields[2] is "allowed" and you in fields[1] split by ',' then
> allowed = allowed union paths
> ----- if fields[3] is "but" and fields[5] is "denied" and you in
> fields[4] split by ',' then denied = denied union paths
> ---- if fields[2] is "denied" and you in fields[1] split by ',' then
> denied = denied union paths
> you always match all, never match none
> union of paths is special:
>     { "/a/b" } union { "/a/b/c" } ==> { "/a/b" }
> 
> when you request a path, find the longest match from allowed and
> denied; if it is in allowed you're allowed, otherwise not; when there
> is a tie: undefined behaviour, do what you want.

It seems perfect.
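
For concreteness, here is a rough Python sketch of how a bot might
implement the proposed parsing and matching. The function names, the
"archiver" identity and the handling of the no-match case are my own
assumptions, not part of the proposal:

# Rough sketch, not a reference implementation.  Assumes lines of the
# form "<paths> <agents> allowed [but <agents> denied]" or
# "<paths> <agents> denied", as described above.

import re

BOT_KIND = "archiver"   # who you are: archiver, indexer, feed-reader, ...

def matches(agent_field):
    # "you always match all, never match none"
    agents = agent_field.split(",")
    if "none" in agents:
        return False
    return "all" in agents or BOT_KIND in agents

def add_paths(target, paths):
    # The special union: a path already covered by a shorter prefix is
    # redundant, e.g. {"/a/b"} union {"/a/b/c"} == {"/a/b"}.
    # (Plain string prefixes are used here; path-component matching is
    # not specified in the proposal.)
    for p in paths:
        if any(p.startswith(existing) for existing in target):
            continue
        target.difference_update({e for e in target if e.startswith(p)})
        target.add(p)

def parse_bots_txt(body):
    allowed, denied = set(), set()
    for line in body.splitlines():
        fields = re.split(r"[ \t]+", line.strip())
        if len(fields) < 3:
            continue
        paths = set(fields[0].split(","))
        if fields[2] == "allowed" and matches(fields[1]):
            add_paths(allowed, paths)
            if (len(fields) >= 6 and fields[3] == "but"
                    and fields[5] == "denied" and matches(fields[4])):
                add_paths(denied, paths)
        elif fields[2] == "denied" and matches(fields[1]):
            add_paths(denied, paths)
    return allowed, denied

def may_fetch(path, allowed, denied):
    # Longest prefix match between the two sets; a tie (and the case
    # where nothing matches at all) is undefined in the proposal --
    # this sketch allows it.
    best_allowed = max((len(p) for p in allowed if path.startswith(p)), default=-1)
    best_denied = max((len(p) for p in denied if path.startswith(p)), default=-1)
    return best_allowed >= best_denied

A crawler would call parse_bots_txt() on the body of a successful
(status 20, text/plain) response for /bots.txt, and then check
may_fetch() before each request.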
