Crawlers on Gemini and best practices

On Friday 11 December 2020 at 09:26:54 CET, Stephane Bortzmeyer wrote:
> > - know who you are: archiver, indexer, feed-reader, researcher etc.
> > - ask for /bots.txt
> > - if 20 text/plain then
> > -- allowed = set()
> > -- denied = set()
> > -- split response by newlines, for each line
> > --- split by spaces and tabs into fields
> > ---- paths = fields[0] split by ','
> > ---- if fields[2] is "allowed" and you in fields[1] split by ',' then
> > allowed = allowed union paths
> > ----- if fields[3] is "but" and fields[5] is "denied" then
> > denied = denied union (fields[4] split by ',')
> > ---- if fields[2] is "denied" and you in fields[1] split by ',' then
> > denied = denied union paths
> > you always match "all", never match "none"
> > union of paths is special:
> >     { "/a/b" } union { "/a/b/c" } ==> { "/a/b" }
> > 
> > when you request a path, find the longest match from allowed and
> > denied; if it is in allowed, you're allowed, otherwise not; on a
> > tie: undefined behaviour, do what you want.
> 
> It seems perfect.

I guess I'm not the only one needing some examples to fully understand
how this would work?

If I get it, it's something like so:
path1,path2 archiver,crawler allowed but path3 denied
path4 * denied
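
If that reading is right, the whole scheme fits in a short Python
sketch. Everything below is my own guess at the semantics, not part of
the proposal: the function names are made up, I treat both "*" and
"all" as agent names that match everyone, prefixes are plain string
prefixes, and I resolve the undefined tie case as denied.

def prefix_union(existing, new):
    # "Union of paths is special": a shorter prefix absorbs its
    # extensions, so {"/a/b"} union {"/a/b/c"} yields {"/a/b"}.
    merged = existing | new
    return {p for p in merged
            if not any(q != p and p.startswith(q) for q in merged)}

def parse_bots_txt(body, me):
    # Build the (allowed, denied) path sets for agent name `me`
    # from the body of a successful "20 text/plain" response.
    allowed, denied = set(), set()
    for line in body.splitlines():
        fields = line.split()          # split on spaces and tabs
        if len(fields) < 3:
            continue                   # skip blank/malformed lines
        paths = set(fields[0].split(","))
        agents = fields[1].split(",")
        if me not in agents and "*" not in agents and "all" not in agents:
            continue                   # line is not addressed to us
        if fields[2] == "allowed":
            allowed = prefix_union(allowed, paths)
            # optional "... but <paths> denied" tail on the same line
            if len(fields) >= 6 and fields[3] == "but" and fields[5] == "denied":
                denied = prefix_union(denied, set(fields[4].split(",")))
        elif fields[2] == "denied":
            denied = prefix_union(denied, paths)
    return allowed, denied

def may_fetch(path, allowed, denied):
    # Longest matching prefix from either set wins; ties are
    # undefined behaviour, so this sketch treats them as denied.
    best_allow = max((p for p in allowed if path.startswith(p)),
                     key=len, default="")
    best_deny = max((p for p in denied if path.startswith(p)),
                    key=len, default="")
    return len(best_allow) > len(best_deny)

rules = ("path1,path2 archiver,crawler allowed but path3 denied\n"
         "path4 * denied")
allowed, denied = parse_bots_txt(rules, "archiver")
print(may_fetch("path1/sub", allowed, denied))  # True: longest match is "path1"
print(may_fetch("path3", allowed, denied))      # False: denied by the "but" clause
print(may_fetch("path4", allowed, denied))      # False: "*" denies everyone

Does that match what everyone else has in mind?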
