Crawlers on Gemini and best practices

On Thu, Dec 10, 2020 at 02:43:11PM +0100,
 Stephane Bortzmeyer <stephane at sources.org> wrote 
 a message of 26 lines which said:

> The spec is quite vague about the *order* of directives.

Another example of why you cannot rely on robots.txt:
regexps. The official site <http://www.robotstxt.org/robotstxt.html>
is crystal-clear: "Note also that globbing and regular expression are
not supported in either the User-agent or Disallow lines".

But in the wild you find things like
<gemini://drewdevault.com/robots.txt>:

User-Agent: gus
Disallow: /cgi-bin/web.sh?*
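
For illustration, a quick Python sketch (mine, not GUS's actual code) of
how a parser that sticks to the original spec, treating Disallow values
as literal path prefixes, would handle that rule: the "?*" is matched
verbatim, so the rule blocks almost nothing.

# Hypothetical prefix-matching robots.txt check, as the original spec
# describes it: Disallow values are plain prefixes, globs are NOT expanded.
def is_disallowed(path, disallow_rules):
    return any(rule and path.startswith(rule) for rule in disallow_rules)

rules = ["/cgi-bin/web.sh?*"]  # the rule found in the wild

# Presumably meant to block every query to web.sh, but:
print(is_disallowed("/cgi-bin/web.sh?page=1", rules))  # False: "?*" is literal
print(is_disallowed("/cgi-bin/web.sh?*", rules))       # True: only a verbatim match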

Opinion: maybe we should specify a syntax for Gemini's robots.txt,
not relying on the broken Web one?
