Crawlers on Gemini and best practices

Yes, in my opinion you should respect robots.txt. It's not compulsory,
but it's currently the best way we have to respect servers' wishes and
bandwidth constraints. There is even a companion spec to the main
Gemini spec describing how robots.txt applies here:

gemini://gemini.circumlunar.space/docs/companion/robots.gmi

Read the companion spec for more detail, but you're indeed correct
that bots can't advertise who they are, since Gemini has no
user-agent. Instead, we have some agreed-upon crawler categories,
such as `researcher`, `indexer`, and `archiver`. It sounds like you
may want to respect `researcher` and call it a day :)
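To make that concrete, here is a minimal sketch of how a crawler
might honour those category-based rules. It leans on Python's
stdlib robots.txt parser and treats each category name as a
user-agent token; the robots.txt text and the `allowed` helper are
made-up examples, not part of the companion spec itself:

```python
# Sketch: deciding whether a Gemini bot may fetch a URL, assuming
# crawler categories are matched like user-agent tokens. The
# robots.txt content below is an invented example.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: researcher
Disallow: /private/

User-agent: archiver
Disallow: /
"""

def allowed(robots_txt: str, agents: list[str], url: str) -> bool:
    """Return True only if every category this bot falls under is allowed.

    `agents` lists each category the bot belongs to, e.g.
    ["researcher", "*"] for a survey crawler.
    """
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    # Deny the fetch if any applicable category is disallowed.
    return all(rp.can_fetch(agent, url) for agent in agents)

print(allowed(ROBOTS_TXT, ["researcher", "*"], "gemini://example.org/docs/"))      # True
print(allowed(ROBOTS_TXT, ["researcher", "*"], "gemini://example.org/private/x"))  # False
```

Fetching the server's actual /robots.txt over Gemini (TLS on port
1965) is left out here; only the rule-matching step is shown.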

Nat

On Tue, Dec 08, 2020 at 02:36:56PM +0100, Stephane Bortzmeyer wrote:
> I just developed a simple crawler for Gemini. Its goal is not to build
> another search engine but to perform some surveys of the
> geminispace. A typical result will be something like (real data, but
> limited in size):
>
> gemini://gemini.bortzmeyer.org/software/crawler/
>
> I have not yet let it loose on the Internet, because I have some
> questions.
>
> Is it "good practice" to follow robots.txt? There is no mention of it
> in the specification, but it could work for Gemini as well as for the
> Web, and I notice that some programs query this name on my server.
>
> Since Gemini (and rightly so) has no User-Agent, how can a bot
> advertise its policy and a point of contact?

---

Previous in thread (3 of 41): 🗣️ Petite Abeille (petite.abeille (a) gmail.com)

Next in thread (5 of 41): 🗣️ Stephane Bortzmeyer (stephane (a) sources.org)
