Crawlers on Gemini and best practices

Yes, in my opinion you should respect robots.txt. It's not compulsory,
but it's currently the best way we have to respect servers' wishes and
bandwidth constraints. There is even a companion spec to the main
Gemini spec describing how robots.txt applies here:

gemini://gemini.circumlunar.space/docs/companion/robots.gmi

Read the companion spec for more detail, but you're indeed correct
that bots can't advertise who they are, since Gemini has no
user-agent. Instead, we have some agreed-upon crawler categories,
such as `researcher`, `indexer`, and `archiver`. It sounds like you
may want to respect `researcher` and call it a day :)
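To make that concrete, here is a minimal sketch of how a crawler
might honour those category-based rules. It leans on Python's
stdlib robots.txt parser and treats each category name as a
user-agent token; the robots.txt text and the `allowed` helper are
made-up examples, not part of the companion spec itself:

```python
# Sketch: deciding whether a Gemini bot may fetch a URL, assuming
# crawler categories are matched like user-agent tokens. The
# robots.txt content below is an invented example.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: researcher
Disallow: /private/

User-agent: archiver
Disallow: /
"""

def allowed(robots_txt: str, agents: list[str], url: str) -> bool:
    """Return True only if every category this bot falls under is allowed.

    `agents` lists each category the bot belongs to, e.g.
    ["researcher", "*"] for a survey crawler.
    """
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    # Deny the fetch if any applicable category is disallowed.
    return all(rp.can_fetch(agent, url) for agent in agents)

print(allowed(ROBOTS_TXT, ["researcher", "*"], "gemini://example.org/docs/"))      # True
print(allowed(ROBOTS_TXT, ["researcher", "*"], "gemini://example.org/private/x"))  # False
```

Fetching the server's actual /robots.txt over Gemini (TLS on port
1965) is left out here; only the rule-matching step is shown.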

Nat

On Tue, Dec 08, 2020 at 02:36:56PM +0100, Stephane Bortzmeyer wrote:
> I just developed a simple crawler for Gemini. Its goal is not to build
> another search engine but to perform some surveys of the
> geminispace. A typical result will be something like (real data, but
> limited in size):
>
> gemini://gemini.bortzmeyer.org/software/crawler/
>
> I have not yet let it loose on the Internet, because I have some
> questions.
>
> Is it "good practice" to follow robots.txt? There is no mention of it
> in the specification, but it could work for Gemini as well as for the
> Web, and I notice that some programs query this name on my server.
>
> Since Gemini (and rightly so) has no User-Agent, how can a bot
> advertise its policy and a point of contact?

---

Previous in thread (3 of 41): 🗣️ Petite Abeille (petite.abeille (a) gmail.com)

Next in thread (5 of 41): 🗣️ Stephane Bortzmeyer (stephane (a) sources.org)
