robots.txt for Gemini formalised

November 22, 2020 9:05 PM, "Sean Conner" <sean at conman.org> wrote:

> It was thus said that the Great Robert khuxkm Miles once stated:
> 
>> Is there any good use case for a proxy User-Agent in robots.txt, other than
>> blocking web spiders from being able to crawl gemspace? If not, I would be
>> in favor of dropping that part of the definition.
> 
> I'm in favor of dropping that part of the definition as it doesn't make
> sense at all. Given a web-based proxy at <https://example.com/gemini>, web
> crawlers will consult <https://example.com/robots.txt> for guidance, not
> <https://example.com/gemini?gemini.conman.org/robots.txt>. Web crawlers
> will not be able to crawl gemini space for two main reasons:
> 
> 1. Most server certificates are self-signed and opt out of the CA
> business. And even if a crawler were to accept self-signed
> (or non-standard CA signed) certificates, then---
> 
> 2. The Gemini protocol is NOT HTTP, so all such HTTP requests will
> fail anyway.
> 
> -spc

Well, the argument is that the crawler would access 
<https://example.com/gemini?gemini://gemini.conman.org/>, and from there 
it could access 
<https://example.com/gemini?gemini://zaibatsu.circumlunar.space/>, and 
then <https://example.com/gemini?gemini://gemini.circumlunar.space/>, and 
so on. However, I'd argue that the onus falls on example.com to set a 
rule in their own <https://example.com/robots.txt> that prevents web 
crawlers from indexing anything through their proxy.
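
For example, something along these lines in that robots.txt (assuming 
the proxy is served under the /gemini path, as in the URLs above) would 
keep any well-behaved web crawler out of the proxied content:

    User-agent: *
    Disallow: /gemini

Since Disallow rules match by prefix, that single rule covers every 
<https://example.com/gemini?gemini://...> URL the proxy can generate.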

Just my two cents,
Robert "khuxkm" Miles

---

Previous in thread (9 of 70): 🗣️ Sean Conner (sean (a) conman.org)

Next in thread (11 of 70): 🗣️ Drew DeVault (sir (a) cmpwn.com)
