November 22, 2020 9:05 PM, "Sean Conner" <sean at conman.org> wrote:

> It was thus said that the Great Robert khuxkm Miles once stated:
>
>> Is there any good use case for a proxy User-Agent in robots.txt, other
>> than blocking web spiders from being able to crawl gemspace? If not, I
>> would be in favor of dropping that part of the definition.
>
> I'm in favor of dropping that part of the definition as it doesn't make
> sense at all. Given a web-based proxy at <https://example.com/gemini>, web
> crawlers will check <https://example.com/robots.txt> for guidance, not
> <https://example.com/gemini?gemini.conman.org/robots.txt>. Web crawlers
> will not be able to crawl gemini space for two main reasons:
>
> 1. Most server certificates are self-signed and opt out of the CA
>    business. And even if a crawler were to accept self-signed
>    (or non-standard CA signed) certificates, then---
>
> 2. The Gemini protocol is NOT HTTP, so all such HTTP requests will
>    fail anyway.
>
> -spc

Well, the argument is that the crawler would access
<https://example.com/gemini?gemini://gemini.conman.org/>, and from there it
could access <https://example.com/gemini?gemini://zaibatsu.circumlunar.space/>,
then <https://example.com/gemini?gemini://gemini.circumlunar.space/>, and so
on. However, I'd argue that the onus falls on example.com to set a robots.txt
rule in <https://example.com/robots.txt> to prevent web crawlers from indexing
anything via their proxy (a minimal sketch of such a rule follows below).

Just my two cents,
Robert "khuxkm" Miles
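For illustration, a minimal robots.txt along those lines, assuming the proxy
is served under the /gemini path on example.com as in the URLs above, might
be nothing more than:

  # Hypothetical robots.txt served at https://example.com/robots.txt.
  # Disallow rules are prefix matches against the URL path, so "/gemini"
  # also covers proxy requests like /gemini?gemini://... .
  User-agent: *
  Disallow: /gemini

That keeps well-behaved web crawlers out of the entire proxied Gemini space
without requiring any cooperation from the Gemini servers themselves.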