💾 Archived View for gemi.dev › gemini-mailing-list › 000054.gmi captured on 2024-05-26 at 15:12:18. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
I'm going through my Gemini logs, and I'm finding this:

remote=XXX.XXX.XXX.XXX status=51 request="gemini://gemini.conman.org/sourcecode/lua:1965/robots.txt" bytes=14 subject="" issuer=""
remote=XXX.XXX.XXX.XXX status=51 request="gemini://gemini.conman.org/sourcecode/lua/glv-1:1965/robots.txt" bytes=14 subject="" issuer=""
remote=XXX.XXX.XXX.XXX status=51 request="gemini://gemini.conman.org/sourcecode/lua/glv-1/handlers:1965/robots.txt" bytes=14 subject="" issuer=""
remote=XXX.XXX.XXX.XXX status=51 request="gemini://gemini.conman.org/sourcecode/lua/glv-1/handlers/filesystem.lua:1965/robots.txt" bytes=14 subject="" issuer=""
remote=XXX.XXX.XXX.XXX status=51 request="gemini://gemini.conman.org/sourcecode/lua/glv-1/handlers/sample.lua:1965/robots.txt" bytes=14 subject="" issuer=""
remote=XXX.XXX.XXX.XXX status=51 request="gemini://gemini.conman.org/sourcecode/lua/glv-1/handlers/userdir.lua:1965/robots.txt" bytes=14 subject="" issuer=""
remote=XXX.XXX.XXX.XXX status=51 request="gemini://gemini.conman.org/sourcecode/lua/glv-1/msg.lua:1965/robots.txt" bytes=14 subject="" issuer=""
remote=XXX.XXX.XXX.XXX status=51 request="gemini://gemini.conman.org/sourcecode/lua/glv-1/cgi.lua:1965/robots.txt" bytes=14 subject="" issuer=""

(I'm censoring the IP to protect the guilty here)

I don't mind the crawling, but I am concerned about the references to robots.txt. In the web world, robots.txt lives at the top level and *only* at the top level. I don't think there's been an official response from solderpunk about robots.txt, but I would expect it to be very similar to how it works on the web---the top level only.

But a clarification would be nice (either way). In my opinion, it should only live at the top level, but I can adapt to every "directory" as well.

-spc
On Sat, Mar 21, 2020 at 09:39:46PM -0400, Sean Conner wrote:

> I don't mind the crawling, but I am concerned about the references to
> robots.txt. In the web world, robots.txt lives at the top level and *only*
> at the top level. I don't think there's been an official response from
> solderpunk about robots.txt, but I would expect it to be very similar to how
> it works on the web---the top level only.
>
> But a clarification would be nice (either way). In my opinion, it should
> only live at the top level, but I can adapt to every "directory" as well.

This is nicely timed, actually, as things like robots.txt are now looming larger on my personal radar than they have previously - with CAPCOM I am writing for the first time a program which automatically makes Gemini requests, and I'm very keen on making sure that it's a "good citizen".

There hasn't been too much overt discussion of good Gemini citizenship yet, but now that non-human clients are becoming more common, there should be. Robots.txt is obviously part of that package. (It's *not* super relevant to feed aggregation, because nobody publishes a feed without the expectation that it is read entirely by bots, but other issues, especially rate limiting, are.)

It's been many years since I read any robots.txt specs from the web. I will refresh my memory and start thinking about this, and asking questions, in the hopes that we can finalise some stuff soon.

Cheers,
Solderpunk
FWIW I'm 99% sure those are requests from GUS, and I agree that it should be top level only. That was a regression in GUS' crawling code, which I've now fixed! I'm still very happy to accommodate more official guidance on how robots.txt should work, but in the meantime (and in the absence of any more regressions, eep!) I plan to check top-level-only robots.txt. So sorry about the :bug:!
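For anyone else writing a crawler, here is a minimal Python sketch of the top-level-only lookup discussed in this thread. It is not GUS's actual code: the function name `robots_url` and the choice to omit the default port 1965 from the result are assumptions made for illustration. The idea is to throw away the path of the page being crawled and keep only the host (and any non-default port), so that the malformed `<path>:1965/robots.txt` requests seen in the logs above cannot be produced.

```python
from urllib.parse import urlsplit

# Default Gemini port; left out of the generated URL (an illustrative choice).
DEFAULT_GEMINI_PORT = 1965


def robots_url(page_url: str) -> str:
    """Return the top-level robots.txt URL for the host serving page_url.

    Hypothetical helper: the path, query and fragment of page_url are
    discarded, so robots.txt is only ever requested at the root.
    """
    parts = urlsplit(page_url)
    if parts.scheme != "gemini" or not parts.hostname:
        raise ValueError(f"not a gemini URL with a host: {page_url!r}")

    host = parts.hostname
    if ":" in host:
        # urlsplit strips the brackets from IPv6 literals; put them back.
        host = f"[{host}]"

    port = parts.port
    if port is not None and port != DEFAULT_GEMINI_PORT:
        host = f"{host}:{port}"

    return f"gemini://{host}/robots.txt"


if __name__ == "__main__":
    print(robots_url("gemini://gemini.conman.org/sourcecode/lua/glv-1/msg.lua"))
```

A crawler built this way would typically cache the result per host, so that robots.txt is fetched once per capsule and not once per page.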