

WWW indexing concerns (was: Gemini Universal Search)

Andrew Kennedy andrew at 68kmentat.com

Wed Feb 26 17:07:39 GMT 2020

- - - - - - - - - - - - - - - - - - -

Very cool! I want to express two points: one, I am glad to see this. As geminispace gets larger, a search method that doesn't rely on personal bookmarks will be really useful. I forget to bookmark gopher documents all the time, and Veronica really helps there.

Two, crawling in general is something that I've been thinking about lately, and I guess now it's timely:

HTTP/S gemini proxies allow all of the public geminispace to be indexed by Google and other services. I think that we generally consider the two current proxies to be useful (I certainly do), so to a point this is unavoidable.

proxy.vulpes.one has a robots.txt to prevent WWW crawling, which is a fair stopgap. portal.mozz.us does not, but I'm not angry about it. I meant to send out an e-mail to mozz to ask their opinion, but just haven't gotten around to it.
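
For anyone who hasn't dealt with robots.txt before, turning away well-behaved crawlers only takes a couple of lines. This is just a sketch of the general idea, not what vulpes.one actually serves:

```
# Served at https://proxy.vulpes.one/robots.txt (illustrative, not the real file)
User-agent: *
Disallow: /
```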

My issue here is that the only two ways to opt out of Google indexing are to use a robots.txt, or to register yourself as the domain owner and control indexing via Google's Search Console. Both of those methods are available to the proxy website's owner, not to the gemini server's owner, because each gemini server appears as a subdirectory on the proxy's domain.
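
To illustrate what I mean (the exact path layout here is just a guess on my part, and the real proxies may arrange their URLs differently):

```
gemini://example.com/journal.gmi                          <- what the capsule owner controls
https://portal.mozz.us/gemini/example.com/journal.gmi     <- what Google actually crawls
https://proxy.vulpes.one/gemini/example.com/journal.gmi
```

As far as Google is concerned, those pages belong to mozz.us and vulpes.one, not to example.com, so the capsule owner has no lever of their own to pull.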

So in practice, the only way to opt out of being indexed is to contact each proxy maintainer and ask them to make accommodations for you. That's manageable with only 15 or so gemini servers, but it won't be fair to proxy maintainers as gemini grows. It's also not enough to ask every proxy to use a robots.txt, because nothing stops a new proxy from ignoring that convention, whether out of ignorance or in bad faith.

Perhaps there isn't much that can be done here, and this e-mail is little more than me venting a concern. I realize that the only way to stay anonymous on the internet is to constantly maintain your anonymity, and I'm doing that in some places. But my gemini server's domain name is already tied to my IRL identity, and I wish it were at least harder for my gemini files to end up on the first page of a Google result.

This e-mail got a little long, and it's less formal than some of the other discussions. Sorry if it's bothersome. I'm not really a programmer, so I can't offer any solutions. But I wanted to throw this conversation into the general dialogue in case anyone has any thoughts or ideas here. Gemini is *not* the Web, after all.

- m68k


 ---- On Wed, 26 Feb 2020 07:00:02 -0500  <gemini-request at lists.orbitalfox.eu> wrote ----

 
> One technical question is the issue of how server admins can opt out 
> of having their stuff crawled.  GUS currently recognises a /robots.txt 
> resource with (I presume) identical syntax to that used for HTTP. 
> This is certainly one potential solution to the problem (and perhaps 
> the most sensible one), but we might want to consider others. 
>