Hey all!

I'm experimenting with building a web app that includes some proxying of
Gemini content, and I was wondering if anyone has put together an open
source proxy for Gemini. I am aware of https://portal.mozz.us/ and
https://proxy.vulpes.one/, but I can't find the code for either of them.

On a related note, I am working on a Python library to convert gmi files
to HTML, which you all may find helpful:
https://git.sr.ht/~alexwennerberg/gmi2html

All the best,

Alex
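For context, a minimal sketch of the kind of conversion gmi2html performs; this is not the library's actual code, just the general shape of the problem. Gemtext is line-oriented, so each line maps to at most one HTML element (grouping of consecutive list items and handling of preformatted toggles are elided here):

```
import html

# Minimal gemtext-to-HTML sketch: the general shape, not gmi2html's
# actual code. Wrapping consecutive <li> in <ul> and handling the
# preformatted toggle are left out for brevity.

def gmi_line_to_html(line: str) -> str:
    if line.startswith("=>"):
        parts = line[2:].strip().split(maxsplit=1)
        url = html.escape(parts[0]) if parts else ""
        label = html.escape(parts[1]) if len(parts) > 1 else url
        return f'<p><a href="{url}">{label}</a></p>'
    if line.startswith("###"):
        return f"<h3>{html.escape(line[3:].strip())}</h3>"
    if line.startswith("##"):
        return f"<h2>{html.escape(line[2:].strip())}</h2>"
    if line.startswith("#"):
        return f"<h1>{html.escape(line[1:].strip())}</h1>"
    if line.startswith(">"):
        return f"<blockquote>{html.escape(line[1:].strip())}</blockquote>"
    if line.startswith("* "):
        return f"<li>{html.escape(line[2:])}</li>"
    return f"<p>{html.escape(line)}</p>"
```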
On 2020-07-22 (Wednesday) at 19:48, alex wennerberg <alex at alexwennerberg.com> wrote:

> I'm experimenting with building a web app that includes some proxying of
> Gemini content and I was wondering if anyone has put together an open
> source proxy for Gemini. I am aware of https://portal.mozz.us/ and
> https://proxy.vulpes.one/, but I can't find the code for either of them.

Though the vulpes code isn't easy to find, I did find it at
https://git.feuerfuchs.dev/Feuerfuchs/gopherproxy. I look forward to
your work! I've been wanting to include a proxy with breadpunk.club.

--
~ acdw
acdw.net | breadpunk.club/~breadw
Nice find! Thanks for the help :)

On Wed Jul 22, 2020 at 9:35 PM CDT, acdw wrote:

> Though the vulpes code isn't easy to find, I did find it at
> https://git.feuerfuchs.dev/Feuerfuchs/gopherproxy. I look forward to
> your work! I've been wanting to include a proxy with breadpunk.club.
"alex wennerberg" <alex at alexwennerberg.com> writes: > I'm experimenting with building a web app that includes some proxying of > Gemini content and I was wondering if anyone has put together an open > source proxy for Gemini. I am aware of https://portal.mozz.us/ and > https://proxy.vulpes.one/, but I can't find the code for either of them. This is cool, but when you stand it up, don't forget an appropriate robots.txt! -- +-----------------------------------------------------------+ | Jason F. McBrayer jmcbray at carcosa.net | | A flower falls, even though we love it; and a weed grows, | | even though we do not love it. -- Dogen |
It was thus said that the Great Jason McBrayer once stated:

> This is cool, but when you stand it up, don't forget an appropriate
> robots.txt!

  Question---HTTP has the User-Agent: header to help identify webbots, but
Gemini doesn't have that. How do I instruct a Gemini bot with robots.txt
when there's no way for a Gemini bot to identify itself?

  -spc
> On Jul 23, 2020, at 22:45, Sean Conner <sean at conman.org> wrote:
>
> How do I instruct a Gemini bot with robots.txt, when there's no way
> for a Gemini bot to identify itself?

Through a side channel such as the TLS certificate? Robots could
identify themselves there. Otherwise, only blanket User-agent: * rules
apply. Alternatively, some creative use of TLS fingerprinting.
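To make the certificate idea concrete: a crawler could present a TLS client certificate whose subject names the bot, which a server could then match against robots.txt rules. A rough sketch in Python, assuming hypothetical certificate files named indexer-cert.pem and indexer-key.pem:

```
import socket
import ssl

# Sketch of the "identify via TLS certificate" side channel: the bot
# presents a client certificate whose subject names it, and a server
# could inspect that name when deciding how to apply robots.txt.
# The certificate and key file names here are hypothetical.

context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
context.check_hostname = False
context.verify_mode = ssl.CERT_NONE  # Gemini capsules are usually self-signed
context.load_cert_chain("indexer-cert.pem", "indexer-key.pem")

with socket.create_connection(("example.org", 1965)) as sock:
    with context.wrap_socket(sock, server_hostname="example.org") as tls:
        tls.sendall(b"gemini://example.org/robots.txt\r\n")
        response = tls.makefile("rb").read()
```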
For the GUS crawl at least, the crawler doesn't identify itself _to_
crawled sites, but it does obey blocks of rules in robots.txt files
according to user-agent. So it works without needing a user-agent
header.

It obeys user-agent of `*`, `indexer`, and `gus`, in order of
increasing importance.

There's been some talk of the generic sorts of user-agents in the
past, which I think is a really nice idea. If `indexer` is a
user-agent that both sites and crawlers had some sort of informal
consensus on, then sites wouldn't need to worry about keeping up with
any new indexers popping up.

Some other generic user-agent ideas, iirc, were `archiver` and
`proxy`.

On Thu, Jul 23, 2020 at 04:45:50PM -0400, Sean Conner wrote:

> Question---HTTP has the User-Agent: header to help identify webbots, but
> Gemini doesn't have that. How do I instruct a Gemini bot with robots.txt
> when there's no way for a Gemini bot to identify itself?
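A sketch of how that precedence might be implemented, with a deliberately minimal robots.txt parser; parse_robots and is_allowed are illustrative names, not GUS's actual code:

```
def parse_robots(robots_txt):
    """Map each user-agent to its Disallow prefixes (minimal parser)."""
    rules, current, in_agents = {}, [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not in_agents:       # a new block of rules begins
                current = []
            current.append(value.lower())
            rules.setdefault(value.lower(), [])
            in_agents = True
        else:
            if field == "disallow" and value:
                for agent in current:
                    rules[agent].append(value)
            in_agents = False
    return rules

def is_allowed(rules, path, identities=("gus", "indexer", "*")):
    """Obey the most specific block that names one of our identities."""
    for identity in identities:     # in order of decreasing importance
        if identity in rules:
            return not any(path.startswith(p) for p in rules[identity])
    return True                     # no matching block at all: allowed
```

So a crawler answering to `gus` checks its own name first, falls back to `indexer`, then `*`; for example, is_allowed(parse_robots("User-agent: indexer\nDisallow: /private/"), "/private/log.gmi") returns False.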
It was thus said that the Great Natalie Pendragon once stated:

> There's been some talk of the generic sorts of user-agents in the
> past, which I think is a really nice idea. If `indexer` is a
> user-agent that both sites and crawlers had some sort of informal
> consensus on, then sites wouldn't need to worry about keeping up with
> any new indexers popping up.
>
> Some other generic user-agent ideas, iirc, were `archiver` and
> `proxy`.

  That's a decent idea, but that still doesn't help when I want to block
a particular bot for "misbehaving" (in some nebulous way). For example,
there's this one bot, "The Knowledge AI", which sends requests like

    /%22http:/wesiseli.com/magician/%22 [1]

(and yes, that's an actual example, pulled straight off the log file).
It's not quite yet annoying enough to block [2], but at least I have
some chance of blocking it via robots.txt (which it does request).

  -spc

[1] I can't quite figure out why it includes the quotes as part of the
    link. *All* the links on my websites look like:

        <a href="http://example.com/">

    and for the most part, it can parse those links correctly. And
    that's not limited to just *one* bot; several of them have that
    behavior.

[2] Although it's nearly impossible to find anything out about it, as
    the user-agent string is literally "The Knowledge AI", so it might
    be worth blocking it just out of spite.
On Fri Jul 24, 2020 at 12:01 AM CEST, Natalie Pendragon wrote:

> There's been some talk of the generic sorts of user-agents in the
> past, which I think is a really nice idea. If `indexer` is a
> user-agent that both sites and crawlers had some sort of informal
> consensus on, then sites wouldn't need to worry about keeping up with
> any new indexers popping up.
>
> Some other generic user-agent ideas, iirc, were `archiver` and
> `proxy`.

I still really like this idea. It will be a long and tedious
undertaking to build up some kind of rough consensus on a set of
user-agents with good coverage and granularity across different
scenarios, but I think it might be worth the effort. It would be great
to finally get a proper "robots.txt for Gemini" side spec written up.

Cheers,
Solderpunk
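To illustrate what such a side spec might enable, here is a hypothetical capsule robots.txt using the generic user-agents floated in this thread; the names and paths are only examples, not an agreed spec:

```
# Hypothetical robots.txt for a Gemini capsule, assuming the generic
# user-agents discussed in this thread were adopted by consensus.

User-agent: archiver
Disallow: /journal/

User-agent: proxy
Disallow: /

User-agent: indexer
Disallow: /cgi-bin/

User-agent: *
Disallow: /private/
```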
On Fri Jul 24, 2020 at 2:59 AM CEST, Sean Conner wrote:

> That's a decent idea, but that still doesn't help when I want to block
> a particular bot for "misbehaving" (in some nebulous way).

This is true, but at the end of the day, even if we had a user-agent
header, a badly written bot can always ignore robots.txt, or request
robots.txt and parse/respect it incorrectly, or just regularly change
its user-agent to evade restrictions. There will *always* be scenarios
where admins simply have to resort to IP bans. Gemini just bumps into
those scenarios slightly sooner than the web does.

Some kind of official documentation on how to write good bots would
probably not go astray...

Cheers,
Solderpunk
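Until such documentation exists, a rough sketch of the minimum a well-behaved bot might do: fetch and obey robots.txt per host, answer to a generic user-agent group, and rate-limit itself. fetch_gemini below is a bare-bones fetch written for this sketch, and parse_robots/is_allowed are the illustrative helpers from earlier in the thread:

```
import socket
import ssl
import time
from urllib.parse import urlparse

def fetch_gemini(url, timeout=10):
    """Bare-bones Gemini fetch; returns the response body as text."""
    parsed = urlparse(url)
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE   # capsules are usually self-signed
    with socket.create_connection((parsed.hostname, parsed.port or 1965),
                                  timeout) as sock:
        with context.wrap_socket(sock,
                                 server_hostname=parsed.hostname) as tls:
            tls.sendall(url.encode() + b"\r\n")
            data = tls.makefile("rb").read()
    # Drop the "<status> <meta>\r\n" header line; status handling elided.
    return data.split(b"\r\n", 1)[-1].decode("utf-8", "replace")

ROBOTS_CACHE = {}   # hostname -> parsed robots.txt rules
CRAWL_DELAY = 5     # crude politeness; a real bot tracks per-host timestamps

def polite_fetch(url, identity="indexer"):
    """Fetch url only if the host's robots.txt permits our identity."""
    parsed = urlparse(url)
    if parsed.hostname not in ROBOTS_CACHE:
        robots = fetch_gemini(f"gemini://{parsed.hostname}/robots.txt")
        ROBOTS_CACHE[parsed.hostname] = parse_robots(robots)
    if not is_allowed(ROBOTS_CACHE[parsed.hostname], parsed.path or "/",
                      identities=(identity, "*")):
        return None                  # robots.txt says hands off
    time.sleep(CRAWL_DELAY)
    return fetch_gemini(url)
```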