Open Source Proxy

alex wennerberg <alex (a) alexwennerberg.com>

Hey all!

I'm experimenting with building a web app that includes some proxying of
Gemini content and I was wondering if anyone has put together an open
source proxy for Gemini. I am aware of https://portal.mozz.us/ and
https://proxy.vulpes.one/, but I can't find the code for either of them.
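
A minimal sketch of the raw fetch such a proxy performs, assuming
Python's standard ssl module; the host and URL are placeholders, and
certificate verification is relaxed because Gemini servers commonly
use self-signed certificates:

    import socket
    import ssl

    def gemini_fetch(url, host, port=1965):
        # Gemini servers are usually trust-on-first-use, so this sketch
        # skips certificate verification entirely.
        ctx = ssl.create_default_context()
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
        with socket.create_connection((host, port)) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                # A Gemini request is just the absolute URL plus CRLF.
                tls.sendall((url + "\r\n").encode("utf-8"))
                data = b""
                while True:
                    chunk = tls.recv(4096)
                    if not chunk:
                        break
                    data += chunk
        # The response is one "<status> <meta>" header line, then the body.
        header, _, body = data.partition(b"\r\n")
        return header.decode("utf-8"), body

    header, body = gemini_fetch("gemini://example.com/", "example.com")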

On a related note, I am working on a Python library to convert gmi files to
html which you all may find helpful: https://git.sr.ht/~alexwennerberg/gmi2html
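
Since gemtext is line-oriented, the conversion is a single pass over
the document. The sketch below illustrates the general approach only;
it is not gmi2html's actual API:

    import html

    def gmi_to_html(gemtext):
        # gemtext's only multi-line construct is the ``` preformatted block.
        out, pre = [], False
        for line in gemtext.splitlines():
            if line.startswith("```"):
                pre = not pre
                out.append("<pre>" if pre else "</pre>")
            elif pre:
                out.append(html.escape(line))
            elif line.startswith("=>"):
                # Link line: "=> URL optional label"
                parts = line[2:].strip().split(maxsplit=1)
                url = html.escape(parts[0]) if parts else ""
                label = html.escape(parts[1]) if len(parts) > 1 else url
                out.append('<p><a href="%s">%s</a></p>' % (url, label))
            elif line.startswith("#"):
                level = min(len(line) - len(line.lstrip("#")), 3)
                text = html.escape(line.lstrip("#").strip())
                out.append("<h%d>%s</h%d>" % (level, text, level))
            else:
                out.append("<p>%s</p>" % html.escape(line))
        return "\n".join(out)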

All the best,

Alex


acdw <acdw (a) acdw.net>

On 2020-07-22 (Wednesday) at 19:48, alex wennerberg <alex at alexwennerberg.com> wrote:

> Hey all!
> 
> I'm experimenting with building a web app that includes some proxying of
> Gemini content and I was wondering if anyone has put together an open
> source proxy for Gemini. I am aware of https://portal.mozz.us/ and
> https://proxy.vulpes.one/, but I can't find the code for either of them.

Though the vulpes code isn't easy to find, I did find it at 
https://git.feuerfuchs.dev/Feuerfuchs/gopherproxy. I look forward to your 
work! I've been wanting to include a proxy with breadpunk.club. 

-- 
~ acdw
acdw.net | breadpunk.club/~breadw


alex wennerberg <alex (a) alexwennerberg.com>

Nice find! Thanks for the help :)

On Wed Jul 22, 2020 at 9:35 PM CDT, acdw wrote:
> On 2020-07-22 (Wednesday) at 19:48, alex wennerberg
> <alex at alexwennerberg.com> wrote:
>
> > Hey all!
> > 
> > I'm experimenting with building a web app that includes some proxying of
> > Gemini content and I was wondering if anyone has put together an open
> > source proxy for Gemini. I am aware of https://portal.mozz.us/ and
> > https://proxy.vulpes.one/, but I can't find the code for either of them.
>
> Though the vulpes code isn't easy to find, I did find it at
> https://git.feuerfuchs.dev/Feuerfuchs/gopherproxy. I look forward to
> your work! I've been wanting to include a proxy with breadpunk.club.
>
> --
> ~ acdw
> acdw.net | breadpunk.club/~breadw


Jason McBrayer <jmcbray (a) carcosa.net>

"alex wennerberg" <alex at alexwennerberg.com> writes:

> I'm experimenting with building a web app that includes some proxying of
> Gemini content and I was wondering if anyone has put together an open
> source proxy for Gemini. I am aware of https://portal.mozz.us/ and
> https://proxy.vulpes.one/, but I can't find the code for either of them.

This is cool, but when you stand it up, don't forget an appropriate
robots.txt! 
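
For a web-facing proxy, even a blanket file keeps web search engines
from crawling all of Geminispace through it. A sketch, not a
prescribed policy:

    User-agent: *
    Disallow: /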

-- 
+-----------------------------------------------------------+
| Jason F. McBrayer                    jmcbray at carcosa.net  |
| A flower falls, even though we love it; and a weed grows, |
| even though we do not love it.            -- Dogen        |
+-----------------------------------------------------------+


Sean Conner <sean (a) conman.org>

It was thus said that the Great Jason McBrayer once stated:
> 
> This is cool, but when you stand it up, don't forget an appropriate
> robots.txt! 

  Question---HTTP has the User-Agent: header to help identify webbots, but
Gemini doesn't have that.  How do I instruct a Gemini bot with robots.txt,
when there's no way for a Gemini bot to identify itself?

  -spc


Petite Abeille <petite.abeille (a) gmail.com>



> On Jul 23, 2020, at 22:45, Sean Conner <sean at conman.org> wrote:
> 
> How do I instruct a Gemini bot with robots.txt,
> when there's no way for a Gemini bot to identify itself?

Through a side channel such as the TLS certificate? Robots could identify 
themselves there.

Otherwise, only blanket User-agent: * rules apply.

Alternatively, some creative use of TLS fingerprinting.
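
As a sketch of the certificate idea: a crawler could present a
self-signed client certificate whose Common Name names the bot, and a
server that requests client certificates could read it during the
handshake. The file names and the "gus" CN below are invented for
illustration:

    import socket
    import ssl

    # An identity certificate could be generated with, for example:
    #   openssl req -x509 -newkey rsa:2048 -nodes \
    #       -keyout bot.key -out bot.crt -subj "/CN=gus"
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    # Presented only if the server asks for a client certificate.
    ctx.load_cert_chain("bot.crt", "bot.key")

    with socket.create_connection(("example.com", 1965)) as sock:
        with ctx.wrap_socket(sock, server_hostname="example.com") as tls:
            tls.sendall(b"gemini://example.com/\r\n")
            response = tls.recv(4096)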


Natalie Pendragon <natpen (a) natpen.net>

For the GUS crawl at least, the crawler doesn't identify itself _to_
crawled sites, but it does obey blocks of rules in robots.txt files
according to user-agent. So it works without needing a user-agent
header.

It obeys user-agent of `*`, `indexer`, and `gus` in order of
increasing importance.

There's been some talk of the generic sorts of user-agents in the
past, which I think is a really nice idea. If `indexer` is a
user-agent that both sites and crawlers have some sort of informal
consensus on, then sites wouldn't need to worry about keeping up with
any new indexers popping up.

Some other generic user-agent ideas, iirc, were `archiver` and
`proxy`.
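
A sketch of how a crawler might apply that precedence, assuming the
most specific block it recognizes replaces the less specific ones
outright (the parsing is deliberately simplified):

    ROBOTS_TXT = """\
    User-agent: *
    Disallow: /private/

    User-agent: indexer
    Disallow: /logs/

    User-agent: gus
    Disallow:
    """

    def rules_for(robots_txt, agents=("*", "indexer", "gus")):
        # Group Disallow paths under the User-agent line above them.
        blocks, current = {}, None
        for line in robots_txt.splitlines():
            line = line.split("#")[0].strip()
            if line.lower().startswith("user-agent:"):
                current = line.split(":", 1)[1].strip().lower()
                blocks.setdefault(current, [])
            elif line.lower().startswith("disallow:") and current is not None:
                path = line.split(":", 1)[1].strip()
                if path:  # an empty Disallow means "allow everything"
                    blocks[current].append(path)
        # Walk the agents in order of increasing importance; the last
        # block that exists wins outright.
        rules = []
        for agent in agents:
            if agent in blocks:
                rules = blocks[agent]
        return rules

    print(rules_for(ROBOTS_TXT))  # -> [], because the `gus` block wins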

On Thu, Jul 23, 2020 at 04:45:50PM -0400, Sean Conner wrote:
> It was thus said that the Great Jason McBrayer once stated:
> >
> > This is cool, but when you stand it up, don't forget an appropriate
> > robots.txt!
>
>   Question---HTTP has the User-Agent: header to help identify webbots, but
> Gemini doesn't have that.  How do I instruct a Gemini bot with robots.txt,
> when there's no way for a Gemini bot to identify itself?
>
>   -spc
>


Sean Conner <sean (a) conman.org>

It was thus said that the Great Natalie Pendragon once stated:
> For the GUS crawl at least, the crawler doesn't identify itself _to_
> crawled sites, but it does obey blocks of rules in robots.txt files
> according to user-agent. So it works without needing a user-agent
> header.
> 
> It obeys user-agent of `*`, `indexer`, and `gus` in order of
> increasing importance.
> 
> There's been some talk of the generic sorts of user-agents in the
> past, which I think is a really nice idea. If `indexer` is a
> user-agent that both sites and crawlers have some sort of informal
> consensus on, then sites wouldn't need to worry about keeping up with
> any new indexers popping up.
> 
> Some other generic user-agent ideas, iirc, were `archiver` and
> `proxy`.

  That's a decent idea, but that still doesn't help when I want to block a
particular bot for "misbehaving" (in some nebulous way).  For example,
there's this one bot, "The Knowledge AI" which sends requests like

	/%22http:/wesiseli.com/magician/%22 [1]

(and yes, that's an actual example, pulled straight off the log file).  It's
not quite yet annoying enough to block [2] but at least I have some chance
of blocking it via robots.txt (which it does request).  

  -spc

[1]	I can't quite figure out why it includes the quotes as part of the
	link.  *All* the links on my websites look like:

		<a href="http://example.com/">

	and for the most part, it can parse those links correctly.  And
	it's not limited to just *one* bot; several of them show the same
	behavior.

[2]	It's nearly impossible to find anything out about it, as the
	user-agent string is literally "The Knowledge AI", so it might be
	worth blocking it just out of spite.


Solderpunk <solderpunk (a) posteo.net>

On Fri Jul 24, 2020 at 12:01 AM CEST, Natalie Pendragon wrote:

> There's been some talk of the generic sorts of user-agents in the
> past, which I think is a really nice idea. If `indexer` is a
> user-agent that both sites and crawlers have some sort of informal
> consensus on, then sites wouldn't need to worry about keeping up with
> any new indexers popping up.
>
> Some other generic user-agent ideas, iirc, were `archiver` and
> `proxy`.

I still really like this idea.  It will be a long and tedious
undertaking to build up some kind of rough consensus on a set of
user-agents with good coverage and granularity of different scenarios,
but I think it might be worth the effort.

It would be great to finally get a proper "robots.txt for Gemini" side
spec written up.

Cheers,
Solderpunk


Solderpunk <solderpunk (a) posteo.net>

On Fri Jul 24, 2020 at 2:59 AM CEST, Sean Conner wrote:

> That's a decent idea, but that still doesn't help when I want to block a
> particular bot for "misbehaving" (in some nebulous way).

This is true, but at the end of the day, even if we had a user-agent
header, a badly written bot can always ignore robots.txt, or request
robots.txt and parse/respect it incorrectly, or just regularly change
its user-agent to evade restrictions.  There will *always* be scenarios
where admins simply have to resort to IP bans.  Gemini just bumps into
those scenarios slightly sooner than the web.

Some kind of official documentation on how to write good bots would
probably not go astray...
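
In that spirit, a sketch of the robots.txt check a well-behaved bot
might run before each request. Here `fetch` stands in for whatever
Gemini request function the bot already has (returning the header line
and body), and `indexer` is the proposed generic user-agent from
earlier in the thread:

    from urllib import robotparser

    def allowed(fetch, host, target_url, agent="indexer"):
        # fetch(url, host) is assumed to return (header, body) for a
        # Gemini request.
        header, body = fetch("gemini://%s/robots.txt" % host, host)
        rp = robotparser.RobotFileParser()
        if header.startswith("2"):  # Gemini 2x status codes mean success
            rp.parse(body.decode("utf-8", "replace").splitlines())
        else:
            rp.parse([])  # no robots.txt at all: everything is allowed
        return rp.can_fetch(agent, target_url)

    # e.g. allowed(gemini_fetch, "example.com", "gemini://example.com/logs/")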

Cheers,
Solderpunk

