💾 Archived View for gemi.dev › gemini-mailing-list › 000539.gmi captured on 2023-11-04 at 12:54:57. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

Unicode vs. the World

📧 Messages: 34
🗣️ Authors: 15
📅 First Message: 2020-12-15 13:27
📅 Last Message: 2020-12-24 20:30

Petite Abeille <petite.abeille (a) gmail.com>

📅 Sent: 2020-12-15 13:27
📧 Message 1 of 34

[In the spirit of Scott Pilgrim vs. the World]

There have been a handful of intertwingled* conversations about the topic.


To recap:

2020-12-04 Stephane Bortzmeyer got the ball rolling with "IDN with 
Gemini?": https://lists.orbitalfox.eu/archives/gemini/2020/003788.html
2020-12-08 John Cowan followed with "Three possible uses for IRIs": 
https://lists.orbitalfox.eu/archives/gemini/2020/003873.html
2020-12-09 Jason McBrayer contributed "Some reading on IRIs and IDNs": 
https://lists.orbitalfox.eu/archives/gemini/2020/003923.html

To be charitable, we can also include Alex's self-described "shitpost"
dated 2020-12-15: https://lists.orbitalfox.eu/archives/gemini/2020/004055.html

[2020-12-15T01:47:20.412Z] <nytpu> sending an message to the ML making fun 
of the long-running spec-changing threads. i'll probably regret it, but here goes
[2020-12-15T07:05:14.499Z] <nytpu> i've bitched about it but this is the 
first time i've really addressed the points other than in passing
[2020-12-15T07:05:42.682Z] <nytpu> and even then it's more a shitpost than 
a real rebuttal, don't take it too seriously


So what's the issue making Alex lose his marbles, thin-skin aside?

It boils down to this:

 => gemini://🐰.mozz.us/🐇.gmi "Hoppity hop"

What to do with such a construct? Possible? Not possible? Allowed? Not 
allowed? First class citizen? Afterthought? How to deal with it, if at all? 

Decisions, decisions, decisions.

Technically speaking, while text/gemini is Unicode friendly by default, 
the links are not. The location part must be encoded, following 
idiosyncratic, local customs, perhaps such as:

 => gemini://xn--4o8h.mozz.us/%F0%9F%90%87.gmi "Hoppity hop"

In other words, a bit of punycode + percent encoding + glossing over 
normalization + other niceties. Everything must be US-ASCII clean at the end of the day.
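
A minimal sketch of that conversion, assuming Python's standard library. The helper name `iri_to_uri` and the `bücher.example` host are made up for illustration, userinfo is dropped, and normalization is glossed over here too:

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri: str) -> str:
    """Turn a Unicode link target into a US-ASCII one (hypothetical helper)."""
    parts = urlsplit(iri)
    # Punycode every non-ASCII label of the host; ASCII labels pass through.
    host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    # Percent-encode the UTF-8 bytes of the path and query. "%" stays in the
    # safe set so escapes that are already present are left alone.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, host, path, query, parts.fragment))


print(iri_to_uri("gemini://bücher.example/ü.gmi"))
# gemini://xn--bcher-kva.example/%C3%BC.gmi
```

An all-ASCII link comes out unchanged, so such a step could run on every link.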

Some will make the distinction between "content" vs. "addressing":

[2020-12-15T07:35:09.590Z] <bie> also... this was never about 
internationalized content, but a lot of people like to pretend that it is
[2020-12-15T07:36:40.861Z] <bie> addressing != content

While there is some merit to such hair splitting (as it has to be 
handled at different levels of the stack), it distracts from the crux of the problem:

=> gemini://🐰.mozz.us/🐇.gmi "Hoppity hop"
vs.
=> gemini://xn--4o8h.mozz.us/%F0%9F%90%87.gmi "Hoppity hop"

As it stands, the first variant cannot be handled by gemini -neither in 
text/gemini, nor in the protocol itself- with further technical gotchas 
such as address resolution and what not along the way. 

It must be converted to the second variant, the US-ASCII one.

So, what to do? This is what these various conversations are about. 
Exploring what the scope of the problem is, and what to do about it, if 
anything. So one can eventually reach an informed decision.

For example:

[2020-12-14T22:12:14.914Z] <remyabel> I lurk this channel and the mailing 
lists and keep seeing people trying to extend gemini or make it web-like, 
there's just no point in arguing against it
[2020-12-14T22:12:28.578Z] <CoopDot> I used to be in the US-ASCII only 
camp but now it's more "do the bare mininum to not forbid UTF-8 'URLs' in 
the spec and make strong recommendations in best-practices.gmi"

^Those are the "cannot be arsed" camp: things are the way they are, and 
they cannot be bothered to change anything; technically speaking... we are 
done. The "not-my-problem" camp.


[2020-12-15T07:30:13.193Z] <khuxkm> honestly my issue with the iri thread 
was the whole "we NEED this" and "we MUST do this it's our MORAL DUTY"
[2020-12-15T07:30:52.931Z] <khuxkm> like forcing everybody to use IRIs or 
be non-compliant with the spec is somehow going to solve discrimination

^Those are the... hmmm... oh-so-fragile "entitled" camp.


To summarize: this is a genuine choice for gemini. And not so much a technical issue.


-- 
?????


Tangentially unrelated, as always:

The Internet is for End Users
https://tools.ietf.org/html/rfc8890

Terminology, Power, and Inclusive Language in Internet-Drafts and RFCs
https://tools.ietf.org/id/draft-knodel-terminology-04.html



* https://en.wikipedia.org/wiki/Intertwingularity

Björn Wärmedal <bjorn.warmedal (a) gmail.com>

📅 Sent: 2020-12-15 15:05
📧 Message 2 of 34

Thank you so much for the summary! I lost track of the ML for a few
days and... it was just too much >.<

My contention is this: I want us to support internationalization as
best we can. And as far as I understand it the web has done this with
punycoded domain names and percent encoded paths for years. And after
a few hours of fiddling with my own cli tool gemcall
(https://notabug.org/tinyrabbit/gemcall/src/master/gemcall) it doesn't
look like it's too hard. Check line 35:

parsed = up.urlparse(url).encode("idna")

As far as I can tell gemcall can now handle gemini://[rabbit emoji
that I don't know how to include...].mozz.us/

As for the path that follows, that must be percent encoded in the
gemtext document. There is no way for a client to know if a path is
already percent encoded or not, and percent encoding twice breaks the
link. Consider this:
=> gemini://example.com/why-space-is-%20-in-urls.gmi
We see that this needs to be percent encoded, but a tool can't reliably
tell if it does or if it is already.
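
To make the ambiguity concrete, here is a small sketch using Python's `urllib.parse`. The variable names are only for illustration. The same path gives two different results depending on which reading the tool picks:

```python
from urllib.parse import quote, unquote

link_path = "/why-space-is-%20-in-urls.gmi"

# Reading 1: the path is raw text, so the "%" itself has to be escaped.
as_raw = quote(link_path)
# Reading 2: the path is already encoded, so "%20" stands for a space.
as_encoded = unquote(link_path)

print(as_raw)      # /why-space-is-%2520-in-urls.gmi
print(as_encoded)  # /why-space-is- -in-urls.gmi
```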

Requiring clients to punycode domain names will break existing
clients. Sorry about that, but let's just fix them instead of
complaining about it.

Cheers,
ew0k

Oh! Mandatory rabbit!

 ()_()
 (^.^)
_(| |)_

Alex // nytpu <alex (a) nytpu.com>

📅 Sent: 2020-12-15 15:28
📧 Message 3 of 34

This is the only time I'm going to reply to one of these threads, but I
should actually say what I think: Supporting IDNs and IRIs is something
I can get behind, but it simply doesn't require a spec change. Maybe a
"Gemini Best Practices" change, or even (in the extreme) a companion
spec detailing the basics of punycoding and percent encoding, but that's
it.


Firstly, on IRIs: there is literally no reason not to support them. You
already percent-encode half of ascii anyways, why not just encode the
rest? 100% on board.

IDNs I also agree with, as long as it doesn't get too out of hand with
what you're requiring from people, and as long as at least some
consideration is given to people's more obscure languages that may or
may not have various necessary libraries. I still think it should be
supported, but I'd say to heed the robustness principle: "Be liberal in
what you accept, and conservative in what you send."


I support allowing people to write in unicode in gemtext, including in
link lines, but the client should convert it (transparently or not) for
the server, no spec change required. Look at what Lagrange did for
v0.13! It even displays the unicode URL in the address bar, and deals
with all the conversions so the content authors and users don't even
have to think about it. That's the optimal change for an "advanced"
client I'd say, and a "simple" client could just convert it to
punycode/percent encoded once and display and work with that afterwards
so they don't have to worry about the unicode version internally.
https://gmi.skyjake.fi/lagrange/

The main complaint I have is how long these threads run, and how they
completely overtake the mailing list, drowning out pretty much
everything else. Even if they were arguing about something I
passionately argue for, I'd still make fun of them because they're so
long that they're farcical. They're full of people misreading everything
that's being said (I'm within that group), people that argue about
something else that's vaguely related but not really. The first thread
was about IDNs and people immediately started talking about IRIs
instead! (or maybe vice-versa? I can't keep the two terms straight).


	**


The main reason I wrote this is to clarify that I am firmly against
changing the spec, no matter how noble the causes, for anything other
than clarification or typos. There are lots of possibilities that don't
require a spec change, both in internationalization support and the
other common mailing list complaints that were long like this thread in
the past (usually about gemtext's "weaknesses").

Just go and do your own thing, experiment, chat on the mailing list
about something if it requires client/server support, and maybe people
will support it! You can use their experience to guide you before
actually suggesting any real best practice changes, companion specs,
etc. For instance, Lagrange really changed my opinion. Its change to
support the full suite of Iwhatevers was much simpler than I expected,
which lightened up my opinion, because I was worried it would result in
web-browser levels of complexity: annoying and impossible for an
individual to write. ("TLS is bad enough, but at least all but the most
esoteric languages have a library somewhere.") Lagrange showed it's
actually not that bad; adding it in was no worse than regular URI
parsing.

In summary: just experiment, chat on the ML (preferably on multiple
topics...) and just do cool stuff! It's what gemini is all about, isn't
it?

-- 
Alex // nytpu
alex at nytpu.com
GPG Key: https://www.nytpu.com/files/pubkey.asc
Key fingerprint: 43A5 890C EE85 EA1F 8C88 9492 ECCD C07B 337B 8F5B
https://useplaintext.email/

Côme Chilliet <come (a) chilliet.eu>

📅 Sent: 2020-12-15 15:41
📧 Message 4 of 34

On Tuesday, 15 December 2020, 16:05:14 CET, Björn Wärmedal wrote:
> As for the path that follows, that must be percent encoded in the
> gemtext document. There is no way for a client to know if a path is
> already percent encoded or not, and percent encoding twice breaks the
> link. Consider this:
> => gemini://example.com/why-space-is-%20-in-urls.gmi We see that this
> needs to be percent encoded, but a tool can't reliably tell if it does
> or if it is already.

This is not true. Even when using IRIs, reserved characters such as spaces 
HAVE to be percent-encoded.

So, if you see a "%", it is percent encoding. If you want to link to a 
path containing a percent, you have to percent encode the percent, resulting in %25.

As a result, percent encoding twice does not break the link, as you only 
percent encode what is not percent encoded already.
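
A sketch of that rule, assuming Python's `urllib.parse.quote`. The helper name `encode_once` is hypothetical. Keeping "%" in the safe set means every existing "%" is treated as the start of an escape, which makes the operation idempotent:

```python
from urllib.parse import quote


def encode_once(path: str) -> str:
    # "%" is declared safe, so existing escapes such as %20 are left alone
    # and only characters that still need encoding (here "ü") are touched.
    return quote(path, safe="/%")


once = encode_once("/ü-%20.gmi")
twice = encode_once(once)
print(once)           # /%C3%BC-%20.gmi
print(once == twice)  # True
```

The price is the one stated above: a literal percent sign in a file name has to be written as %25 by the author.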

Côme

Philip Linde <linde.philip (a) gmail.com>

📅 Sent: 2020-12-15 16:30
📧 Message 5 of 34

On Tue, 15 Dec 2020 08:28:00 -0700
Alex // nytpu <alex at nytpu.com> wrote:

> My main complaint I have is how long these threads run, and how they
> completely overtake the mailing list, drowning out pretty much
> everything else. Even if they were arguing about something I
> passionately argue for, I'd still make fun of them because they're so
> long that they're farcical. They're full of people misreading everything
> that's being said (I'm within that group), people that argue about
> something else that's vaguely related but not really. The first thread
> was about IDNs and people immediately started talking about IRIs
> instead! (or maybe vice-versa? I can't keep the two terms straight).

I highly recommend using a client with a clearly threaded overview of
incoming messages. For example, if I am not interested in a particular
thread of discussion, or it has become hard to follow, I can just fold
it away. We should perhaps be better at changing the subject line in
cases where discussion delves into details or strays from the original
topic.

I believe that misreading fills an important function when discussing a
specification; even a fundamentally bad-faith reading is useful. In my
opinion a spec should leave as little room for interpretation as
possible, and misreading exposes these little ambiguities at an earlier
stage where they would otherwise later cause divergent implementations.
It also makes it clear when ideas are more complex than anticipated. I
think we hit an iceberg with IDN/IRI, and we've failed to properly
separate discussion about its rationale from discussion about its
implementation, possibly because Gemini is explicit in that its
rationale also concerns its implementation details (paraphrased: it
should be simple and easy to implement).

> The main reason I wrote this is to clarify that I am firmly against
> changing the spec, no matter how noble the causes, for anything other
> than clarification or typos. There are lots of possibilities that don't
> require a spec change, both in internationalization support and the
> other common mailing list complaints that were long like this thread in
> the past (usually about gemtext's "weaknesses").

I agree with this, at least to the point that changes, if any, should
guarantee forward compatibility with older implementations. Breaking
changes have a high cost and so should be reserved for breaking bugs,
not nice-to-haves or workable practical shortcomings. There are many
features that I think would improve the protocol when thought of as
just features, but would be detrimental when considering the social
burden of formalizing and implementing them. I'll happily trade those
features for a stable and clear spec.

-- 
Philip

Stephane Bortzmeyer <stephane (a) sources.org>

📅 Sent: 2020-12-15 17:23
📧 Message 6 of 34

On Tue, Dec 15, 2020 at 02:27:36PM +0100,
 Petite Abeille <petite.abeille at gmail.com> wrote 
 a message of 122 lines which said:

> To summarize: this is a genuine choice for gemini. And not so much a
> technical issue.

It is quite possible that there is unanimity on these two points. So,
at least, we all agree on something.

Stephane Bortzmeyer <stephane (a) sources.org>

📅 Sent: 2020-12-15 17:26
📧 Message 7 of 34

On Tue, Dec 15, 2020 at 08:28:00AM -0700,
 Alex // nytpu <alex at nytpu.com> wrote 
 a message of 100 lines which said:

> My main complaint I have is how long these threads run, and how they
> completely overtake the mailing list, drowning out pretty much
> everything else.

[I agree with Philip: reading email without a threaded email reader is
a bad idea.]

I'm relatively new here, so I used the mailing list but if there are
other ways to discuss and follow the work on the specification (or in
companion documents, such as the robots.txt standard), I'd be happy to use
them. Issues on a ticket tracker at a Gitlab?

text@sdfeu.org <text (a) sdfeu.org>

📅 Sent: 2020-12-15 18:38
📧 Message 8 of 34

On Tue, 15 Dec 2020 18:26:14 +0100, Stephane Bortzmeyer wrote:

> [I agree with Philip: reading email without a threaded email reader is a
> bad idea.]
> 
> [...] other ways to discuss and follow the work on the specification [...]
> Issues on a ticket tracker at a Gitlab?

Not sure if Usenet newsgroups are still a thing to start these days, but 
I like that news does not clutter personal e-mail accounts.

No idea how and where  comp.protocols.gemini  could be established.

PJ vM <pjvm742 (a) disroot.org>

📅 Sent: 2020-12-15 19:11
📧 Message 9 of 34

On 12/15/20 4:41 PM, Côme Chilliet wrote:
> So, if you see a "%", it is percent encoding. If you want to link to 
> a path containing a percent, you have to percent encode the percent, 
> resulting in %25.
> 
> As a result, percent encoding twice does not break the link, as you 
> only percent encode what is not percent encoded already.

So that would define a special percent-encoding for clients, where
they'd encode everything except percent signs, right? So in this link:
=>gemini://example.com/????-why-space-is-%20-in-urls.gmi
, a client would have to percent-encode the emojis, but leave the "%20"
bit alone? This seems very confusing; it's also not one-to-one (encoding
then decoding "%20" gives " " back)... And if you just skip
percent-encoding when the only "encodable" characters in the path are
percent signs, that's confusing too. That rule also doesn't work on
=>gemini://example.com/%XY.gmi

Also, if an author wants to link to "why-space-is-%20-in-urls.gmi" at
example.com, the only option would be to write
=>gemini://example.com/why-space-is-%2520-in-urls.gmi
This introduces a pitfall for authors: they never have to think about
percent-encoding, *except* when there are percent signs in the path.

How is this better than agreeing that link paths in gemtext are always
completely percent-encoded? In that case, clients can percent-decode the
path and display that. Authors could use a tool that 'fully' (as in, it
also turns every "%" into "%25") percent-encodes a link for them.

Counterintuitively, in this way I think mandating completely
percent-encoded paths in gemtext link lines might actually result in
easier linking for authors.
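
As a rough illustration of such an author-side tool and the matching client-side display step, assuming Python's `urllib.parse`. The two helper names are hypothetical. Because every "%" becomes "%25", decoding always gives back exactly what the author started with:

```python
from urllib.parse import quote, unquote


def encode_filename(name: str) -> str:
    # safe="" so that "/", "?" and "%" inside a file name are escaped too.
    return quote(name, safe="")


def display_name(encoded: str) -> str:
    # What a client could show to the reader.
    return unquote(encoded)


name = "why-space-is-%20-in-urls.gmi"
encoded = encode_filename(name)
print(encoded)                        # why-space-is-%2520-in-urls.gmi
print(display_name(encoded) == name)  # True
```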

The same (clients may/should display, authors use tool) could be done
with internationalised domain names (could be the same tool that does
the percent-encoding), but crucially there is no ambiguity there,
because an ascii domain name with "xn--" is unrepresentable in punycode
and disallowed (I think). On the other hand, allowing anything
whatsoever in the domain name and nothing in the path would be strange
and a bit inconsistent.

Assuming we don't do IRI paths in gemtext link lines, I don't really
have a clear opinion regarding IDNs, the choice is between:

* all clients need to convert to punycode when following a link, authors
can easily link to IDNs without a tool (though they're already using a
tool for unicode paths), somewhat inconsistent/strange
* fancy clients will convert from punycode when displaying a link,
authors need a tool to be able to easily make links to IDNs (though
they're already using a tool for unicode paths)

--
pjvm

Côme Chilliet <come (a) chilliet.eu>

📅 Sent: 2020-12-15 20:00
📧 Message 10 of 34

On Tuesday, 15 December 2020, 20:11:12 CET, PJ vM wrote:
> So that would define a special percent-encoding for clients, where
> they'd encode everything except percent signs, right? So in this link:
> =>gemini://example.com/????-why-space-is-%20-in-urls.gmi
> , a client would have to percent-encode the emojis, but leave the "%20"
> bit alone? This seems very confusing; it's also not one-to-one (encoding
> then decoding "%20" gives " " back)... And if you just skip
> percent-encoding when the only "encodable" characters in the path are
> percent signs, that's confusing too. That rule also doesn't work on
> =>gemini://example.com/%XY.gmi

Because this is not a valid link, neither URI nor IRI.
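
A client that wants to reject such links early could check that every "%" is followed by two hexadecimal digits. This is only a sketch with a hypothetical helper name, not a full URI validator:

```python
import re

# A "%" that is not followed by exactly two hex digits is a malformed escape.
_BAD_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})")


def has_valid_escapes(reference: str) -> bool:
    return _BAD_ESCAPE.search(reference) is None


print(has_valid_escapes("gemini://example.com/%XY.gmi"))  # False
print(has_valid_escapes("gemini://example.com/%20.gmi"))  # True
```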

> Also, if an author wants to link to "why-space-is-%20-in-urls.gmi" at
> example.com, the only option would be to write
> =>gemini://example.com/why-space-is-%2520-in-urls.gmi
> This introduces a pitfall for authors: they never have to think about
> percent-encoding, *except* when there are percent signs in the path.

Yes, and spaces, and delimiter characters, such as "/".

> How is this better than agreeing that link paths in gemtext are always
> completely percent-encoded? In that case, clients can percent-decode the
> path and display that. Authors could use a tool that 'fully' (as in, it
> also turns every "%" into "%25") percent-encodes a link for them.

Because a completely percent encoded link is hell to read and to write, for instance:
gemini://gemini.circumlunar.space/%64%6f%63%73/%66%61%71%2e%67%6d%69
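
For reference, a short Python sketch that produces and decodes such a link. The helper `fully_encode` is made up for illustration, and it encodes each path segment separately so that "/" stays a delimiter:

```python
from urllib.parse import unquote


def fully_encode(segment: str) -> str:
    # Escape every single byte, including plain ASCII letters.
    return "".join(f"%{byte:02x}" for byte in segment.encode("utf-8"))


path = "/".join(fully_encode(segment) for segment in "docs/faq.gmi".split("/"))
link = "gemini://gemini.circumlunar.space/" + path
print(link)
print(unquote(link))  # gemini://gemini.circumlunar.space/docs/faq.gmi
```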

So I think you do not mean "completely percent-encoded", you mean percent 
encode non-ascii non-reserved text, and you feel like this is better 
because you are used to English and ascii.
But you will always need to remember which chars you need to percent 
encode. You will never be able to use "/" in a file name without percent 
encoding. Or "?".

> Counterintuitively, in this way I think mandating completely
> percent-encoded paths in gemtext link lines might actually result in
> easier linking for authors.

No, it is just a different set of characters to percent encode.

> The same (clients may/should display, authors use tool) could be done
> with internationalised domain names (could be the same tool that does
> the percent-encoding), but crucially there is no ambiguity there,
> because an ascii domain name with "xn--" is unrepresentable in punycode
> and disallowed (I think). On the other hand, allowing anything
> whatsoever in the domain name and nothing in the path would be strange
> and a bit inconsistent.

Yes, IDN are covered by punycode, but the question remains whether I am 
allowed to use the unicode form in a link line.
=> gemini://gémeaux.example.com Is that legal?
 
> Assuming we don't do IRI paths in gemtext link lines, I don't really
> have an clear opinion regarding IDNs, the choice is between:
> * all clients need to convert to punycode when following a link, authors
> can easily link to IDNs without a tool (though they're already using a
> tool for unicode paths), somewhat inconsistent/strange
> * fancy clients will convert from punycode when displaying a link,
> authors need a tool to be able to easily make links to IDNs (though
> they're already using a tool for unicode paths)

Yes.
I am for IDN in link lines, but I am also in favor of IRI in link lines.
And I would be supportive of using IRI in request line also for that 
matter. And redirect responses.

Côme

Sean Conner <sean (a) conman.org>

Subject Changed! New Subject: On mailing lists (was Re: Unicode vs. the World)
📅 Sent: 2020-12-15 22:42
📧 Message 11 of 34

It was thus said that the Great Alex // nytpu once stated:
> 
> My main complaint I have is how long these threads run, and how they
> completely overtake the mailing list, drowning out pretty much
> everything else. 

  You haven't been here long, have you?  Because for *months* this list
talked almost exclusively about text/gemini.  Just check the threaded
archives [1] and look upon the threads, ye mighty, and despair!  Take
special note how long the thread "Text reflow woes" goes (from 2019 well
into 2020).

  It's also worth noting that different people have different expectations
as to volume of email.  I've been on lists where people freak out if they
get more than 1 email per day.  Personally, I consider the volume of this
list to be low-to-mid levels of volume.  One list I was on (it is no longer
around) would typically get around double digits of email per day, and on
one memorable day, hit 500 messages (yes, 500 emails in a single day---that
set my expectations on what a "high-volume mailing list" is).

  -spc

[1]	https://lists.orbitalfox.eu/archives/gemini/2019/thread.html
	https://lists.orbitalfox.eu/archives/gemini/2020/thread.html

Björn Wärmedal <bjorn.warmedal (a) gmail.com>

Subject Changed! New Subject: Unicode vs. the World
📅 Sent: 2020-12-16 07:59
📧 Message 12 of 34

> How is this better than agreeing that link paths in gemtext are always
> completely percent-encoded? In that case, clients can percent-decode the
> path and display that. Authors could use a tool that 'fully' (as in, it
> also turns every "%" into "%25") percent-encodes a link for them.
>
> Counterintuitively, in this way I think mandating completely
> percent-encoded paths in gemtext link lines might actually result in
> easier linking for authors.

This is -- as I read it -- what the spec requires now. I think that's
the best solution. The wording in the spec can (and maybe should) be
clarified, though.

> * all clients need to convert to punycode when following a link, authors
> can easily link to IDNs without a tool (though they're already using a
> tool for unicode paths), somewhat inconsistent/strange
> * fancy clients will convert from punycode when displaying a link,
> authors need a tool to be able to easily make links to IDNs (though
> they're already using a tool for unicode paths)

I think that all clients *should* convert links to punycode. If they
did authors could write punycoded or unicode domains in their links
and both would work.

Right now authors can't expect clients to punycode for them, so the
safest recourse is to punycode links yourself before publishing.

Note that none of this requires a spec change (except for maybe
clarifying the percent encoding of links in gemtext). I think it's
fair to assume that IDNs will just work, and if they don't work in a
browser/client we can report that as a bug (or send a PR that fixes
it). After all IDNs have existed for some years, and URL libs across
languages are very likely to support it.

Cheers,
ew0k

??

Jason McBrayer <jmcbray (a) carcosa.net>

📅 Sent: 2020-12-16 14:13
📧 Message 13 of 34

Björn Wärmedal <bjorn.warmedal at gmail.com> writes:

>> How is this better than agreeing that link paths in gemtext are always
>> completely percent-encoded? In that case, clients can percent-decode the
>> path and display that. Authors could use a tool that 'fully' (as in, it
>> also turns every "%" into "%25") percent-encodes a link for them.
>>
>> Counterintuitively, in this way I think mandating completely
>> percent-encoded paths in gemtext link lines might actually result in
>> easier linking for authors.
>
> This is -- as I read it -- what the spec requires now. I think that's
> the best solution. The wording in the spec can (and maybe should) be
> clarified, though.

I don't think this is going to be acceptable for authors. It's
unreasonable to ask authors to use a tool other than their favorite text
editor to write gemtext. Why is it reasonable for the client to have to
punycode the domain (an uncommon encoding for which not every common
language has a library), but unreasonable for it to have to urlencode
the path (a common encoding for which libraries are ubiquitous)? Why is
it so hard to convince people to just do the right thing?

???

-- 
+-----------------------------------------------------------+
| Jason F. McBrayer                    jmcbray at carcosa.net  |
| A flower falls, even though we love it; and a weed grows, |
| even though we do not love it.            -- Dogen        |

PJ vM <pjvm742 (a) disroot.org>

📅 Sent: 2020-12-16 14:22
📧 Message 14 of 34

On 12/15/20 9:00 PM, Côme Chilliet wrote:
>> =>gemini://example.com/%XY.gmi
> Because this is not a valid link, neither URI nor IRI.

My thinking was that it is a valid link after percent-encoding.

But OK, so the client would percent-encode exactly those characters that
are neither reserved nor in ascii. That would indeed be unambiguous. It
would not be one-to-one with percent-decoding, though, which is
unavoidable with this approach to IRIs.

> a completely percent encoded link ...

Yes, that was a misuse of the word "completely" on my part

> you feel like this is better because you are use to english and
> ascii.

That is a failed attempt at mind-reading.

> But you will always need to remember which chars you need to percent
> encode. You will never be able to use "/" in a file name without
> percent encoding. Or "?".

Yes, when someone wants to link to a resource with "?", "/" or "#" in
the filename, that will basically always require manual intervention.

One error in my previous email was that of course, you can also use a
tool to percent-encode just spaces and percent signs for you. There's
not much difference in what the author has to think about, then.

Still, with both IRI paths and IDNs, I'm not really seeing the "added
value" of having them in the spec. I'm quite sure they will be there
either way: if it doesn't get into the spec, it is still possible for
clients to provide the same experience with (seemingly) about the same
amount of programming effort - and it seems plenty of client authors
would -, and authors would not be much worse off if they use a tool.

Meanwhile, the negatives are rather visible to me: they're breaking
changes, they increase the complexity that a client *must* have.

-- 
pjvm

Björn Wärmedal <bjorn.warmedal (a) gmail.com>

📅 Sent: 2020-12-16 19:39
📧 Message 15 of 34


> I don't think this is going to be acceptable for authors.

Maybe not. I don't really know.

> It's
> unreasonable to ask authors to use a tool other than their favorite text
> editor to write gemtext.

Is it? Unreasonable is a strong word here.

I assume there would be some servers out there that would do this on the 
fly when serving gemtext, but I can't know that for sure. There could also 
be a CLI tool you can run on your file that fixes links. Or some other solution.

> Why is it reasonable for the client to have to
> punycode the domain (an uncommon encoding for which not every common
> language has a library),

I made the assumption that most languages dealing in stuff like URLs would 
have support for it. I may be in the wrong there. I also made the 
assumption that punycoding was common, but I may be in the wrong there 
too. Which method *is* common?

> but unreasonable for it to have to urlencode
> the path (a common encoding for which libraries are ubiquitous)?

Because -- as I tried to point out -- there is no reasonably simple 
heuristic for determining whether a URL is already percent encoded or not. 
And percent encoding a URL that is already percent encoded exchanges all % 
characters with %25. Attempting to punycode a domain name that is already 
punycoded, however, changes nothing at all. No heuristics are needed, the 
client can just punycode everything.
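
The idempotence claim can be checked with Python's built-in `idna` codec (which implements the older IDNA 2003 rules). The host name below is made up for illustration:

```python
host = "bücher.example"

# First pass: the non-ASCII label is converted to its "xn--" form.
once = host.encode("idna")
# Second pass over the already punycoded name: all-ASCII labels pass through.
twice = once.decode("ascii").encode("idna")

print(once)           # b'xn--bcher-kva.example'
print(once == twice)  # True
```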

> Why is
> it so hard to convince people to just do the right thing?

Why are you so adamantly convinced that *you* are arguing for "the right 
thing"? Is there an objective measurement here that you may share with me?

????

John Cowan <cowan (a) ccil.org>

📅 Sent: 2020-12-16 20:49
📧 Message 16 of 34

On Tue, Dec 15, 2020 at 2:11 PM PJ vM <pjvm742 at disroot.org> wrote:

> This introduces a pitfall for authors: they never have to think about
> percent-encoding, *except* when there are percent signs in the path.
>

Or spaces, because in a link line a space terminates the URI.  So if the
author wants to link to "gemini://example.com/foo bar", the author *must*
write gemini://example.com/foo%20bar.  In principle you have the same
problem with wanting line endings in a URI, though they are much less
likely to be an issue.
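
A minimal sketch of how a client might split a link line shows why. The function name `parse_link_line` is hypothetical and this is not taken from any particular client:

```python
def parse_link_line(line: str):
    """Split a gemtext link line into (url, label), or None if it is not one."""
    if not line.startswith("=>"):
        return None
    # The URL is everything up to the first run of whitespace.
    parts = line[2:].split(maxsplit=1)
    if not parts:
        return None
    url = parts[0]
    label = parts[1].strip() if len(parts) > 1 else ""
    return url, label


print(parse_link_line("=> gemini://example.com/foo bar My page"))
# ('gemini://example.com/foo', 'bar My page')
print(parse_link_line("=> gemini://example.com/foo%20bar My page"))
# ('gemini://example.com/foo%20bar', 'My page')
```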

All this boils down to this question:  "Who should pay the price for i18n in
links, clients or authors?"  Any third alternative is hacky at best (there
is typically no library routine for "encode everything that needs to be
encoded except in the sequences %20 and %25") and broken at worst.

> How is this better than agreeing that link paths in gemtext are always
> completely percent-encoded?

I don't understand.  Do you mean "link paths are already %-encoded when you
get them" (status quo) or "link paths must be %-encoded when you get them"
(IRIs in link lines)?

On Tue, Dec 15, 2020 at 3:00 PM Côme Chilliet <come at chilliet.eu> wrote:

> Because a completely percent encoded link is hell to read and to write, for
> instance:
> gemini://gemini.circumlunar.space/%64%6f%63%73/%66%61%71%2e%67%6d%69

+1

> Yes.
> I am for IDN in link lines, but I am also in favor of IRI in link lines.
>

+1

> And I would be supportive of using IRI in request line also for that
> matter. And redirect responses.
>

-1.  That's a change to the protocol, and only protocol agents (clients,
servers) should see such lines; it doesn't matter how ugly they are.  So
"machines speak URIs, humans speak IRIs".

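One way to read "machines speak URIs, humans speak IRIs" is that a client keeps the ASCII form on the wire and only converts for display. A rough sketch, assuming Python's standard library; the helper name and the example host are made up, and the result is for showing to a person, not for sending:

```python
from urllib.parse import unquote, urlsplit, urlunsplit


def uri_for_display(uri: str) -> str:
    parts = urlsplit(uri)
    # Turn "xn--" labels back into Unicode for the reader.
    host = parts.hostname.encode("ascii").decode("idna") if parts.hostname else ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    # Decode percent escapes in the path; the request itself keeps the URI.
    return urlunsplit((parts.scheme, host, unquote(parts.path),
                       parts.query, parts.fragment))


print(uri_for_display("gemini://xn--bcher-kva.example/%C3%BC.gmi"))
# gemini://bücher.example/ü.gmi
```
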
John Cowan          http://vrici.lojban.org/~cowan        cowan at ccil.org
Uneasy lies the head that wears the Editor's hat! --Eddie Foirbeis Climo

ew.gemini <ew.gemini (a) nassur.net>

📅 Sent: 2020-12-16 21:18
📧 Message 17 of 34


Hello,

Jason McBrayer writes:

> It's unreasonable to ask authors to use a tool other than
> their favorite text editor to write gemtext.

Yepp. 1+

Cheers,
Erich

-- 
Keep it simple!


Sean Conner <sean (a) conman.org>

📅 Sent: 2020-12-16 23:41
📧 Message 18 of 34

It was thus said that the Great Björn Wärmedal once stated:
> 
> > but unreasonable for it to have to urlencode the path (a common encoding
> > for which libraries are ubiquitous)?
> 
> Because -- as I tried to point out -- there is no reasonably simple
> heuristic for determining whether a URL is already percent encoded or not.
> And percent encoding a URL that is already percent encoded exchanges all %
> characters with %25. Attempting to punycode a domain name that is already
> punycoded, however, changes nothing at all. No heuristics are needed, the
> client can just punycode everything.
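
The asymmetry described in the quoted text can be checked with Python's standard library. This is only a sketch: the sample path and host are made up, and Python's built-in `idna` codec implements the older IDNA 2003 rules, so it stands in for whatever IDNA library a real client would use.

```python
from urllib.parse import quote

path = "/bunny & carrot.gmi"
once = quote(path)    # space and & are percent-encoded
twice = quote(once)   # every % from the first pass becomes %25

host = "bücher.example"
a_once = host.encode("idna").decode("ascii")
a_twice = a_once.encode("idna").decode("ascii")  # already ASCII: unchanged

print(once, twice)
print(a_once, a_twice)
```

Percent-encoding is not idempotent, while converting an already converted host name is a no-op.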

  I can't say for certain what most clients do, but I'm under the impression
that some (the majority?) use some existing library to parse links.  The
specification states that relative links are allowed in text/gemini:

=> ../%F0%9D%92%BB%F0%9D%92%B6%F0%9D%93%83%F0%9D%92%B8%F0%9D%93%8E.txt Some 𝒻𝒶𝓃𝒸𝓎 stuff here

but a full URI needs to be sent to the server, so some processing of the
link is required (specifically, section 5.2 of RFC-3986).  And existing
libraries help here.  The library I'm currently using will parse the above
link into the following structure:

	{
	  path = "../𝒻𝒶𝓃𝒸𝓎.txt"
	}

  Note how the text has been translated and any percent encoding has been
decoded.  Next, the base URL of the page:

	gemini://example.com/files/others/

has previously been parsed (because it was needed to retrieve the page currently
being viewed):

	{
	  path = "/files/others/",
	  port = 1965.000000,
	  host = "example.com",
	  scheme = "gemini",
	}

  The two are then merged into a single reference:

	{
	  path = "/files/𝒻𝒶𝓃𝒸𝓎.txt",
	  port = 1965.000000,
	  host = "example.com",
	  scheme = "gemini",
	}

  Then, to make a request, this new link is converted into a URI:

	gemini://example.com/files/%F0%9D%92%BB%F0%9D%92%B6%F0%9D%93%83%F0%9D%92%B8%F0%9D%93%8E.txt

  As you can see, that process has re-encoded the path, percent-encoding it.
I would expect that some (the majority?) of clients are doing something
similar to this---doing a conversion from percent-encoding, merging
references, then converting to percent-encoding (except for the host, which
needs to be converted to punycode).

  It would be instructive to know how clients are handling this---do they
decode percent-encoded data, merge the base link to the relative link and
re-encode?  Or something different?
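
For comparison, the decode, merge, re-encode sequence described above could look like this in Python. It is a sketch, not the code of any existing client: `resolve` is a hypothetical helper, and registering "gemini" in `urllib.parse.uses_relative` and `uses_netloc` is a workaround, since `urljoin` only resolves relative references for schemes it knows.

```python
import urllib.parse as up

# Assumption: teach urllib about the gemini scheme so that urljoin
# resolves relative references for it.
for registry in (up.uses_relative, up.uses_netloc):
    if "gemini" not in registry:
        registry.append("gemini")


def resolve(base, reference):
    """Resolve reference against base, then decode and re-encode the path."""
    merged = up.urlsplit(up.urljoin(base, reference))
    # Decode first so an already encoded path is not encoded twice.
    path = up.quote(up.unquote(merged.path))
    return up.urlunsplit(
        (merged.scheme, merged.netloc, path, merged.query, merged.fragment)
    )


base = "gemini://example.com/files/others/"
link = "../%F0%9D%92%BB%F0%9D%92%B6%F0%9D%93%83%F0%9D%92%B8%F0%9D%93%8E.txt"
print(resolve(base, link))
```

The same function also accepts the unencoded form "../𝒻𝒶𝓃𝒸𝓎.txt" and produces the same request URI.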

  -spc


colecmac@protonmail.com <colecmac (a) protonmail.com>

📅 Sent: 2020-12-16 23:57
📧 Message 19 of 34

> It would be instructive to know how clients are handling this---do they
> decode percent-encoded data, merge the base link to the relative link and
> re-encode? Or something different?
>
> -spc


My clients (gemget, Amfora) are in Go, so I just `Parse` both the base link
and the relative link, and then use `base.ResolveReference(rel)`. This means
I don't have to do any decoding or anything at all.

URL.Path and URL.RawPath can be used to get the decoded and encoded path
respectively, although I have no need in this context.

https://golang.org/pkg/net/url/#URL
https://golang.org/pkg/net/url/#Parse
https://golang.org/pkg/net/url/#URL.ResolveReference


makeworld


Björn Wärmedal <bjorn.warmedal (a) gmail.com>

📅 Sent: 2020-12-17 07:39
📧 Message 20 of 34

How does a client handle a link like the following:
=> essays/why-spaces-are-%20-in-URLs.gmi

The assumption here is that the author has not percent encoded
themselves -- this is the actual filename, %20 and all.

How can the client tell if it's percent encoded or not? If you start
by decoding it you distort the filename. If you just assume it isn't
percent encoded and go ahead and do that you will handle this link
correctly but break any links that are already percent encoded. I've
only done this in python, using the urllib.parse library. I can tell
that to encode or decode, but it will do what I tell it to without
exception. It's up to me to build logic that avoids breaking the edge
cases.
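
The ambiguity can be reproduced with `urllib.parse`. In this small sketch the two variable names are only labels for the two possible readings of the same link text; nothing in the string tells the client which reading the author intended.

```python
from urllib.parse import quote, unquote

link = "essays/why-spaces-are-%20-in-URLs.gmi"

# Reading 1: the author already percent-encoded; %20 stands for a space.
as_encoded = unquote(link)
# Reading 2: the text is a literal filename; the % itself must be encoded.
as_literal = quote(link)

print(as_encoded)
print(as_literal)
```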

We can decide to *always* percent encode links in gemtext (as the spec
states now) or to *never* do it, but I don't see how we can reasonably
have both. And never doing it means we can never link to a file with
spaces in the URL, and will have to percent decode anything we copy
paste from web browser's address bar. There will be extra work for
authors either way.

Consider another hypothetical case:
=> teddybearoftheyear.com/vote?ew0k%20The%20Great Vote for me!

How would you solve that?

However much I *want* to have IRIs and IDNs in gemtext and leave the
work to clients and servers, I don't have a solution for that as an
implementer.

Cheers,
ew0k


Sean Conner <sean (a) conman.org>

📅 Sent: 2020-12-17 09:48
📧 Message 21 of 34

It was thus said that the Great Björn Wärmedal once stated:
> How does a client handle a link like the following:
> => essays/why-spaces-are-%20-in-URLs.gmi
> 
> The assumption here is that the author has not percent encoded
> themselves -- this is the actual filename, %20 and all.
> 
> How can the client tell if it's percent encoded or not? If you start
> by decoding it you distort the filename. If you just assume it isn't
> percent encoded and go ahead and do that you will handle this link
> correctly but break any links that are already percent encoded. I've
> only done this in python, using the urllib.parse library. I can tell
> that to encode or decode, but it will do what I tell it to without
> exception. It's up to me to build logic that avoids breaking the edge
> cases.
> 
> We can decide to *always* percent encode links in gemtext (as the spec
> states now) or to *never* do it, but I don't see how we can reasonably
> have both. And never doing it means we can never link to a file with
> spaces in the URL, and will have to percent decode anything we copy
> paste from web browser's address bar. There will be extra work for
> authors either way.
> 
> Consider another hypothetical case:
> => teddybearoftheyear.com/vote?ew0k%20The%20Great Vote for me!
> 
> How would you solve that?
> 
> However much I *want* to have IRIs and IDNs in gemtext and leave the
> work to clients and servers, I don't have a solution for that as an
> implementer.

  I don't have a solution either, and while trying to nail down every
possible corner case is admirable, sometimes, you just have to say, "don't
do that!" (or in other words, document or warn about the corner case).

  It's already the case on Unix systems where a file name can technically
have any character other than '/' (because it's the path separator) and NUL
(marks the end of the string), but I doubt you'll find any filenames with
control characters [1] or even "problematic characters because of the shell"
like "&", "?", or "*" in them.  People just kind of learn what they can and
can't use for filenames over time.  In fact, that might be an interesting
thing for Lupa [2] or GUS to report on---characters found in filenames [3].

  I'm not sure how apropos this is, but years ago, when I was at university
studying Computer Science, I was writing a program (for a friend, not course
related) where I wanted to log errors so they would later be seen (as the
program would run unattended, and any messages to the display would not be
seen).  I could log to a file, but the disk could fill up.  Okay, if that
happened, I could log to the printer, but there might not be a printer (or
it could be turned off---this back when printers were hooked directly to a
computer).  I asked one of my instructors (who worked at IBM, and was on the
team for one of the first Fortran compilers for IBM) what I should do.  His
advice was (and as sad as this is, it's pretty true), if you don't know how
to handle an error, don't bother looking for it.

  -spc

> Cheers,
> ew0k

[1]	Unless it's for pranking someone, not that I would know that.

[2]	Stéphane's new research crawler for Gemini.

[3]	This reminds me, I have a new feature on my own server that allows
	one to dive into a ZIP file:

		gemini://gemini.conman.org/test/UCSD-Pascal-source.zip/

		vs.

		gemini://gemini.conman.org/test/UCSD-Pascal-source.zip

	Right now it's not much of an issue since the filenames for the
	"proof-of-concept" file are just plain ASCII, but in the general
	case, I suppose I should support conversion of filenames to UTF-8,
	but that might be a hard case as well, as character encodings aren't
	readily recorded in ZIP files.


Sean Conner <sean (a) conman.org>

📅 Sent: 2020-12-17 09:59
📧 Message 22 of 34

It was thus said that the Great Björn Wärmedal once stated:
> How does a client handle a link like the following:
> => essays/why-spaces-are-%20-in-URLs.gmi
> 
> The assumption here is that the author has not percent encoded
> themselves -- this is the actual filename, %20 and all.

  And speaking of this, test #31 of the Gemini Client Torture Test [1] has
this exact case---the link contains characters that should be encoded but
aren't.  It's been interesting to see which clients get an error, and which
ones encode the bad characters.  And for this test, there is no right
answer---it's there to inform implementors that you'll encounter wrong stuff
all the time, and you better be prepared to do *something* [2].

  -spc

[1]	gemini://gemini.conman.org/test/torture/0031

[2]	Notwithstanding the advice I presented in my previous reply to
	this.  Sometimes, crashing *is* a valid response to some unknown
	state, but it really depends upon the context of the program [3].

[3]	I can expand on this if anyone cares.


Jason McBrayer <jmcbray (a) carcosa.net>

📅 Sent: 2020-12-17 13:31
📧 Message 23 of 34

Björn Wärmedal <bjorn.warmedal at gmail.com> writes:

> Because -- as I tried to point out -- there is no reasonably simple
> heuristic for determining whether a URL is already percent encoded or
> not. And percent encoding a URL that is already percent encoded
> exchanges all % characters with %25.

It's not that hard. All you have to do is percent decode the path *first*,
then percent encode it. Consider this URL, which is a worst-case for
what you're talking about:

gemini://example.com/🐇%20🥕.gmi

Unquoting the path gives you 'gemini://example.com/🐇 🥕.gmi', of
course. And then quoting it gives you 

'gemini://example.com/%F0%9F%90%87%20%F0%9F%A5%95.gmi'

which decodes correctly.

Unquoting a path that is already plain ASCII does nothing to it.
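
A short Python sketch of this decode-then-encode approach, using `urllib.parse`; the helper name `normalise_path` is made up for the example. Note that it treats every %XX in the input as an escape, so a filename containing a literal "%20" would still be read as a space.

```python
from urllib.parse import quote, unquote


def normalise_path(path):
    """Percent-decode first, then percent-encode the result."""
    return quote(unquote(path))


mixed = "/🐇%20🥕.gmi"
print(normalise_path(mixed))
# Running it again on its own output changes nothing.
print(normalise_path(normalise_path(mixed)))
# A plain ASCII path passes through untouched.
print(normalise_path("/docs/faq.gmi"))
```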

-- 
Jason McBrayer      | “Strange is the night where black stars rise,
jmcbray at carcosa.net | and strange moons circle through the skies,
                    | but stranger still is lost Carcosa.”
                    | -- Robert W. Chambers, The King in Yellow


Jason McBrayer <jmcbray (a) carcosa.net>

📅 Sent: 2020-12-17 13:55
📧 Message 24 of 34

Björn Wärmedal <bjorn.warmedal at gmail.com> writes:

> How does a client handle a link like the following:
> => essays/why-spaces-are-%20-in-URLs.gmi
>
> The assumption here is that the author has not percent encoded
> themselves -- this is the actual filename, %20 and all.

This doesn't work in HTML/HTTP, either.

Go to https://jfm.carcosa.net/testme.html, look at the source, see what
happens with each link. The web server is Apache.

The upshot is that to include %, or any other reserved character in the
link, you do need to pre-encode it in your source. That's obvious for
' ', because of the syntax of links in gemtext. But it's also true of %,
etc. 



John Cowan <cowan (a) ccil.org>

📅 Sent: 2020-12-17 22:47
📧 Message 25 of 34

On Thu, Dec 17, 2020 at 2:39 AM Björn Wärmedal <bjorn.warmedal at gmail.com>
wrote:

> How can the client tell if it's percent encoded or not? If you start
> by decoding it you distort the filename. If you just assume it isn't
> percent encoded and go ahead and do that you will handle this link
> correctly but break any links that are already percent encoded.

Exactly.  To make things worse, space is a protocol element in link lines
and *can't* be left unencoded by the author, whichever way we choose.

> We can decide to *always* percent encode links in gemtext (as the spec
> states now) or to *never* do it, but I don't see how we can reasonably
> have both.

I agree.  But what we can have (and it's messy, but not as messy as the
alternatives) is "authors encode percent and space" and "clients encode all
other reserved and non-ASCII characters."

> Consider another hypothetical case:
>
> => teddybearoftheyear.com/vote?ew0k%20The%20Great Vote for me!
>

That's the best you can do.  But in the case where the link line is

=> teddybearoftheyear.com/vote?Иван%20Грозный Голосуй за меня! [1] [2] [3]

then the client must translate it for sending over the wire into

gemini://teddybearoftheyear.com/vote?%D0%98%D0%B2%D0%B0%D0%BD%20%D0%93%D1%80%D0%BE%D0%B7%D0%BD%D1%8B%D0%B9

because making the author type all that is wholly abominable.  Online
URL-encoders are not that helpful, because they give you + instead of %20.
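
The "+ instead of %20" behaviour corresponds to form encoding. In Python's standard library the two flavours are separate functions, as this small sketch shows (the variable names are arbitrary):

```python
from urllib.parse import quote, quote_plus

name = "Иван Грозный"

uri_style = quote(name)        # space becomes %20
form_style = quote_plus(name)  # space becomes +

print(uri_style)
print(form_style)
```

Only the first form matches the URI shown above.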

[1] This is Ivan the Terrible, who for most of his life was actually a
quite effective tsar despite his (occupational) paranoia and a serious
outbreak of madness just before he died; a better translation would be
"Ivan the Formidable".  Still, nobody would call him a teddy bear (and so
his ukase "Vote for me!" would probably be in vain).

[2] The latest spec change makes this line incorrect unless
"teddybearoftheyear.com/vote" is to be interpreted as a relative path.  It
needs to be prefixed by "gemini://" or at the very least "//".

[3] If the space had not been %-encoded by the author, the Tsar's second
name would be part of the link name and not part of the IRI.

John Cowan          http://vrici.lojban.org/~cowan        cowan at ccil.org
'My young friend, if you do not now, immediately and instantly, pull
as hard as ever you can, it is my opinion that your acquaintance in the
large-pattern leather ulster' (and by this he meant the Crocodile) 'will
jerk you into yonder limpid stream before you can say Jack Robinson.'
        --the Bi-Coloured-Python-Rock-Snake


Katarina Eriksson <gmym (a) coopdot.com>

📅 Sent: 2020-12-18 06:13
📧 Message 26 of 34

Help! I'm getting pulled in!

On IRC, I wrote:

[2020-12-14T21:26:44Z] <CoopDot> I'm staying out of debating IDN/IRI on the
ML. What I've had to say has already been said more than once. My position
has even shifted a bit since the threads started

Some discussion happened and then I wrote something that got quoted here:


Petite Abeille <petite.abeille at gmail.com> wrote:

For example:

[2020-12-14T22:12:14.914Z] <remyabel> I lurk this channel and the mailing
lists and keep seeing people trying to extend gemini or make it web-like,
there's just no point in arguing against it
[2020-12-14T22:12:28.578Z] <CoopDot> I used to be in the US-ASCII only camp
but now it's more "do the bare minimum to not forbid UTF-8 'URLs' in the
spec and make strong recommendations in best-practices.gmi"

^Those are the "cannot be arsed" camp: things are the way they are, and
cannot be bothered to change anything, technically speaking... we are
done. The "not-my-problem" camp.


I'm assuming including me here was intentional. I truly can't tell if that
is an accurate description of my position.

"I used to be in the US-ASCII only camp" refers to me no longer thinking
requiring everything to be encoded to pass as US-ASCII is the best idea.
This is me moving away from the status quo towards a possible compromise.
Or am I missing where we're going?

"Do [...] not forbid UTF-8 'URLs' in the spec". Not forbidding is almost
like allowing. We should attempt to not paint ourselves into a corner or
bet on the wrong horse.

"Make strong recommendations in best-practices.gmi" because we have to
address it somewhere.

Earlier in the same email, Petite Abeille <petite.abeille at gmail.com> wrote:

It boils down to this:

 => gemini://🐰.mozz.us/🐇.gmi ?Hoppity hop?

What do do with such a construct? Possible? Not possible? Allowed? Not
allowed? First class citizen? Afterthought? How do deal with it, if at all?


[...]

=> gemini://🐰.mozz.us/🐇.gmi ?Hoppity hop?
vs.
=> gemini://xn--4o8h.mozz.us/%F0%9F%90%87.gmi ?Hoppity hop?

As it stands, the first variant cannot be handled by gemini -neither in
text/gemini, nor in the protocol itself- with further technical gotchas
such as address resolution and what not along the way.

It must be converted to the second variant, the US-ASCII one.


Let's examine the situation:

The capsule author writes this link line in their text editor:

=> gemini://🐰.mozz.us/🐇.gmi ?Hoppity hop?

The text editor may or may not change the syntax highlight to indicate an
error with the URL. When saving the file, the text editor has an
opportunity to "correct" the error by itself.

Let's say the text editor is oblivious and the capsule author doesn't run
the file through a linter. The file is ready to be served.

A visitor requests the file. The server has an opportunity to scan the file
before serving, but that would in most cases be a complete waste of
resources, so it doesn't.

The client parses the file. It has a choice to render the link line as a
link or as text. (It could also break at the first sight of bunny, but
let's assume it doesn't.) The link is only a problem if the visitor is
following it.

At this point, it doesn't matter if the visitor follows a link or writes
the URL in the address bar. The client has a choice to translate or not
translate the URL before making the request.

Domain name resolution is outside of the scope of the Gemini specification,
we don't know if it can handle UTF-8 or not. If the visitor's network
administrator has set up name resolution to accept UTF-8, they should
probably also accept the punycoded version for compatibility.

Let's assume "always punycode" is a safe option. The client has a choice of
being proactive and doing the translation, or ignoring it and letting it fail
if it will. I say both options are valid and the Gemini specification should at
most refer to other specifications on this. (The third option to just
refuse to connect is bad.)

Moving on: (We will go back later.)

We have the IP address and the request has reached the server. Let's assume
this is over the regular internet and a punycoded domain is a must.

The server compares "xn--4o8h.mozz.us" with whatever virtual hosts the
server administrator has set up in the configuration file. Is it
unreasonable for the administrator to expect the server software to match
"🐰.mozz.us" in the configuration file to "xn--4o8h.mozz.us" coming in over
the wire?

How about the other way around? It's a local network and ASCII
non-conforming bunnies hop into the server and the administrator has only
specified the punycode in the configuration file. Is it unreasonable to
expect it to match?
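
One way a server could make both spellings match is to reduce host names to their ASCII form before comparing. The sketch below does that with Python's built-in `punycode` codec. It is an assumption-laden simplification: the function names are hypothetical, and the case mapping and validity checks that a real IDNA library performs are skipped.

```python
def to_ascii_host(host):
    """Lower-case a host name and punycode any non-ASCII label."""
    labels = []
    for label in host.lower().split("."):
        if label.isascii():
            labels.append(label)  # includes labels already in xn-- form
        else:
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(labels)


def same_host(configured, requested):
    """Compare two host names after reducing both to ASCII labels."""
    return to_ascii_host(configured) == to_ascii_host(requested)


print(to_ascii_host("🐰.mozz.us"))
print(same_host("🐰.mozz.us", "xn--4o8h.mozz.us"))
print(same_host("xn--4o8h.mozz.us", "🐰.mozz.us"))
```

With this normalisation it does not matter which spelling is in the configuration file and which arrives over the wire.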

Reasonable or not, let's assume the virtual host is set up properly and go
back in time to the client making the request. What do we do about the path?

Should the client "help" the visitor by %-encoding non-ASCII bytes or send
it as is and hope for the best?

Should the client %-encode reserved characters the visitor writes in the
address bar or let them fail?

Anyway, the request reaches the server. "%20" becomes space and "%2b" becomes
plus. I see no reason why it would be hard to also convert
"%F0%9F%90%87" into bytes, so I will assume it isn't and wait for server
software programmers to tell me how wrong I am.

So now we have a string of bytes that we can use to fetch the bunny file.
Wait. What happened with the case where the bunny isn't %-encoded? Why
can't servers just blindly accept non-ASCII bytes as is? Is it a library
thing? Anyway, I really should test this in a bunch of languages but I'm
writing this on my phone on my way to work, so instead I present you this
pseudo code:

 ```
"%F0%9F%90%87".url_decode() == "\xF0\x9F\x90\x87".url_decode()
"%F0%9F%90%87".url_decode() == "\xF0\x9F\x90\x87"
"\xF0\x9F\x90\x87" == "🐇"
 ```

If these 3 lines are all true for the server software, I see no reason to
%-encode those non-ASCII bytes in the client or anywhere else. Surely I
have missed something obvious somewhere. Can anyone help me?
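
For one data point, the three checks can be written against Python's `urllib.parse`. This says nothing about other languages or about any particular server; it only shows that the decoder in this one library passes raw UTF-8 bytes through unchanged.

```python
from urllib.parse import unquote_to_bytes

encoded = "%F0%9F%90%87"
raw = b"\xF0\x9F\x90\x87"

checks = [
    # Decoding the escaped form and the raw bytes gives the same result.
    unquote_to_bytes(encoded) == unquote_to_bytes(raw),
    # The escaped form decodes to exactly the raw bytes.
    unquote_to_bytes(encoded) == raw,
    # The raw bytes are the UTF-8 encoding of the rabbit.
    raw.decode("utf-8") == "🐇",
]
print(checks)
```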

Maybe I just need coffee...

-- 
Katarina
(Please regard these ramblings as non-rhetorical)


Petite Abeille <petite.abeille (a) gmail.com>

📅 Sent: 2020-12-18 09:47
📧 Message 27 of 34

> On Dec 18, 2020, at 07:13, Katarina Eriksson <gmym at coopdot.com> wrote:
> 
> Help! I'm getting pulled in! 

Katarina! Thanks for dropping by! Welcome to the party!

> I'm assuming including me here was intentional. I truly can't tell if
> that is an accurate description of my position.

Thanks for noticing. Timing is everything. See Cunningham's Law.

> "I used to be in the US-ASCII only camp" refers to me no longer thinking
> requiring everything to be encoded to pass as US-ASCII is the best idea.
> This is me moving away from the status quo towards a possible compromise.
> Or am I missing where we're going?

Indeed, this is the crux of the issue, the notorious IRI vs. URI chasm:
native UTF vs ASCII encoded.

> I see no reason to %-encode those non-ASCII bytes in the client or
> anywhere else. Surely I have missed something obvious somewhere. Can
> anyone help me?

Genau. As it stands, the spec mandates URIs -therefore ASCII only- making
UTF IRIs V E R B O T E N! NICHT GUT! NOT COMPLIANT!

Now that we all took time to survey the lay of the land, the question is:
should the specification be amended to refer to IRI (urn:ietf:rfc:3987),
instead of URI (urn:ietf:rfc:3986)?

As simple as that.

That's all folks!


Björn Wärmedal <bjorn.warmedal (a) gmail.com>

📅 Sent: 2020-12-18 09:59
📧 Message 28 of 34

>> => teddybearoftheyear.com/vote?Иван%20Грозный Голосуй за меня! [1] [2] [3]

On a technical note: in some libraries you may have to split the URL
and encode the path, fragment, query string and parameters separately.
Otherwise the separators (#, ?, ;) may be encoded as part of the path.
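
A sketch of that component-wise encoding in Python, with the pitfall shown next to it. `encode_components` is a hypothetical helper; it keeps "%" in the safe set on the assumption that the author has already encoded percent signs and spaces, and it leaves the host alone, which would need punycode rather than percent-encoding.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_components(iri):
    """Percent-encode path, query and fragment separately."""
    parts = urlsplit(iri)
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))


iri = "gemini://teddybearoftheyear.com/vote?Иван%20Грозный"

# Encoding the whole string at once also encodes the "?" separator.
naive = quote(iri, safe=":/%")
print(naive)
print(encode_components(iri))
```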

... For me as an implementer this is starting to look a bit
frustrating. I want to please people, but I also don't want to have
all the lifejoy sucked out of me because I have to twist myself into
knots in order to properly understand and implement the protocol.

I'll bow out of this discussion now and follow it from a distance,
hoping that whatever decision is reached isn't too complicated to
implement.

Cheers,
ew0k


Petite Abeille <petite.abeille (a) gmail.com>

📅 Sent: 2020-12-18 10:35
📧 Message 29 of 34



> On Dec 18, 2020, at 10:59, Björn Wärmedal <bjorn.warmedal at gmail.com> wrote:
> 
> I'll bow out of this discussion now and follow it from a distance,
> hoping that whatever decision is reached isn't too complicated to
> implement.

The /mild/ complexity arises from URI escaping rules. If anything, IRIs
simplify things a bit for all concerned.

Consider the following URIs (as per the current spec, which all MUST 
support one way or another):

=> gemini://rabbit.hole/bunny%20%26%20carrot.gmi Bunny & Carrot: Down The Rabbit Hole, a journey.
=> gemini://rabbit.hole/%F0%9F%90%B0%20%26%20%F0%9F%A5%95.gmi 🐰 & 🥕: Down The Rabbit Hole, a journey.
=> gemini://xn--yn8h.hole/%F0%9F%90%B0%20%26%20%F0%9F%A5%95.gmi 🐰 & 🥕: Down The Rabbit Hole, a journey.

vs. IRIs:

=> gemini://rabbit.hole/bunny%20%26%20carrot.gmi Bunny & Carrot: Down The Rabbit Hole, a journey.
=> gemini://rabbit.hole/🐰%20%26%20🥕.gmi 🐰 & 🥕: Down The Rabbit Hole, a journey.
=> gemini://🐇.hole/🐰%20%26%20🥕.gmi 🐰 & 🥕: Down The Rabbit Hole, a journey.


Gary Johnson <lambdatronic (a) disroot.org>

📅 Sent: 2020-12-18 17:16
📧 Message 30 of 34

Katarina Eriksson <gmym at coopdot.com> writes:
>
> Anyway, the request reaches the server. "%20" become space and "%2b" become
> plus. I see no reason why it would be hard to also convert
> "%F0%9F%90%87" into bytes, so I will assume it isn't and wait for server
> software programmers to tell me how wrong I am.
>
> So now we have a string of bytes that we can use to fetch the bunny file.
> Wait. What happened with the case where the bunny isn't %-encoded? Why
> can't servers just blindly accept non-ASCII bytes as is? Is it a library
> thing? Anyway, I really should test this in a bunch of languages but I'm
> writing this on my phone on my way to work, so instead I present you this
> pseudo code:
>
> *ELIDED TEXT HERE*
>
> If these 3 lines are all true for the server software, I see no reason to
> %-encode those non-ASCII bytes in the client or anywhere else. Surely I
> have missed something obvious somewhere. Can anyone help me?

The Space Age server uses java.net.URI to parse incoming URI strings
into their component parts. It can accept URIs with unencoded UTF-8
path, query, and fragment parts (except that spaces must be
percent-encoded as %20). Unicode is not allowed in the hostname part.

One more data point for you,
  Gary

-- 
GPG Key ID: 7BC158ED
Use `gpg --search-keys lambdatronic' to find me
Protect yourself from surveillance: https://emailselfdefense.fsf.org
=======================================================================
()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments

Why is HTML email a security nightmare? See https://useplaintext.email/

Please avoid sending me MS-Office attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html


Petite Abeille <petite.abeille (a) gmail.com>

📅 Sent: 2020-12-18 21:26
📧 Message 31 of 34



> On Dec 18, 2020, at 18:16, Gary Johnson <lambdatronic at disroot.org> wrote:
> 
> The Space Age server uses java.net.URI to parse incoming URI strings
> into their component parts. It can accept URIs with unencoded UTF-8
> path, query, and fragment parts (except that spaces must be
> percent-encoded as %20). Unicode is not allowed in the hostname part.

Perhaps of interest:
xbib/net: Sane URL, URI, IRI implementations for Java
https://github.com/xbib/net


Gary Johnson <lambdatronic (a) disroot.org>

📅 Sent: 2020-12-18 23:38
📧 Message 32 of 34

Petite Abeille <petite.abeille at gmail.com> writes:
> Perhaps of interest:
> xbib/net: Sane URL, URI, IRI implementations for Java
> https://github.com/xbib/net

Thanks for the link. I gave it a shot, but it appears to be buggy and
doesn't have any documentation. I ended up reading through the source
code on Github to figure out how to call its API, but sadly it looks
like it can't correctly identify the host part of the incoming string.
Instead, it thinks it is part of the path, which is obviously no good.

space-age.requests> (parse-url
"gemini://🐇.mozz.us/%20🐇.gmi?some-key=🐇&🐇=some-value#🐇-fragment")
{:path "/🐇.mozz.us/ 🐇.gmi",
 :raw-query "some-key=%F0%9F%90%87&%F0%9F%90%87=some-value",
 :fragment "🐇-fragment",
 :params ["some-key=🐇" "🐇=some-value"],
 :port 1965,
 :host "",
 :raw-fragment "%F0%9F%90%87-fragment",
 :uri
 "gemini://🐇.mozz.us/%20🐇.gmi?some-key=🐇&🐇=some-value#🐇-fragment",
 :query "some-key=🐇&🐇=some-value",
 :raw-path "/%F0%9F%90%87.mozz.us/%20%F0%9F%90%87.gmi",
 :raw-host "",
 :scheme "gemini"}

Oh well, I guess I'll stick with java.net.URI for now.

Cheers,
  Gary



Petite Abeille <petite.abeille (a) gmail.com>

📅 Sent: 2020-12-19 22:29
📧 Message 33 of 34

> On Dec 19, 2020, at 00:38, Gary Johnson <lambdatronic at disroot.org> wrote:
> 
> Thanks for the link. I gave it a shot, but it appears to be buggy

Right. It appears to know a fixed list of schemes. See SchemeRegistry:

https://github.com/xbib/net/blob/master/net-url/src/main/java/org/xbib/net/scheme/SchemeRegistry.java#L18

Perhaps one needs to register one's own scheme to extend it, perhaps with
something similar to HttpScheme:

https://github.com/xbib/net/blob/master/net-url/src/main/java/org/xbib/net/scheme/HttpScheme.java#L27


Philip Linde <linde.philip (a) gmail.com>

📅 Sent: 2020-12-24 20:30
📧 Message 34 of 34

On Fri, 18 Dec 2020 07:13:24 +0100
Katarina Eriksson <gmym at coopdot.com> wrote:

> Domain name resolution is outside of the scope of the Gemini specification,
> we don't know if it can handle UTF-8 or not. If the visitor's network
> administrator has set up name resolution to accept UTF-8, they should
> probably also accept the punycoded version for compatibility.

IDNA moves what is ideally part of DNS into the application layer,
which is what the A stands for. It was somehow decided when adopting
this standard that it was better that every application that wants to
use a hostname should implement IDNA than to fix the underlying problem
in DNS.

This probably helped adoption early on because ISPs could largely leave
the cards in their card houses as they were, but creates more of a
burden for application developers, which in the long run is more
expensive.

So no, at least IDNA has to be supported by the application.

> Why can't servers just blindly accept non-ASCII bytes as is?

A fully compliant RFC 3986 implementation can't accept non-ASCII
characters. If that's what you have, you'll have to rewrite or replace
it. RFC 3987 covers this, but it's a bit more specific than blindly
accepting non-ASCII bytes. The chapters on the comparison ladder are a
good read for an overview of what may need to be implemented to avoid
false negative matching.
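
To give a rough idea of what avoiding false negatives can involve, here is a Python sketch that brings three spellings of the same path to one form before comparing. It is not a transcription of the RFC 3987 ladder: `comparable_path` is a made-up helper, and decoding every escape is more aggressive than the RFC's syntax-based normalisation, which only decodes unreserved characters.

```python
import unicodedata
from urllib.parse import quote, unquote


def comparable_path(path):
    """Decode escapes, compose characters (NFC), then re-encode."""
    decoded = unquote(path)
    composed = unicodedata.normalize("NFC", decoded)
    return quote(composed)


a = "/caf%c3%a9.gmi"    # lower-case hex digits
b = "/caf%C3%A9.gmi"    # upper-case hex digits
c = "/cafe\u0301.gmi"   # unencoded, with a combining accent

print(a == b)
print(comparable_path(a), comparable_path(b), comparable_path(c))
```

A plain string comparison treats all three as different; after normalisation they agree.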

-- 
Philip

