Three possible uses for IRIs

1. John Cowan (cowan (a) ccil.org)

(Starting a separate thread for this)

I think there are three places where IRIs could appear in
Gemini:

1) In client inputs (the address bar or CLI analogue) and outputs
(revealing a link)

2) In the Gemini protocol

3) In text/gemini link lines

I think it's important to disentangle these three cases.  Case 1 just
affects individual clients and can be left up to them, except that there is
some best-practice advice about when *n?t* to display an IRI, specifically
when there are cross-script confusables involved.  For example,
"gemini://gemini.circumlunar.xn--spce-63d/" should not be displayed as
"gemini://gemini.circumlunar.sp?ce", because that would be deceptive, even
in Gemini: you might be pointed to the Evil Version of the Gemini spec and
not realize it.

I think everyone agrees that Case 2 is a mistake: the protocol elements
should continue to be URIs.

Case 3 is the difficult one.  Should authors be allowed to write
text/gemini links with IRI references? It's not that hard for a client to
convert them to URI references.  No normalization is needed except as part
of punycoding.  However, everyone has to agree on whether this should work
or not; we don't want a user trying to follow a link and sending the Wrong
Thing to the server.

Gemini isn't just supposed to be easy to program for, it's supposed to be
easy to author, too.  Unfortunately these objectives are in conflict here.



John Cowan          http://vrici.lojban.org/~cowan        cowan at ccil.org
Lope de Vega: "It wonders me I can speak at all.  Some caitiff rogue
did rudely yerk me on the knob, wherefrom my wits yet wander."
An Englishman: "Ay, belike a filchman to the nab'll leave you
crank for a spell." --Harry Turtledove, Ruled Britannia

2. colecmac (a) protonmail.com (colecmac (a) protonmail.com)

> (Starting a separate thread for this)

Thanks, that's helpful. Hopefully all IRI discussion can move here.

> Case 3 is the difficult one.  Should authors be allowed to write text/gemini
> links with IRI references?  It's not that hard for a client to convert them
> to URI references.  No normalization is needed except as part of punycoding.
> However, everyone has to agree on whether this should work or not; we don't
> want a user trying to follow a link and sending the Wrong Thing to the server.

"It's not that hard for a client to convert them to URI references. No
normalization is needed except as part of punycoding."

I don't think that's true. To convert them to a URI reference, the domain needs
to be extracted and punycoded, then the path and query string need to be
extracted and percent-encoded in the blessed Gemini way that doesn't allow plus
signs. Doing all this requires parsing, and as I explained a couple of times in the
other thread, IRI parsing is not feasible across multiple programming languages
at this time; the libraries just don't exist.

And what if the IRI is a relative reference? As I explained in the other thread,
this will definitely require IRI parsing.

Furthermore, it's a breaking change to Gemini. I don't think that's a good idea in
any case with the possible exception of TLS security. Gemini must be reliable,
and it's too late for a breaking change.

> Gemini isn't just supposed to be easy to program for, it's supposed to be easy to
> author, too.  Unfortunately these objectives are in conflict here.

Yes, and that's unfortunate. But I think it makes sense for the stability of Gemini
and the ease of programming to come first.


Cheers,
makeworld


3. John Cowan (cowan (a) ccil.org)

On Mon, Dec 7, 2020 at 9:47 PM <colecmac at protonmail.com> wrote:


> I don't think that's true. To convert them to a URI reference, the domain
> needs to be extracted and punycoded,


Agreed.  But if you have a Punycode encoder, then the following steps will
convert an IRI reference to a URI reference, without regard to whether it
is an IRI or a relative reference:

1) Look in the IRI reference for a "//" and a following "/"; if they exist,
pass the characters in between through your encoder and substitute the
result into the IRI reference.

2) Start over from the beginning.  If a character is ASCII, leave it
unchanged.  Otherwise, take the character, convert it to UTF-8 bytes (easy)
and each byte to hex digits (trivial), decorate it with leading %
(trivial), and move on.  When you come to the end, stop.
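
As an illustration only, the two steps above might be sketched in Python
as follows. The function name iri_ref_to_uri_ref is hypothetical, and
Python's built-in "idna" codec (which implements IDNA 2003) stands in for
"your encoder"; the sketch also assumes the text between "//" and the next
"/" is a bare hostname with no port or userinfo.

```python
def iri_ref_to_uri_ref(iri_ref: str) -> str:
    """Convert an IRI reference to a URI reference using the two steps above."""
    # Step 1: look for "//" and a following "/". If both exist, pass the
    # characters in between (assumed to be a bare hostname) through the
    # Punycode encoder and substitute the result.
    start = iri_ref.find("//")
    if start != -1:
        host_start = start + 2
        host_end = iri_ref.find("/", host_start)
        if host_end != -1:
            host = iri_ref[host_start:host_end]
            ascii_host = host.encode("idna").decode("ascii")
            iri_ref = iri_ref[:host_start] + ascii_host + iri_ref[host_end:]

    # Step 2: start over from the beginning. ASCII characters are left
    # unchanged; anything else becomes its UTF-8 bytes, each written as
    # "%" followed by two hex digits.
    pieces = []
    for ch in iri_ref:
        if ord(ch) < 128:
            pieces.append(ch)
        else:
            pieces.extend("%{:02X}".format(byte) for byte in ch.encode("utf-8"))
    return "".join(pieces)


# A relative reference has no "//", so only step 2 applies.
print(iri_ref_to_uri_ref("caf\u00e9/men\u00fc"))  # caf%C3%A9/men%C3%BC
```

Note that a hostname with no "/" after it is left to step 2 by this
literal reading of the steps, which is probably not what a real client
would want.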


> Furthermore, it's a breaking change to Gemini. I don't think that's a good
> idea in any case with the possible exception of TLS security. Gemini must be
> reliable, and it's too late for a breaking change.

Probably true.  ~~sigh~~



John Cowan          http://vrici.lojban.org/~cowan        cowan at ccil.org
Normally I can handle panic attacks on my own; but panic is, at the moment,
a way of life.                 --Joseph Zitt

4. Emma Humphries (ech (a) emmah.net)

On Mon, Dec 7, 2020, at 18:47, colecmac at protonmail.com wrote:

> 
> Yes, and that's unfortunate. But I think it makes sense for the 
> stability of Gemini
> and the ease of programming to come first.

I'm perplexed that "ease of programming" is considered more important than 
"ease of adoption."

You mention that not every language supports the libraries needed for 
internationalized URLs. 

What does that lose the project vs. accessibility and broader adoption by 
non-English-speaking users for whom Gemini would be a boon with limited 
bandwidth and hardware?

I feel like I'm missing something with the emphasis on ease of client 
implementation over adoption.

Emma Humphries
gemini://gemini.djinn.party/


5. colecmac (a) protonmail.com (colecmac (a) protonmail.com)

> > Yes, and that's unfortunate. But I think it makes sense for the
> > stability of Gemini and the ease of programming to come first.
>
> I'm perplexed that "ease of programming" is considered more important
> than "ease of adoption."
>
> You mention that not every language supports the libraries needed for
> internationalized URLs.
>
> What does that lose the project vs. accessibility and broader adoption
> by non-English-speaking users for whom Gemini would be a boon with limited
> bandwidth and hardware?
>
> I feel like I'm missing something with the emphasis on ease of client
> implementation over adoption.


I was unsure when I wrote that, and I was worried it would be controversial.
But I still think it makes sense. For Gemini to be accessible, have "broader
adoption", and be "a boon" as you mention, clients need to be easy to write
and maintain. Otherwise, what will these non-English speaking users browse
and serve their content with? A few clients and servers, likely not written
in their native language?

Gemini is a non-commercial hobby project for all the developers I am aware of,
and there are advantages to that. But it also means that if the protocol is hard
to implement, the whole community suffers, because there will be fewer clients
and servers.

The fact that writing URLs for non-English languages is difficult sucks. But
due to the complexity, and most of all the fact that this would be a breaking
change, I don't see IRIs as an option.


makeworld


P.S. I'll admit I'm biased. I write more code for Gemini than I do content, and
primarily use my native language English.


6. Sean Conner (sean (a) conman.org)

It was thus said that the Great colecmac at protonmail.com once stated:
> 
> The fact that writing URLs for non-English languages is difficult sucks. But
> due to the complexity, and most of all the fact that this would be a breaking
> change, I don't see IRIs as an option.

  I thought I might see what's involved with handling IRIs.

  The actual differences between RFC-3986 (URI) and RFC-3987 (IRI), besides
one being a standard (URI) and one being a proposed standard (IRI), come
down to the characters that are accepted---the unreserved set

   unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"

becomes

   iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
   ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
                  / %xD0000-DFFFD / %xE1000-EFFFD

and the query portion changes from

   query         = *( pchar / "/" / "?" )
   pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"


to

   iquery         = *( ipchar / iprivate / "/" / "?" )
   ipchar         = iunreserved / pct-encoded / sub-delims / ":" / "@" 
   iprivate       = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD

and that's it as far as the RFCs go (aside from the rule name changes).  As
a quick proof-of-concept [1], I just accepted all non-control UTF-8 characters
as unreserved (including the private space) as that was the easiest thing to
do, and yes, it works (but does allow potentially bad IRIs through).  
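
For readers who want the stricter check, here is a small Python sketch of
the character classes quoted above. It is not the Lua/LPeg code referred
to in this message; the names UCSCHAR_RANGES, IPRIVATE_RANGES and the
is_* helpers are made up for illustration, and the ranges are transcribed
from the RFC 3987 rules shown above.

```python
# Code point ranges transcribed from the "ucschar" rule quoted above.
UCSCHAR_RANGES = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    # Planes 1 through 13: %x10000-1FFFD up to %xD0000-DFFFD.
    + [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(0x1, 0xE)]
    + [(0xE1000, 0xEFFFD)]
)

# Code point ranges transcribed from the "iprivate" rule (query part only).
IPRIVATE_RANGES = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]


def _in_ranges(ch: str, ranges) -> bool:
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in ranges)


def is_ucschar(ch: str) -> bool:
    return _in_ranges(ch, UCSCHAR_RANGES)


def is_iprivate(ch: str) -> bool:
    return _in_ranges(ch, IPRIVATE_RANGES)


def is_iunreserved(ch: str) -> bool:
    # ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
    if ch.isascii():
        return ch.isalnum() or ch in "-._~"
    return is_ucschar(ch)
```

With these helpers, a private-use character such as U+E000 is rejected by
is_iunreserved but accepted by is_iprivate, which matches the rule that
private-use characters are only allowed in the query.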

  But the code to *build* a URL from the parsed representation [2] assumes
US-ASCII.  Again, it would take just a few small changes to allow UTF-8
characters on input and escape them properly for a URL.  That's something
I'll try working on tomorrow.

  That still leaves the question of punycode [3] and Unicode normalization
(ugh).

> P.S. I'll admit I'm biased. I write more code for Gemini than I do content, and
> primarily use my native language English.

  I am biased too, as a monolingual US mutt, but I do want to try this stuff
out.

  -spc

[1]	https://github.com/spc476/LPeg-Parsers/blob/master/url.lua

[2]	https://github.com/spc476/GLV-1.12556/blob/master/Lua/GLV-1/url-util.lua

[3]	RFC-3492, which includes C code to encode and decode punycode text,
	which is valgrind clean (I checked).


7. Philip Linde (linde.philip (a) gmail.com)

On Mon, 7 Dec 2020 23:00:01 -0500
John Cowan <cowan at ccil.org> wrote:
 
> Agreed.  But if you have a Punycode encoder, then the following steps will
> convert an IRI reference to a URI reference, without regard to whether it
> is an IRI or a relative reference:
> 
> 1) Look in the IRI reference for a "//" and a following "/"; if they exist,
> pass the characters in between through your encoder and substitute the
> result into the IRI reference.
> 
> 2) Start over from the beginning.  If a character is ASCII, leave it
> unchanged.  Otherwise, take the character, convert it to UTF-8 bytes (easy)
> and each byte to hex digits (trivial), decorate it with leading %
> (trivial), and move on.  When you come to the end, stop.

There's a "draw the owl" step somewhere here regarding Unicode
normalization. Does the server like your ?:s fully composed or
decomposed, or should the server itself be responsible for
normalization?

-- 
Philip

8. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 8, 2020, at 05:06, Emma Humphries <ech at emmah.net> wrote:
> 
> I'm perplexed that "ease of programming" is considered more important
> than "ease of adoption."

Or basic use for that matter.

Gemini's narrative is built upon the fallacy that everything is 
"just-oh-so-trivial": any Dick and Jane can fire up the most bare-bones 
telnet over their trusty dial-up modem and be done. Some sort of 
citizen-programmer-publisher nirvana, without any barriers to entry whatsoever. 

Admirable.

But not quite practical. The internet stack is deep, old, and brittle. Tooling matters.

Still, all very admirable :)

Long live Gemini.


9. Philip Linde (linde.philip (a) gmail.com)

On Mon, 07 Dec 2020 20:06:27 -0800
"Emma Humphries" <ech at emmah.net> wrote:

> I'm perplexed that "ease of programming" is considered more important
> than "ease of adoption."

Consider "ease of programming" and in particular stability a subset of
"ease of adoption". There are numerous client and server
implementations because it is easy to implement, and easy to maintain
because the protocol is relatively stable even in these early stages.
The different software allows people with different goals to adopt the
protocol, and helps in weeding out shortcomings of clarity in the
specification by analysis of the subtle differences between
implementations.

> You mention that not every language supports the libraries needed for
> internationalized URLs.
>
> What does that lose the project vs. accessibility and broader adoption
> by non-English-speaking users for whom Gemini would be a boon with limited
> bandwidth and hardware?

It seems more likely that a change to this end would hurt adoption.
Numerous pieces of existing Gemini software would immediately be
invalidated. Not all of them will be updated to accommodate the change.
I could perhaps see a more pressing need for the change if internet
users worldwide weren't already used to transliteration. It's such a
small part as well. UTF-8 is acceptable (and default) in text/gemini
documents, and the text content of a capsule can indeed be written in
any of the scripts supported by Unicode.

-- 
Philip

10. Stephane Bortzmeyer (stephane (a) sources.org)

On Mon, Dec 07, 2020 at 09:37:28PM -0500,
 John Cowan <cowan at ccil.org> wrote 
 a message of 117 lines which said:

> I think there are three places where IRIs could appear in
> Gemini:

A fourth one is server configuration. (When you declare the virtual
host, for instance.)

> some best-practice advice about when *n?t* to display an IRI, specifically
> when there are cross-script confusables involved.  For example,
> "gemini://gemini.circumlunar.xn--spce-63d/" should not be displayed as
> "gemini://gemini.circumlunar.sp?ce", because that would be deceptive, even
> in Gemini: you might be pointed to the Evil Version of the Gemini spec and
> not realize it.

As I said, I regard "homograph attacks" as mostly a tale to discourage
people from using Unicode. They are not a real-world problem.


11. Stephane Bortzmeyer (stephane (a) sources.org)

On Tue, Dec 08, 2020 at 02:47:22AM +0000,
 colecmac at protonmail.com <colecmac at protonmail.com> wrote 
 a message of 50 lines which said:

> Furthermore, it's a breaking change to Gemini. I don't think that's a
> good idea in any case with the possible exception of TLS
> security. Gemini must be reliable, and it's too late for a breaking
> change.

Hold on. I'm a newbie in Gemini and I was under the impression that
Gemini is still experimental and the specification still in flux. If
it is not true, if Gemini is frozen and "take it or leave it", that's
a different matter, and we could save some time by rejecting many
discussions.


12. Sean Conner (sean (a) conman.org)

It was thus said that the Great Stephane Bortzmeyer once stated:
> On Tue, Dec 08, 2020 at 02:47:22AM +0000,
>  colecmac at protonmail.com <colecmac at protonmail.com> wrote 
>  a message of 50 lines which said:
> 
> > Furthermore, it's a breaking change to Gemini. I don't think that's a
> > good idea in any case with the possible exception of TLS
> > security. Gemini must be reliable, and it's too late for a breaking
> > change.
> 
> Hold on. I'm a newbie in Gemini and I was under the impression that
> Gemini is still experimental and the specification still in flux. If
> it is not true, if Gemini is frozen and "take it or leave it", that's
> a different matter, and we could save some time by rejecting many
> discussions.

  I know Solderpunk wants to do a series of freezes then thaws as things are
worked on, but I think things progress a bit faster than he can deal with,
or wants to deal with, given his long absences on the list.

  For me personally, I think this should be worked out, and I'm working
towards that with my own server [1].  I've had to make changes to
GLV-1.12556 in the past when the protocol changed, I can change it again.

  -spc

[1]	https://github.com/spc476/GLV-1.12556


13. William Orr (will (a) worrbase.com)


Philip Linde writes:

> On Mon, 07 Dec 2020 20:06:27 -0800
> "Emma Humphries" <ech at emmah.net> wrote:
>
>> I'm perplexed that "ease of programming" is considered more important
>> than "ease of adoption."
>
> Consider "ease of programming" and in particular stability a subset of
> "ease of adoption". There are numerous client and server
> implementations because it is easy to implement, and easy to maintain
> because the protocol is relatively stable even in these early stages.
> The different software allows people with different goals to adopt the
> protocol, and helps in weeding out shortcomings of clarity in the
> specification by analysis of the subtle differences between
> implementations.
>
>> You mention that not every language supports the libraries needed for
>> internationalized URLs.
>>
>> What does that lose the project vs. accessibility and broader adoption
>> by non-English-speaking users for whom Gemini would be a boon with limited
>> bandwidth and hardware?
>
> It seems more likely that a change to this end would hurt adoption.
> Numerous pieces of existing Gemini software would immediately be
> invalidated. Not all of them will be updated to accommodate the change.
> I could perhaps see a more pressing need for the change if internet
> users worldwide weren't already used to transliteration. It's such a
> small part as well. UTF-8 is acceptable (and default) in text/gemini
> documents, and the text content of a capsule can indeed be written in
> any of the scripts supported by Unicode.

Hey,

I'm new to this list, and a new Gemini user, but this topic is fairly 
important to me. It's discouraging to see a lot of fear-mongering around 
this topic already.

Some points that have come up a few times already in this thread as well 
as the IDN thread that I think are worth addressing:

1. Homograph attacks

Stephane has already mentioned in a different response that homograph 
attacks are fairly rare. I don't have the knowledge to say whether or not 
that's accurate, but I can speak to how they're mitigated.

In general, browsers will render the domain in the URI bar if all of the 
characters in each section belong to the same script. As an example, 
https://?pple.com will not render correctly in Firefox in the URI bar, but 
https://?????.com/ will render correctly (neither domain exists, if you 
want to check).

The other half of this comes down to domain registrars not allowing 
registrations of domains with homographs (depends on the TLD, of course).

What this comes down to, is that Gemini clients, if they wish to mitigate 
this type of attack, should apply the same algorithm as web browsers. 
Again, given the preference for client certs for authenticating sessions, 
it doesn't seem like this attack would have dire consequences anyway.
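
A very rough Python sketch of the single-script idea described above, for
clients that want to experiment with it. This is not the algorithm any
particular browser uses. The Python standard library does not expose the
Unicode Script property, so the sketch approximates a character's script
by the first word of its Unicode name (for example "LATIN" or
"CYRILLIC"); that shortcut and the function names are assumptions made
purely for illustration.

```python
import unicodedata


def label_scripts(label: str) -> set:
    """Approximate the set of scripts used by the letters in one DNS label."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # Crude stand-in for the Script property: the first word of the
            # character name, e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def single_script_labels(hostname: str) -> bool:
    """True if no label of the hostname mixes letters from several scripts."""
    return all(len(label_scripts(label)) <= 1 for label in hostname.split("."))
```

A client could show the Unicode form of a hostname only when
single_script_labels returns True and fall back to the xn-- form
otherwise. The heuristic is too strict for scripts that are legitimately
combined (Japanese kanji with kana, for instance), so a real
implementation would need proper script data.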

I also think I saw someone mention that they're worried about it from the 
IRI side as well? That attack doesn't seem like much of a realistic case, 
since if they direct you to a different page on the same server, you're 
well, still on the same server. This only becomes problematic in the case 
of shared hosting of untrusted tenants.

2. Normalization

There's been a bit of fear-mongering about normalization, which I can 
totally understand, since Unicode technical reports and the 4 normalization 
forms look intimidating at first glance.

However, as pointed out in a few RFCs, NFC is more or less the only 
normalization form that you need to worry about in *most* circumstances. 
Typed URIs should be normalized in NFC, both on server-side and 
client-side. When resolving files to the filesystem, the filename should 
be normalized to NFC. (this all assumes that your fs supports Unicode paths).

NFKC becomes more relevant in the case that you want to implement 
something like search or find-in-page. You may want a user 
to be able to type in something like 'e' and have their find include 
everything whose NFKC form is basically an 'e' (see the full set here: 
https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3ANFKC_Casefold%3De%3A%5D&g=&i=).
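
To make the NFC/NFKC distinction concrete, here is a short Python sketch
using the standard unicodedata module. The helper names to_nfc and
search_key are hypothetical, and using NFKC plus case folding as a search
key is just one plausible reading of the "find in page" idea above.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Normalize typed URIs or filenames to NFC before comparing them."""
    return unicodedata.normalize("NFC", text)


def search_key(text: str) -> str:
    """Looser key for search: NFKC folds compatibility variants together."""
    return unicodedata.normalize("NFKC", text).casefold()


decomposed = "cafe\u0301.gmi"   # "e" followed by a combining acute accent
precomposed = "caf\u00e9.gmi"   # single precomposed character

print(decomposed == precomposed)                    # False
print(to_nfc(decomposed) == to_nfc(precomposed))    # True
print(search_key("\uff25") == search_key("e"))      # True: fullwidth "E" folds to "e"
```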

3. Language support

Normalization is generally supported across different languages pretty easily.

Python has it in its stdlib: 
https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize

Golang has support: https://pkg.go.dev/golang.org/x/text/unicode/norm

Rust: https://unicode-rs.github.io/unicode-normalization/unicode_normalization/index.html

C gets its support through the venerable libicu library (you're already 
using libs for TLS): 
https://unicode-org.github.io/icu/userguide/transforms/normalization/

I will say that I don't know of any explicit IRI-handling libraries, nor 
do I know what the state of support is in different URI-handling 
libraries, but it will be something I play with as I work on gemini 
projects. I'm happy to share my experiences when I have more of them. :)

-

To address some non-technical points, I don't think that starting a new 
protocol and then deciding to ignore internationalization is necessarily 
the right way to go. In a lot of cases, internationalization sucks because 
of legacy support, and gemini doesn't *have* a legacy to preserve 
compatibility with. As I understand it, that's why TLS is mandatory, even 
though it arguably locks out some retro systems from being able to use it.

Personally, I'd like to see the spec say something about how this is 
handled before any type of freeze takes place.

--
worr


14. bie (bie (a) 202x.moe)

>   I know Solderpunk wants to do a series of freezes then thaws as things are
> worked on, but I think things progress a bit faster than he can deal with,
> or wants to deal with, given his long absences on the list.

I'd love to see a spec freeze, too. There are already a lot of gemini
servers, clients and other tools out there and breaking changes should
be avoided unless absolutely necessary.

>   For me personally, I think this should be worked out, and I'm working
> towards that with my own server [1].  I've had to make changes to
> GLV-1.12556 in the past when the protocol changed, I can change it again.

How about waiting for a consensus to develop, *at the very least*?

If the protocol were to change to allow IRIs, that's a *major breaking* 
change that to me, as someone actually serving non-English content, is
not only completely unnecessary but harmful.

1. I would still have to percent-encode my links to stay compatible with
existing clients.
2. With clients now potentially sending IRIs and not encoded URIs as
requests I would have to change the request handling in my server code
to allow for this, possibly having to add third-party dependencies.
3. I'm still not convinced this would help anyone - IRIs still have
reserved characters that have to be properly encoded - so completely 
non-technical text/gemini authors will still have to rely on proper
tooling.

bie


15. A. E. Spencer-Reed (easrng (a) gmail.com)

On Tue, Dec 8, 2020 at 5:10 AM Petite Abeille <petite.abeille at gmail.com> wrote:
> Gemini's narrative is built upon the fallacy that everything is
> "just-oh-so-trivial": any Dick and Jane can fire up the most bare-bones
> telnet over their trusty dial-up modem and be done. Some sort of
> citizen-programmer-publisher nirvana, without any barriers to entry whatsoever.

Well, not telnet, because of TLS, but openssl-s_client at least.


16. Dmitry Bogatov (gemini#lists.orbitalfox.eu#v1 (a) kaction.cc)

On Tue, Dec 08, 2020 at 12:46:47PM +0100, William Orr wrote:

> In general, browsers will render the domain in the URI bar if all of
> the characters in each section belong to the same script. As an
> example, https://?pple.com will not render correctly in Firefox in the
> URI bar, but https://?????.com/ will render correctly (neither domain
> exists, if you want to check).

A lot of extra complexity for very little value.

FWIW, the first URL you showed looks absolutely the same as the legit
"https://apple.com" I typed manually in my vim in TERM=linux.

I came to gemini because on the web I, an inhabitant of /dev/tty1, am a
third-class citizen. Please, don't bring this to Gemini. If "curl
gemini://foo.example/" is not good enough, then your feature is too
complicated.

My native language is Russian (which is not even Latin-based), and the
government website has the URL "https://gosuslugi.ru", and everything
works fine. If you ask me, IRI is a huge mistake.


17. Petite Abeille (petite.abeille (a) gmail.com)


> On Dec 8, 2020, at 12:49, bie <bie at 202x.moe> wrote:

[2020-12-08T11:31:16.041Z] <bie> fff
[2020-12-08T11:52:29.189Z] <bie> fuck this
[2020-12-08T11:52:32.918Z] <bie> lol
[2020-12-08T11:52:45.090Z] <bie> time to unsubscribe
...

[2020-12-08T12:51:08.517Z] <khuxkm> my favorite was whoever's response to 
you saying that was "oh we're a frozen spec now?"
[2020-12-08T12:51:16.729Z] <khuxkm> like YES we've been a frozen spec since, like, June
...

[2020-12-08T12:51:45.154Z] <makeworld> khuxkm: I pinged Solderpunk on 
Masto and he got back to me very quickly saying he had read the IDN thread 
and was going to come to a decision soon
[2020-12-08T12:51:52.567Z] <makeworld> Yeah lol
[2020-12-08T12:52:08.982Z] <makeworld> I think spc is getting nerd sniped
...

etc, etc, etc...

Certainly you must be aware that the logs from #gemini on tilde.chat are 
fully accessible to everyone who can be bothered, snarky comments & all. For posterity.

https://portal.mozz.us/gemini/makeworld.gq/cgi-bin/gemini-irc


18. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 8, 2020, at 14:17, A. E. Spencer-Reed <easrng at gmail.com> wrote:
> 
> Well, not telnet, because of TLS, but openssl-s_client at least.

TLS-based Telnet Security
https://tools.ietf.org/html/draft-ietf-tn3270e-telnet-tls-00

But yes, some sort of TLS layer of one kind or another  :)


19. John Cowan (cowan (a) ccil.org)

On Tue, Dec 8, 2020 at 6:49 AM bie <bie at 202x.moe> wrote:


> If the protocol were to change to allow IRIs, that's a *major breaking*
> change that to me, as someone actually serving non-English content, is
> not only completely unnecessary but harmful.
>

As I said at the beginning of this thread, I don't think anyone is actually
arguing for a change to the protocol.  What does warrant discussion is
allowing IRI references as links in the text/gemini format.

20. colecmac (a) protonmail.com (colecmac (a) protonmail.com)

[2020-12-08T15:01:30.527Z] <makeworld> Hello, Petite Abeille. Apparently 
you're watching the logs, courtesy of my server, so that you can send them 
back on to the mailing list.
[2020-12-08T15:01:50.519Z] <makeworld> I don't see the point, other than 
trying to stir the pot. Please stop.
[2020-12-08T15:02:34.637Z] <bie> ?
[2020-12-08T15:02:39.645Z] <makeworld> The fact that this channel is 
logged is in the topic, it is known. None of the comments you sent were rude, as well.


21. colecmac (a) protonmail.com (colecmac (a) protonmail.com)

On Tuesday, December 8, 2020 10:57 AM, John Cowan <cowan at ccil.org> wrote:

> On Tue, Dec 8, 2020 at 6:49 AM bie <bie at 202x.moe> wrote:
>
> > If the protocol were to change to allow IRIs, that's a *major breaking*
> > change that to me, as someone actually serving non-English content, is
> > not only completely unnecessary but harmful.
>
> As I said at the beginning of this thread, I don't think anyone is actually
> arguing for a change to the protocol.  What does warrant discussion is
> allowing IRI references as links in the text/gemini format.


I think some people really were calling for a breaking change to the protocol.
But I'm glad you're not, and I hope we can move on and stop talking about it.
What you propose here is allowing IRIs in link lines only? Or do you mean allowing
only IRIs for relative references?

I'm unsure whether that would require an IRI parser or not, but I'd feel more
confident with one. However, there is already a client torture test that *sort of*
covers this. It's not designed as an IRI test, but it includes invalid
characters in a link line.

gemini://gemini.conman.org/test/torture/0031

That page contains a link line that looks like this:

=> <0032> "Beware the bad link"

And the Go stdlib will actually correct this link and output a correct
absolute one. So in Amfora, it will go to the correct URL, which is
gemini://gemini.conman.org/test/torture/%3C0032%3E
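
For comparison, a similar correction can be sketched with Python's
urllib.parse. This is not what Amfora or the Go stdlib does internally;
the function name resolve_link and the SAFE character set are assumptions
for illustration. Python's urljoin only resolves relative references for
schemes it knows, so the sketch registers "gemini" in the module-level
lists first.

```python
from urllib.parse import quote, urljoin, uses_netloc, uses_relative

# Teach urllib about the gemini scheme so urljoin resolves relative links.
# This mutates module-level state, which is acceptable for a sketch.
for known_schemes in (uses_relative, uses_netloc):
    if "gemini" not in known_schemes:
        known_schemes.append("gemini")

# URI delimiters are left alone, as is "%" so existing escapes survive.
SAFE = "/:?#[]@!$&'()*+,;=%"


def resolve_link(base: str, target: str) -> str:
    """Percent-encode invalid characters in a link target, then resolve it."""
    escaped = quote(target, safe=SAFE)
    return urljoin(base, escaped)


base = "gemini://gemini.conman.org/test/torture/0031"
print(resolve_link(base, "<0032>"))
# gemini://gemini.conman.org/test/torture/%3C0032%3E
```

Unlike the Go behaviour described in this message, quote() also escapes
non-ASCII characters in the query string, and it leaves "+" untouched.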

I've set up my own test that contains a more complex Unicode character: ?.
It tests the path, as well as Unicode in the query strings.
You can access it at: gemini://makeworld.gq/test/iri-link.gmi

Go also corrects the link in that one, and it works. Allowing IRIs in link
lines (maybe only for relative links to ease parsing) would solve all
multi-lingual author problems.

But this is still a somewhat-breaking change, as once authors start using
these, other non-Go clients will likely begin to fail. And the correction
that Go does is not even complete, because it will not work on query strings.
And even if it did, it would not work in the Gemini way that doesn't allow
pluses, etc etc.

We're almost there with this one, but I still think it's a mistake, and it'll
make Gemini more complex. :/


makeworld


22. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 8, 2020, at 18:55, colecmac at protonmail.com wrote:
> 
> [2020-12-08T15:01:50.519Z] <makeworld> I don't see the point, other than
> trying to stir the pot. Please stop.
> [2020-12-08T15:02:39.645Z] <makeworld> The fact that this channel is
> logged is in the topic, it is known. None of the comments you sent were rude, as well.

Did I touch a nerve? Apologies. Back to our regular program then.


23. John Cowan (cowan (a) ccil.org)

On Tue, Dec 8, 2020 at 2:09 PM <colecmac at protonmail.com> wrote:


> I think some people really were calling for a breaking change to the
> protocol. But I'm glad you're not, and I hope we can move on and stop
> talking about it.
> What you propose here is allowing IRIs in link lines only?


Yes.

> Or do you mean allowing
> only IRIs for relative references?
>

No.

> I'm unsure whether that would require an IRI parser or not,


It will not, because conversion can be done before parsing, other than the
trivial parsing required to find the hostname and punycode it.  Once that
is done, converting an IRI reference to a URI reference is as
straightforward as transcoding from one character set to another, and
totally indifferent to the IRI format.  So my two steps for IRI->URI
conversion become three:

1)  NFC normalization.

2) Punycode conversion of the hostname.

3) Percent-encoding: find non-ASCII characters and convert them to %nn%nn,
or %nn%nn%nn, or %nn%nn%nn%nn sequences, where nn is two hex digits.
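
The three steps above can be sketched in Python roughly as follows. This is
an illustration under stated assumptions, not a reference implementation: the
names iri_to_uri, _encode_host and _percent_encode_non_ascii are made up for
this example, userinfo is not handled, bracketed IPv6 hosts are passed through
untouched, and the standard library's "idna" codec follows the older IDNA 2003
rules, so a real client might well choose a different punycode library.

```python
import re
import unicodedata

# Optional scheme, then "//", then the authority up to the next "/", "?" or "#".
_AUTHORITY = re.compile(r"^((?:[A-Za-z][A-Za-z0-9+.\-]*:)?//)([^/?#]*)")


def _encode_host(authority):
    """Punycode the host part of an authority, leaving any :port alone."""
    if authority.startswith("["):  # IPv6 literal: nothing to punycode
        return authority
    host, colon, port = authority.partition(":")
    if host:
        # Assumption: the stdlib codec (IDNA 2003) is good enough here.
        host = host.encode("idna").decode("ascii")
    return host + colon + port


def _percent_encode_non_ascii(text):
    """Replace each non-ASCII character by the %XX form of its UTF-8 bytes."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)


def iri_to_uri(iri):
    """Convert an IRI reference to a URI reference without fully parsing it."""
    iri = unicodedata.normalize("NFC", iri)                      # step 1
    prefix = ""
    match = _AUTHORITY.match(iri)
    if match:                                                    # step 2
        prefix = match.group(1) + _encode_host(match.group(2))
        iri = iri[match.end():]
    return prefix + _percent_encode_non_ascii(iri)               # step 3


# \u00fc is u-umlaut, e + \u0301 is a decomposed e-acute, \u00f1 is n-tilde.
print(iri_to_uri("gemini://b\u00fccher.example/cafe\u0301.gmi?q=\u00f1"))
```

Note that only the hostname needs to be located; the path and query are
treated as an opaque string, and relative references (which have no host)
skip step 2 entirely.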

It turns out that all of this is spelled out in more detail at
<https://tools.ietf.org/html/rfc3987#section-3.1>.  That section says not to
normalize unless you have the IRI in non-digital or non-UTF* format, but
since the world is not full of editors that normalize, I think Gemini
clients need to do it themselves.   That said, most keyboard drivers (even
for hard cases like Vietnamese, which has way too many vowels [*] to dedicate a
key to each) now deliver normalized text to applications.

It's good to know that some existing URI libraries support IRIs, but that
section should be convincing evidence that you can change an IRI to a URI
without parsing it (always excepting the domain name, which is trivial to
find).

> But this is still a somewhat-breaking change, as once authors start using
> these, other non-Go clients will likely begin to fail. And the correction
> that Go does is not even complete, because it will not work on query
> strings.
> And even if it did, it would not work in the Gemini way that doesn't allow
> pluses, etc etc.
>

The above transformation will work, however.  Sometimes DIY is the Right
Thing.

> We're almost there with this one, but I still think it's a mistake, and
> it'll
> make Gemini more complex. :/
>

It will.  But in the end, if Gemini succeeds even modestly there will be
more authors than programmers.

[*] 72 lower-case vowel letters: 6 vowels without diacritics plus 6 vowels
with vowel-quality diacritics, as in French, times 6 tone marks (one of
which is "no mark") as in Chinese.  And the same number in upper case.



John Cowan          http://vrici.lojban.org/~cowan        cowan at ccil.org
It's like if you meet an really old, really rich guy covered in liver
spots and breathing with an oxygen tank, and you say, "I want to be
rich, too, so I'm going to start walking with a cane and I'm going to
act crotchety and I'm going to get liver disease. --Wil Shipley

Link to individual message.

24. colecmac (a) protonmail.com (colecmac (a) protonmail.com)

> > We're almost there with this one, but I still think it's a mistake, and it'll
> > make Gemini more complex. :/
>
> It will.  But in the end, if Gemini succeeds even modestly there will be more
> authors than programmers.

This is the point that sticks out to me. Perhaps I was wrong.

The method you outlined does not seem that complex, and it really would benefit
authors. It's still a breaking change, but all existing links would still work.

The most difficult part of what you outlined is the Unicode normalization,
which maybe not all languages have libraries for, and would also require
updating every so often. But it wouldn't be a requirement for clients at all,
just something nice to have.

However, it does raise a few questions:

I assume you mean NFC normalization? Any other option seems nonsensical to me,
but I'm also new to this in general. Would be happy to be corrected.

What if the user named a domain/file/folder in a non-NFC way? Now does the server
need to support NFC as well, and apply it to vhost recognition or local file paths
to correctly match requests? That seems wrong. But so does the user entering
something visually identical to what the sysadmin typed, and things not
working.

I'm not keen to muddle up the threads again, but it seems like this proposal
completely covers IDNs as well, which is handy.

Overall, I like it. The biggest thing holding me back is the fact that it will
break clients, over time. But perhaps that's worth it for the ease-of-writing
gain for non-English speakers.

I wouldn't mind updating Amfora to support this. As I explained in my previous email,
it sort of already does this by accident.


Cheers,
makeworld

Link to individual message.

25. Sean Conner (sean (a) conman.org)

It was thus said that the Great colecmac at protonmail.com once stated:
> 
> I'm unsure whether that would require an IRI parser or not, but I'd feel more
> confident with one. However, there is already a client torture test that *sort of*
> covers this. It's not designed as an IRI test, but it includes invalid
> characters in a link line.
> 
> gemini://gemini.conman.org/test/torture/0031
> 
> That page contains a link line that looks like this:
> 
> => <0032> "Beware the bad link"
> 
> And the Go stdlib will actually correct this link and output a correct
> absolute one. So in Amfora, it will go to the correct URL, which is
> gemini://gemini.conman.org/test/torture/%3C0032%3E

  What you failed to quote from that test is:

	I'm not entirely sure what the proper response should be ...

  And it was a last minute thing to add the link to %3C0032%3E---I was
thinking it was more of an Easter Egg type of thing than what the actual
result should be.  

> I've set up my own test that contains a more complex Unicode character: ?.
> It tests the path, as well as Unicode in the query strings.
> You can access it at: gemini://makeworld.gq/test/iri-link.gmi

  I tried both the Gemini Client Torture Test 31, and your link with the
Gemini portal at portal.mozz.us.  The results were interesting.  It failed
the Gemini Client Torture Test, but loaded the page with the Unicode
character on your site.  So at least it supports percent encoding of
characters outside the ASCII range.

  -spc (So that's one more data point ... )

Link to individual message.

26. Sean Conner (sean (a) conman.org)

It was thus said that the Great Petite Abeille once stated:
> 
> [2020-12-08T12:52:08.982Z] <makeworld> I think spc is getting nerd sniped

  I don't know if I'm being nerd sniped or not, but I do think this has
brought to my attention some encoding bugs I have---namely, I don't encode
data with non-US-ASCII characters.  Fixing bugs is always A Good Thing (TM). 
I'm also looking into just how hard it would be to support IRIs.  Except for
the normalization thing, it looks to be fairly straightforward, but I
haven't worked on it that much yet.  I've already had my preconceived
notions of DNS blown out of the water over this thread (and I've implemented
a DNS library [1] so that's saying something).

  -spc (What it says, I don't know)

[1]	https://github.com/spc476/SPCDNS

Link to individual message.

27. colecmac (a) protonmail.com (colecmac (a) protonmail.com)

> 3) Percent-encoding: find non-ASCII characters and convert them to %nn%nn,
> or %nn%nn%nn, or %nn%nn%nn%nn sequences, where nn is two hex digits.

One extra thing: Gemini will need its own list of reserved characters.
The URI spec defines[1] this list:

gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"


(It also defines a list called sub-delims, but that only applies to query
strings I believe, and is irrelevant to the way Gemini uses them.)

These characters are reserved because of their use in other parts of
a URI. But Gemini does not use all those parts, such as userinfo. I
believe a reserved character list for Gemini could look like this:

":" / "/" / "#" / "?" / "[" / "]"

I left fragments ("#") in, so that clients can add support for them later,
if/when a header-to-fragment algorithm is defined, like exists for Markdown.
But that character could be removed too, which would prevent it ever being
used in that manner.

1: https://tools.ietf.org/html/rfc3986#section-2.2
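
To make the role of a reserved list concrete, here is a small Python sketch
of turning file names into link paths. The helper name link_path is made up
for this example. It relies on urllib.parse.quote with safe="", which escapes
everything outside the unreserved set (letters, digits, "-", ".", "_", "~"),
so it escapes more than the characters listed above rather than exactly
that list.

```python
from urllib.parse import quote


def link_path(segments):
    """Join file-name segments into a URI path, escaping each one separately."""
    # safe="" makes quote() escape "/" as well, so a reserved character that
    # appears inside a name can never be mistaken for URI syntax.
    return "/" + "/".join(quote(seg, safe="") for seg in segments)


print(link_path(["notes", "what?.gmi"]))     # /notes/what%3F.gmi
print(link_path(["notes", "c#-vs-go.gmi"]))  # /notes/c%23-vs-go.gmi
```

Escaping each segment on its own, before joining with "/", is what keeps a
"?" or "#" in a file name from being read as the start of a query or fragment.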


makeworld

Link to individual message.

28. John Cowan (cowan (a) ccil.org)

On Tue, Dec 8, 2020 at 4:10 PM <colecmac at protonmail.com> wrote:


> The most difficult part of what you outlined is the Unicode normalization,
> which maybe not all languages have libraries for, and would also require
> updating every so often. But it wouldn't be a requirement for clients at
> all,
> just something nice to have.
>

If a client has an unnormalized IRI, it needs to normalize it before
sending it to the server.  That said, a 2009 study looked at a sample of
700 million HTML documents, of which only 0.02% were not in NFC already,
which suggests that NFC text is already pretty dominant.

> I assume you mean NFC normalization?
>

Yes.  When I speak of normalization, I mean NFC normalization exclusively.

> What if the user named a domain/file/folder in a non-NFC way? Now does the
> server
> need to support NFC as well, and apply it to vhost recognition or local
> file paths
> to correctly match requests? That seems wrong. But so does the user
> entering
> something visually identical to what the sysadmin typed, and things not
> working.
>

I'm okay with that just failing, as file names are not really part of
text/gemini content.  The difference will be obvious to the admin by
checking the requested URIs from the server log against the %-encoded names
of the folders.
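
A short Python illustration of what that log comparison would show, using a
made-up name ("café") purely as an example. The two forms look identical on
screen but percent-encode differently:

```python
import unicodedata
from urllib.parse import quote

name_nfc = unicodedata.normalize("NFC", "caf\u00e9")  # e-acute as one code point
name_nfd = unicodedata.normalize("NFD", "caf\u00e9")  # "e" + combining acute

print(quote(name_nfc))  # caf%C3%A9
print(quote(name_nfd))  # cafe%CC%81
```

A client that normalizes to NFC will request the first form; a folder created
with the second form on disk will not match it byte for byte.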



John Cowan          http://vrici.lojban.org/~cowan        cowan at ccil.org
                I am a member of a civilization. --David Brin

Link to individual message.

29. John Cowan (cowan (a) ccil.org)

On Tue, Dec 8, 2020 at 4:42 PM <colecmac at protonmail.com> wrote:


> (It also defines a list called sub-delims, but that only applies to query
> strings I believe, and is irrelevant to the way Gemini uses them.)
>

Gemini query strings can certainly be formatted like Web query strings if
the client knows that's what the server expects.  Simple free text isn't
the only possibility.  I'm going to talk about that in a posting at some
point.

> These characters are reserved because of their use in other parts of
> a URI. But Gemini does not use all those parts, such as userinfo. I
> believe a reserved character list for Gemini could look like this:
>
> ":" / "/" / "#" / "?" / "[" / "]"
>

We still need the square brackets for the rare case when the host-part is
an IPv6 address.  The only character we could leave out with complete
safety is @, and I don't think that's worth special-casing for Gemini.
It's simpler and better to have the same rules for all URIs.

> I left fragments ("#") in, so that clients can add support for them later,
> if/when a header-to-fragment algorithm is defined, like exists for
> Markdown.
>

+1 to leaving # reserved, not only for that reason but for the same reason
as @; it's not worth making a special rule for Gemini to avoid a trivial
amount of %-encoding, especially given that most file names don't have
either one in their names.



John Cowan          http://vrici.lojban.org/~cowan        cowan at ccil.org
weirdo:    When is R7RS coming out?
Riastradh: As soon as the top is a beautiful golden brown and if you
stick a toothpick in it, the toothpick comes out dry.

Link to individual message.

30. colecmac (a) protonmail.com (colecmac (a) protonmail.com)

> > The most difficult part of what you outlined is the Unicode normalization,
> > which maybe not all languages have libraries for, and would also require
> > updating every so often. But it wouldn't be a requirement for clients at all,
> > just something nice to have.
>
> If a client has an unnormalized IRI, it needs to normalize it before sending
> it to the server.

Could you justify this? It's a good thing to have, but it feels like a big ask,
as Unicode support and especially things like normalization are not straightforward
in all languages. I don't see why it can't just be recommended and not required.

> > I assume you mean NFC normalization?
>
> Yes.  When I speak of normalization, I mean NFC normalization exclusively.

Sounds good!

> > What if the user named a domain/file/folder in a non-NFC way? Now does the server
> > need to support NFC as well, and apply it to vhost recognition or local file paths
> > to correctly match requests? That seems wrong. But so does the user entering
> > something visually identical to what the sysadmin typed, and things not
> > working.
>
> I'm okay with that just failing, as file names are not really part of text/gemini
> content.  The difference will be obvious to the admin by checking the requested
> URIs from the server log against the %-encoded names of the folders.

The issue is that admins are not the only ones who create folders and files.
Non-technical people will as well, and a bug like this will be very confusing.
Everything will look right, but it just won't work. However, I doubt this will
occur very often, and it's an acceptable tradeoff to supporting Unicode.


makeworld

Link to individual message.

31. Sean Conner (sean (a) conman.org)

It was thus said that the Great bie once stated:
> >   I know Solderpunk wants to do a series of freezes then thaws as things are
> > worked on, but I think things progress a bit faster than he can deal with,
> > or wants to deal with, given his long absences on the list.
> 
> I'd love to see a spec freeze, too. There are already a lot of gemini
> servers, clients and other tools out there and breaking changes should
> be avoided unless absolutely necessary.
> 
> >   For me personally, I think this should be worked out, and I'm working
> > towards that with my own server [1].  I've had to make changes to
> > GLV-1.12556 in the past when the protocol changed, I can change it again.
> 
> How about waiting for a consensus to develop, *at the very least*?

  If I waited for consensus, Gemini would not be where it is today [1].
Also, it brought out what I consider a bug in my code (generating links
from filenames): it doesn't properly URL encode data [2].

> If the protocol were to change to allow IRIs, that's a *major breaking* 
> change that to me, as someone actually serving non-English content, is
> not only completely unnecessary but harmful.

  I don't expect that an IRI will be allowed for a request, but that an IRI
could be in a Gemini text file and it's up to the client to do the
conversion.  And it's that bit that I'm currently exploring.  

> 3. I'm still not convinced this would help anyone - IRIs still have
> reserved characters that have to be properly encoded - so completely 
> non-technical text/gemini authors will still have to rely on proper
> tooling.

  And we won't know until somebody tries.

  -spc

[1]	There's a reason why GLV-1.12556 and gemini.conman.org were the
	first Gemini server software and server in existence, because I just
	went ahead and implemented it while solderpunk was still talking
	about it.  And I think the presence of GLV-1.12556 and
	gemini.conman.org sparked others to get busy.  And GLV-1.12556 was
	*NOT* following the specification at the time, as I disagreed with
	parts of the specification.

[2]	I don't have any non-ASCII file names, so it never crossed my mind
	to handle such things.  That is a blind spot as far as I'm
	concerned.

Link to individual message.

32. Jason McBrayer (jmcbray (a) carcosa.net)

colecmac at protonmail.com writes:

> The issue is that admins are not the only ones who create folders and files.
> Non-technical people will as well, and a bug like this will be very confusing.
> Everything will look right, but it just won't work. However, I doubt this will
> occur very often, and it's an acceptable tradeoff to supporting Unicode.

It's arguably worse than that; consider the case where your filesystem
doesn't store filenames in UTF-8 -- notably Windows stores them in UCS2.
If you're treating filenames as Unicode strings and not byte arrays, and
your language provides good abstractions for that, you're okay, but the
upshot is that both the client and the server really do need to be
Unicode aware.
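
A rough Python sketch of the "Unicode strings, not byte arrays" point. The
helper name same_name is made up for this example, and normalizing both sides
to NFC on the server is an assumption made here for illustration; the thread
leaves open whether servers should do that at all.

```python
import unicodedata
from urllib.parse import unquote


def same_name(requested_segment, filename):
    """Compare a %-encoded request segment with a file name as Unicode strings."""
    # Assumption: the server chooses to normalize both sides to NFC.
    wanted = unicodedata.normalize("NFC", unquote(requested_segment))
    actual = unicodedata.normalize("NFC", filename)
    return wanted == actual


name = "caf\u00e9.gmi"
# The same name has different bytes in different on-disk encodings...
print(name.encode("utf-8") == name.encode("utf-16-le"))  # False
# ...but compares equal once both sides are decoded to Unicode strings.
on_disk = name.encode("utf-16-le").decode("utf-16-le")
print(same_name("caf%C3%A9.gmi", on_disk))  # True
```

Comparing raw request bytes against raw file-name bytes would only work when
the filesystem happens to store UTF-8; decoding first sidesteps that.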

-- 
+-----------------------------------------------------------+
| Jason F. McBrayer                    jmcbray at carcosa.net  |
| A flower falls, even though we love it; and a weed grows, |
| even though we do not love it.            -- Dogen        |

Link to individual message.
