💾 Archived View for gemi.dev › gemini-mailing-list › 000557.gmi captured on 2024-06-16 at 13:42:37. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

[spec] IRIs, IDNs, and all that international jazz

📧 Messages: 109
🗣️ Authors: 23
📅 First Message: 2020-12-22 15:13
📅 Last Message: 2020-12-27 01:40

1. Solderpunk (solderpunk (a) posteo.net)

📅 Sent: 2020-12-22 15:13
📧 Message 1 of 109

Hi folks,

Okay, I'm finally getting involved in this discussion.  Sorry it took
me a while, and thanks for your patience.  Here's a characteristically
long email detailing how my thinking on this front has evolved in
just the past few days, starting a new thread with the [spec] topic tag.

My a priori thoughts when it became clear that this discussion was
turning into a major issue, but before I had delved into any details,
were something like this:

"Good support for arbitrary languages in Gemini is *important* and
worth putting up with a little bit of pain for.  This is the reason
the `lang` parameter was defined for the text/gemini media type,
because a text encoding alone is not sufficient for a client to know
to do what native speakers of some languages expect (like render
text right to left).  As weird and foreign as this stuff might seem
to a lot of people, only one (English) of the ten most widely spoken
languages in the world (and this doesn't change whether you count
only native speakers or all speakers) can be properly represented
in ASCII, so bailing on unicode support when it seems too hard is
very hard to justify and we should try hard to do the right thing.
That said, there obviously has to be an upper limit on complexity.
Hopefully we can strike a good balance..."

At this point, I'll also add that it was obviously my intention from
the very early days that internationalised URLs "just work" in Gemini.
The clue to this is that the spec defines Gemini requests in terms
of "UTF-8 encoded URLs".  Now that I'm a little wiser about these
things I realise that URIs (and hence URLs) by definition contain only
characters which are encoded identically in UTF-8 and ASCII, so that
"UTF-8 encoded URL", while not a contradiction of any sort, is not
a particularly powerful concept and does nothing to achieve i18n.
But I was certainly naively hoping that it did.  In my ideal world,
something like an IRI would absolutely work in Gemini with a minimum
of fuss.

Anyhow, the other night I read RFC 3987.  Not word for word, mind you,
but more than a casual skim.  At which point my thoughts became:

"Why on Earth is everybody on the ML banging on about punycode this
and normalisation that?  None of that would be relevant for Gemini.
That complexity is only required to transform IRIs into URIs, which
is a workaround for legacy software, document formats and protocols
which can't handle IRIs directly.  Gemini isn't legacy - if we did a
`s/URL/IRL/g` on the spec, we could just pass around UTF-8 encoded
IRLs without any of this hassle and things would just work.  The spec
already [somewhat mistakenly: see above] makes it clear that UTF-8 is
to be expected in requests.  This is a trivial change, not breaking
at all, let's just do it.

Of course, conversion of IDNs to punycode for the sake of DNS lookups
would still be required because we can't change the reality of deployed
DNS infrastructure, but it's insane to think this is the responsibility
of every individual client author, it's up to operating systems and
standard libraries to abstract this away.  Surely they already do this?
Let's check...

Python 3.7.3 (default, Apr  3 2019, 05:39:12)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> socket.getaddrinfo("räksmörgås.josefsson.org", 1965)
[(<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', 
('2001:9b1:8633::102', 1965, 0, 0)), (<AddressFamily.AF_INET6: 10>, 
<SocketKind.SOCK_DGRAM: 2>, 17, '', ('2001:9b1:8633::102', 1965, 0, 0)), 
(<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_RAW: 3>, 0, '', 
('2001:9b1:8633::102', 1965, 0, 0)), (<AddressFamily.AF_INET: 2>, 
<SocketKind.SOCK_STREAM: 1>, 6, '', ('178.174.241.102', 1965)), 
(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_DGRAM: 2>, 17, '', 
('178.174.241.102', 1965)), (<AddressFamily.AF_INET: 2>, 
<SocketKind.SOCK_RAW: 3>, 0, '', ('178.174.241.102', 1965))]

Yep, great, wonderful, Python does the punycode stuff for me invisibly,
this adds no extra complexity at all!"
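
A minimal sketch of the punycoding step that Python appears to perform
behind the scenes in the session above, assuming the standard library's
built-in "idna" codec (which follows the older IDNA 2003 rules).  The
helper names idn_to_ascii and resolve are illustrative only and are not
taken from any Gemini client:

```python
import socket


def idn_to_ascii(hostname: str) -> str:
    """Return the ASCII (punycode) form of a possibly non-ASCII hostname."""
    # The "idna" codec converts each non-ASCII label to an "xn--" label
    # and leaves plain ASCII labels untouched.
    return hostname.encode("idna").decode("ascii")


def resolve(hostname: str, port: int = 1965):
    """Look up a hostname, punycoding it explicitly before the DNS query."""
    return socket.getaddrinfo(idn_to_ascii(hostname), port,
                              type=socket.SOCK_STREAM)


if __name__ == "__main__":
    # Only the conversion is shown here; resolve() needs network access.
    print(idn_to_ascii("räksmörgås.josefsson.org"))
```

In an environment whose resolver does not do this implicitly, a client
would call something like idn_to_ascii() itself before the lookup.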

At this point I was, in my private thoughts, a pretty hardcore IRI
advocate, and didn't understand why anybody wouldn't be.

Then I did a little more experimenting and realised that DNS lookups
in Go don't transparently handle the punycoding like Python does, and I
was quite disappointed in Go for that.  Then I started reading through
all the mailing list posts, and realised that people weren't even upset
so much about punycoding IDNs as they were about processing IRIs to
e.g. absolutise relative IRIs or add queries.  This was considered to
require complex third party libraries in most languages.  I was kind
of baffled by this because doing this kind of operation with IRIs is
not substantially different from doing it with URIs (as Sean has shown
by actually implementing it) and I couldn't believe that something
so trivial wouldn't be well handled by standard libraries in 2020
(and, actually, based on some people's posts to the ML it seems like
it often is).  At this point my attitude became:

"Wow, the uptake of these standardised i18n tools in major programming
languages is nothing short of embarrassing.  I would be in favour
of defining Gemini as using IRLs not URLs, but when e.g. clients
written in Go fail to "just work" with these, we do not blame the
client authors and ask them to move mountains to work around the
deficiencies of their standard libraries, but blame the language
implementers.  Over time, surely, the existing DNS and URI libraries
will all be updated to follow the new standards, and those "broken"
clients will suddenly become "working" clients without their authors
even having to do anything.  It's unfortunate that there will be a
transitional period where the Gemini spec is somewhat "aspirational"
and some clients necessarily fall short due to the failings of others,
but that's better than leaving things as is and having Gemini be
forever broken with regards to internationalisation."

Then I followed the mailing list threads yet deeper, and reached
the point where Jason pointed out that RFC 3987 is only a proposed
standard, that it has effectively been abandoned by the IETF,
and that now the W3C has its own alternative standardisation of
"URL" under active development, which is "extremely WWW-centric"
(I'm taking Jason's word for this, I haven't actually looked into the
details of this yet).  This completely undermines my attitude above,
because it makes it much less likely that standard libraries will
ever be uniformly upgraded to handle IRIs correctly, and it means
we can't take the simple moral highground of saying that the Gemini
spec is based on IETF standards and it's not our fault if standard
libraries still need to lift their game to reflect those standards.

Now I honestly don't know what to think.  It has always been a core
tenet of Gemini's design that it is made by joining together mature,
widely-implemented IETF standards in simple ways, so that no heavy
lifting is required to build Gemini software in almost any language
on almost any platform because all the parts are "radically familiar".
I'm very reluctant to move away from that ideal, it's one of our core
strengths.  But I also think localisation is important and, within
reason, I buy the argument that there's a moral obligation to at least
seriously try to fix this, and the fact that other technology stacks
like the web have not is no excuse for us to do the same when we have
the opportunity to make a fresh start.  But these two principles are in
hard conflict.  There apparently *are* no mature, widely-implemented
IETF standards to handle non-ASCII URLs.  This sucks, and I really wish
it were otherwise, but I (and we, the Gemini community) are,
realistically, absolutely powerless to change this, no matter how much
we might like to.

But, *something* has to be decided.  All we can really do is be
pragmatic: consider how much pain is required to get some support
for internationalised addressing into Gemini, and consider who has
to bear that pain.  Ideally, we try to minimise the total amount
of pain, and preferentially inflict more pain on software authors
than on content authors (who are not necessarily developers or even
"power users"), and more pain on server authors than on client authors
(it's of more benefit to more people for it to be easy to roll your
own client than for it to be easy to roll your own server).

The options, then, would appear to be:

1. Nothing changes in the spec (except we remove the language
about "UTF-8 encoded URLs" because this is, frankly, a recipe for
misunderstanding).  Gemini runs entirely on URLs using only a subset
of ASCII.  Clients and servers are permitted to be highly "dumb" in
this regard, and no existing software breaks.  Ultimate responsibility
for internationalised links falls to content authors, who are obligated
to fully punycode and percent-encode all their links so they are
valid URIs, and if they do this wrong their links don't work and
it's nobody's fault but their own, and if they don't understand what
any of that even *means* they are forced to use ASCII URLs instead.
Client authors who want to be i18n friendly can visually present these
links as IRIs if they're up to it, and accept IRIs in the address
bar (or equivalent) and encode them before doing name lookups or
sending requests.  This voluntary extra complexity requires being
able to do punycoding and percent-encoding in both the forward and
backward directions.

2. We stick to ASCII-only URLs in Gemini requests, but allow IRIs in
text/gemini and require all clients to be able to suitably encode
IRIs before doing name lookups or sending requests, and to accept
IRIs in the address bar.  Content authors just write their content
in their ordinary editor in an ordinary human-readable way without
knowing what punycode or percent encoding are.  All client authors
need to be able to do punycoding and percent-encoding in the forward
direction only.  If no standard library support for this is available,
these operations need to be done from scratch.

3. We treat RFC 3987 as a first-class entity in our world, even if
the IETF has abandoned it.  IRIs are used everywhere, in text/gemini
documents and in requests.  Nobody ever has to do percent encoding
in any direction (beyond what is already required for standard URIs).
The forward punycoding requirement remains as per 2. above.  However,
instead of having to do forward percent encoding, clients now need to
be able to do things like absolutise relative IRIs.  If no standard
library support for this is available, this needs to be done from
scratch - although, note that if standard library support for percent
encoding forward and backwards is present, then the standard library
support for absolutising ASCII URLs, which we are basically already
assuming is present everywhere, is sufficient to build this up, so
this is not anywhere near as scary as it seems.  There's an additional
wrinkle here in that unicode normalisation needs to be consistent
between e.g. the client and server's idea of the domain name.
This could, I think, be made entirely the server's responsibility,
by requiring servers to normalise requests in a particular way.
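
To illustrate the normalisation wrinkle in option 3, here is a small
sketch showing two visually identical hostnames that differ at the code
point level until they are normalised.  The choice of NFC and the
function name normalise_host are assumptions made for the example; the
message above does not say which normalisation form a server would be
required to use:

```python
import unicodedata


def normalise_host(hostname: str) -> str:
    """Normalise a hostname so equivalent spellings compare equal."""
    # NFC is assumed here purely for illustration.
    return unicodedata.normalize("NFC", hostname).lower()


# "é" as one precomposed code point (U+00E9) ...
composed = "caf\u00e9.example"
# ... and as "e" followed by a combining acute accent (U+0301).
decomposed = "cafe\u0301.example"

if __name__ == "__main__":
    print(composed == decomposed)
    print(normalise_host(composed) == normalise_host(decomposed))
```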

Obviously, option 1. is preferable from the point of view of a spec
author or a software implementer, but it has to be acknowledged
that it throws international content authors under the bus (it's
true there are such authors on the ML who are happily doing exactly
what this option requires, but we need to acknowledge that people
who can converse in technical English about protocol design on a
mailing list are not a representative sample!).  From the point of
view of international content authors, 2. and 3. are equivalent.
It's true that this problem could be minimised by the availability
of servers which transform text/gemini content on the fly, and it's
true that historically I've been happiest dumping extra complexity
on server authors, but I'm not sure this is ideal - users might move
their content between hosts and suddenly have their links break,
which will seem mysterious to them.

Regarding options 2. and 3., from a strictly conceptual/aesthetic
perspective, 3. is clearly preferable.  It's much nicer not to have to
map back and forth between a user's perspective of what addresses look
like and a machine's perspective, but to use the same representation
for both.  The less client-side munging of what's in a link line,
the better.  And following an *absolute* IRI is actually easier under
option 3. than under option 2, because it preserves the beautifully
simple idea that to follow a link, you just send the corresponding
server exactly what you find in the document, not some transformation
of it or some subpart of it.  A text/gemini link line is, in fact, a
ready-to-use request with a label on it!

But we need to consider the implementation burden.  Both 2. and
3. require exactly the same punycoding before DNS lookup (and I still
hope this will become more and more transparently handled by standard
libraries over time), so it comes down to what's more widely supported
and what's easiest to implement in the absence of support: percent
encoding an IRI to a URI so it can be parsed, possibly absolutised
and then sent over the wire as a purely ASCII request, or parsing
and possibly absolutising an IRI as-is before sending it as UTF-8?
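
The forward transformation that option 2 asks of every client (punycode
the host, percent-encode the rest) might look roughly like the sketch
below.  It assumes Python's urllib.parse and the built-in "idna" codec;
the name iri_to_uri and the exact set of characters left unescaped are
choices made for this illustration, and userinfo and IPv6 literals are
deliberately not handled:

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters that are passed through unchanged.  "%" is included so that
# sequences which are already percent-encoded are not encoded twice.
_SAFE = "/:@!$&'()*+,;=%~"


def iri_to_uri(iri: str) -> str:
    """Convert an IRI into an ASCII-only URI (hypothetical helper)."""
    parts = urlsplit(iri)
    # Punycode the host part; ASCII-only hosts come back unchanged.
    netloc = (parts.hostname or "").encode("idna").decode("ascii")
    if parts.port is not None:
        netloc = f"{netloc}:{parts.port}"
    # quote() encodes non-ASCII characters as percent-escaped UTF-8 bytes.
    path = quote(parts.path, safe=_SAFE)
    query = quote(parts.query, safe=_SAFE + "?")
    fragment = quote(parts.fragment, safe=_SAFE + "?")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


if __name__ == "__main__":
    print(iri_to_uri("gemini://example.org/café.gmi"))
```

Under option 3 none of this would run on the request path; only the
host conversion would still be needed for the DNS lookup.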

It seems to have been a big point of concern on the ML that IRI parsing
is rarely supported in standard libraries and difficult to implement
from scratch, and that this totally sinks something like option 3.
But it seems to be the case that IRIs can in fact be processed with
standard tools in Python and Go, and sort-of-kinda in Java.  Of course,
that's not everywhere, but the capability doesn't exactly seem rare.
And in any environment where option 2. is easy, it seems to me that
3. could be achieved roughly as easily just by transforming an IRI
to a URI, parsing that and doing absolutisation with the standard URI
tools that we assume exist everywhere, and then translating back to an
absolute IRI in the end before sending the request.  The basic idea is
that transformation from IRI to URI happens as a last resort, only when
necessary, and the transformation is reversed as early as possible.
The extent and kind of transformation required is directly proportional
to how stubbornly ASCII-only the environment is.  There might be some
environments (seemingly Python could be one) where transformation

	never* needs to happen, and that seems better than an approach where

transformation *always* needs to happen.  So option 3. actually seems
within the realm of possibility to me, although I wouldn't want to
put my full weight behind it until some actual testing has taken place.
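
As a rough check of the claim that Python may never need the
transformation, the sketch below absolutises a relative IRI directly
with urllib.parse.urljoin.  One assumption worth flagging: urljoin only
resolves relative references for schemes it knows about, so "gemini" is
registered first.  The function name absolutise is illustrative:

```python
import urllib.parse

# urljoin() ignores the base for schemes missing from these lists, so the
# "gemini" scheme is added once at import time.
for registry in (urllib.parse.uses_relative, urllib.parse.uses_netloc):
    if "gemini" not in registry:
        registry.append("gemini")


def absolutise(base: str, reference: str) -> str:
    """Resolve a possibly relative IRI against the IRI it was found in."""
    return urllib.parse.urljoin(base, reference)


if __name__ == "__main__":
    base = "gemini://gémeaux.bortzmeyer.org/dir/page.gmi"
    print(absolutise(base, "café.gmi"))
```

No percent-encoding or punycoding happens here; the non-ASCII
characters pass through the standard URL tools untouched.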

It's true that this would be a breaking change, although of a different
kind from other breaking changes I've pushed back against in the past.
It's not as if Geminispace would suddenly become impossible to access,
or would split into two totally incompatible subspaces based on the
old and new protocol versions.  Any currently extant Gemini document
which included ASCII-only links would remain perfectly accessible
by old and new clients alike.  So, it's a relatively soft break.
Given the importance of first-class internationalisation support,
it might be worthwhile.

Feedback welcome, especially if I've overlooked anything, which is
certainly possible.  What I'd be most interested in hearing, at this
point, is client authors letting me know whether the standard library
in the language their client is implemented in can straightforwardly:

1. Parse and relativise URLs with non-ASCII characters (so, yes, okay,
   technically not URLs at all, you know what I mean) in paths and/or
   domains?
2. Transform back and forth between URIs and IRIs?
3. Do DNS lookups of IDNs without them being punycoded first?  You can
   test this with räksmörgås.josefsson.org.

Getting good data on all three of these questions for a wide range
of languages is necessary to make a well-informed decision here.

Cheers,
Solderpunk


2. cage (cage-dev (a) twistfold.it)

📅 Sent: 2020-12-22 16:23
📧 Message 2 of 109

On Tue, Dec 22, 2020 at 04:13:06PM +0100, Solderpunk wrote:
> Hi folks,

Hi!

[...]

>
> Feedback welcome, especially if I've overlooked anything, which is
> certainly possible.  What I'd be most interested in hearing, at this
> point, is client authors letting me know whether the standard library
> in the language their client is implemented in can straightforwardly:
>
> 1. Parse and relativise URLs with non-ASCII characters (so, yes, okay,
>    technically not URLs at all, you know what I mean) in paths and/or
>    domains?

Before I can answer I need help here: I do not know what "relativise"
means. Can someone explain (maybe in simple terms ;-))?

Bye!
C.


3. Solderpunk (solderpunk (a) posteo.net)

📅 Sent: 2020-12-22 16:34
📧 Message 3 of 109

On Tue Dec 22, 2020 at 5:23 PM CET, cage wrote:

> Before i can answer i need help here, i do not know what "relativise"
> means, can someone explain (maybe in simple terms ;-)).

Whoops!  In fact, I meant "absolutise" - i.e. convert a relative URL
into an absolute URL, by using the URL where the relative URL is seen to
fill in the scheme, hostname and possibly part of the path.  Sorry for
the slip up.

Cheers,
Solderpunk


4. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-22 16:54
📧 Message 4 of 109

> On Dec 22, 2020, at 16:13, Solderpunk <solderpunk at posteo.net> wrote:
> 
> Okay, I'm finally getting involved in this discussion.

Thanks for the, hmm, textwall :) Glad you are back. 

To summarize:

#1: Make ASCII Great Again. And again.
#2: Transcribe between 1 & 3
#3: Take the IRI mantel

Hopefully a fair transliteration of intent.

Clearly #3 has the most appeal.

But yes, nothing comes for free, and pragmatism may drag us back to #1.

I don't quite see what #2 is for. Midway compromise? 

Either way, let's ruminate this.


5. cage (cage-dev (a) twistfold.it)

📅 Sent: 2020-12-22 16:59
📧 Message 5 of 109

On Tue, Dec 22, 2020 at 05:34:16PM +0100, Solderpunk wrote:

Hi!

[...]

>
> Whoops!  In fact, I meant "absolutise" - i.e. convert a relative URL
> into an absolute URL, by using the URL where the relative URL is seen to
> fill in the scheme, hostname and possibly part of the path.  Sorry for
> the slip up.

No problem, I think I am able to answer now! :)

Bye!
C.


6. Solderpunk (solderpunk (a) posteo.net)

📅 Sent: 2020-12-22 17:12
📧 Message 6 of 109

On Tue Dec 22, 2020 at 5:54 PM CET, Petite Abeille wrote:

> To summarize:
>
> #1: Make ASCII Great Again. And again.
> #2: Transcribe between 1 & 3
> #3: Take the IRI mantel
>
> Hopefully a fair transliteration of intent.

Yep, that's about right. :)

> Clearly #3 has the most appeal.
>
> But yes, nothing comes for free, and pragmatism may drag us back to #1.

> I don't quite see what #2 is for. Midway compromise?

Option 2. doesn't appeal much to me either, but it seems, from my read
through of most of the ML posts in the three threads you helpfully
linked to, to be quite popular in the community, and it's also
apparently more or less what the web does, so it seemed worth listing.
Having it explicitly spelled out also makes it easy to compare exactly
how much extra work is involved in option 3. compared to this.

Cheers,
Solderpunk


7. cage (cage-dev (a) twistfold.it)

📅 Sent: 2020-12-22 17:18
📧 Message 7 of 109

On Tue, Dec 22, 2020 at 04:13:06PM +0100, Solderpunk wrote:
> Hi folks,

Hi!

[...]

> Feedback welcome, especially if I've overlooked anything, which is
> certainly possible.  What I'd be most interested in hearing, at this
> point, is client authors letting me know whether the standard library
> in the language their client is implemented in can straightforwardly:

The language I wrote my client in is Common Lisp.

> 1. Parse and relativise [absolutize] URLs with non-ASCII characters (so, yes, okay,
>    technically not URLs at all, you know what I mean) in paths and/or
>    domains?

The language has no concept of URI, IRI or even URL in the standard library.

I am aware of two free/libre libraries, but in my experience both have problems.
I ended up writing my own parser for URIs and IRIs, which is probably broken as well. ;-)

> 2. Transform back and forth between URIs and IRIs?

Before making a request I punycode the domain and percent-encode the
query and fragment (should I also percent-encode the path?).

Anyway, there is a third-party free library to do percent-encoding and
decoding.

> 3. Do DNS lookups of IDNs without them being punycoded first?  You can
>    test this with räksmörgås.josefsson.org.

There is a library in CL that does punycoding, but I wrapped a C library
(libidn2) to do the same instead. I can resolve the domain above! :)

Bye!
C.


8. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-22 17:23
📧 Message 8 of 109

> On Dec 22, 2020, at 18:18, cage <cage-dev at twistfold.it> wrote:
> 
>  (should also percent-encode the path?)

Yes. 

The individual path segments actually. 

So, given /Foo/Bar/Baz, decompose the path into individual segments Foo, 
Bar, and Baz, encode these, and reconstruct the path. Easy-peasy.
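
The segment-by-segment recipe above can be sketched in a few lines,
assuming Python's urllib.parse.quote.  The function name encode_path is
made up for this example, and the input is assumed to be a path that has
not been percent-encoded yet (an existing "%" would be escaped again):

```python
from urllib.parse import quote


def encode_path(path: str) -> str:
    """Percent-encode each path segment separately, then rejoin them."""
    # safe="" makes quote() escape everything except unreserved characters,
    # which is fine because the "/" separators are split off beforehand.
    return "/".join(quote(segment, safe="") for segment in path.split("/"))


if __name__ == "__main__":
    print(encode_path("/Foo/Bar/Baz"))
    print(encode_path("/åäö/hej hopp"))
```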


9. Stephane Bortzmeyer (stephane (a) sources.org)

📅 Sent: 2020-12-22 17:37
📧 Message 9 of 109

On Tue, Dec 22, 2020 at 04:13:06PM +0100,
 Solderpunk <solderpunk at posteo.net> wrote 
 a message of 278 lines which said:

> pointed out that RFC 3987 is only a proposed standard,

This specific point is probably irrelevant, since few people care
about the difference between "proposed standard" and "standard". HTTP
is "proposed standard", too.

See RFC 7127 for this classification.


10. Côme Chilliet (come (a) chilliet.eu)

📅 Sent: 2020-12-22 18:18
📧 Message 10 of 109

Glad to read all this, it makes a lot of sense.

I'm in full support of option 3.

In PHP from my experience parse_url can eat up any unicode I throw at it. 
I did not have to do any DNS lookup as I implemented a server and not a 
client. Percent decoding is also easy.


11. Stephane Bortzmeyer (stephane (a) sources.org)

📅 Sent: 2020-12-22 19:37
📧 Message 11 of 109

On Tue, Dec 22, 2020 at 04:13:06PM +0100,
 Solderpunk <solderpunk at posteo.net> wrote 
 a message of 278 lines which said:

> What I'd be most interested in hearing, at this point, is client
> authors letting me know whether the standard library in the language
> their client is implemented in can straightforwardly:

Tests with Python. All of this is now implemented in the Agunua tool
<https://framagit.org/bortzmeyer/agunua>.

% agunua gemini://gémeaux.bortzmeyer.org/café.gmi
# Du café

Si vous voyez cela, c'est que votre client Gemini gère les IRI.

> 1. Parse and relativise URLs with non-ASCII characters (so, yes, okay,
>    technically not URLs at all, you know what I mean) in paths and/or
>    domains?

No problem, standard library urllib.parse.urlparse parses IRI.

> 2. Transform back and forth between URIs and IRIs?

Not directly in the standard library, but the code is simple (attached).

> 3. Do DNS lookups of IDNs without them being punycoded first?  You can
>    test this with räksmörgås.josefsson.org.

Yes, punycoding is handled by the standard library
socket.getaddrinfo. (May be a violation of RFC 6055 but I did not
search further.)
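
Since the attachments for question 2 were scrubbed from the archive,
here is an independent sketch of the URI-to-IRI direction.  It is not
the attached script; the name uri_to_iri and the decoding policy are
assumptions.  Only escapes that form valid non-ASCII UTF-8 are decoded,
so reserved characters such as %2F keep their escaped form:

```python
import re
from urllib.parse import urlsplit, urlunsplit

# One or more consecutive escapes whose byte value is 0x80 or above.
_NON_ASCII_ESCAPES = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def _decode_escapes(text: str) -> str:
    def replace(match):
        raw = bytes.fromhex(match.group(0).replace("%", ""))
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            # Not valid UTF-8: leave the whole run escaped.
            return match.group(0)
    return _NON_ASCII_ESCAPES.sub(replace, text)


def uri_to_iri(uri: str) -> str:
    """Turn an ASCII URI back into a human-readable IRI (illustrative)."""
    parts = urlsplit(uri)
    # Decode "xn--" labels back to Unicode with the built-in idna codec.
    host = (parts.hostname or "").encode("ascii").decode("idna")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc,
                       _decode_escapes(parts.path),
                       _decode_escapes(parts.query),
                       _decode_escapes(parts.fragment)))


if __name__ == "__main__":
    print(uri_to_iri("gemini://example.org/caf%C3%A9.gmi"))
```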

There is also this third-party package which I did not test
<https://pypi.org/project/rfc3987/>.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: convert-iri-uri.py
Type: text/x-python
Size: 1143 bytes
Desc: not available
URL: <https://lists.orbitalfox.eu/archives/gemini/attachments/20201222/77f2451d/attachment.py>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: convert-uri-iri.py
Type: text/x-python
Size: 659 bytes
Desc: not available
URL: <https://lists.orbitalfox.eu/archives/gemini/attachments/20201222/77f2451d/attachment-0001.py>


12. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-22 20:09
📧 Message 12 of 109

> On Dec 22, 2020, at 17:54, Petite Abeille <petite.abeille at gmail.com> wrote:
> 
> #3: Take the IRI mantel

If this ever goes through, we should consider increasing the maximum 
request size to 4,096 bytes* to keep the number of characters constant.

* the current 1,024 bytes × 4, as UTF-8 can use up to four bytes per character
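
The arithmetic behind this suggestion can be checked with a short
sketch.  It assumes the request limit is counted in bytes of UTF-8 and
uses 1024 as that limit; the constant and function names are invented
for the example:

```python
MAX_REQUEST_URL_BYTES = 1024  # assumed limit, counted in bytes


def utf8_length(text: str) -> int:
    """Number of bytes the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def fits_in_request(url: str, limit: int = MAX_REQUEST_URL_BYTES) -> bool:
    """True if the UTF-8 encoded URL stays within the byte limit."""
    return utf8_length(url) <= limit


if __name__ == "__main__":
    # One character can take between one and four bytes.
    for char in ("a", "\u00e9", "\u20ac", "\U0001D11E"):
        print(repr(char), utf8_length(char))
```

With a byte-based limit, a URL made of four-byte characters runs out of
room after a quarter as many characters as an ASCII one, which is where
the factor of four comes from.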


13. Sean Conner (sean (a) conman.org)

📅 Sent: 2020-12-22 22:18
📧 Message 13 of 109

It was thus said that the Great Petite Abeille once stated:
> 
> 
> > On Dec 22, 2020, at 16:13, Solderpunk <solderpunk at posteo.net> wrote:
> > 
> > Okay, I'm finally getting involved in this discussion.
> 
> Thanks for the, hmm, textwall :) Glad you are back. 
> 
> To summarize:
> 
> #1: Make ASCII Great Again. And again.
> #2: Transcribe between 1 & 3
> #3: Take the IRI mantel
> 
> Hopefully a fair transliteration of intent.
> 
> Clearly #3 has the most appeal.
> 
> But yes, nothing comes for free, and pragmatism may drag us back to #1.
> 
> I don't quite see what #2 is for. Midway compromise? 

  1. Status quo
  2. Clients take the hit (have to support both URL and IRI)
  3. Clients and servers take the hit (both have to support URL and IRI)

  -spc


14. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-22 22:21
📧 Message 14 of 109



> On Dec 22, 2020, at 23:18, Sean Conner <sean at conman.org> wrote:
> 
>  3. Clients and servers take the hit (both have to support URL and IRI)

This being a very equalitarian commune, I say this sounds fair: everybody 
"take the hit" for the greater good.


15. Philip Linde (linde.philip (a) gmail.com)

📅 Sent: 2020-12-22 23:09
📧 Message 15 of 109

On Tue, 22 Dec 2020 16:13:06 +0100
"Solderpunk" <solderpunk at posteo.net> wrote:

> 1. Parse and relativise URLs with non-ASCII characters (so, yes, okay,
>    technically not URLs at all, you know what I mean) in paths and/or
>    domains?
> 2. Transform back and forth between URIs and IRIs?

I am using Go, which will do these things as you mentioned.

Output from net/url:

  gemini://räksmörgås.example.com:3131/åäöüÿ/hej/hopp??=?#ççç
  Scheme: gemini
  Path: /åäöüÿ/hej/hopp
  EscapedPath: /%C3%A5%C3%A4%C3%B6%C3%BC%C3%BF/hej/hopp
  RawQuery: ?=?
  Hostname: räksmörgås.example.com
  Port: 3131
  RawFragment: ççç
  EscapedFragment: %C3%A7%C3%A7%C3%A7

> 3. Do DNS lookups of IDNs without them being punycoded first?  You can
>    test this with räksmörgås.josefsson.org.

Go won't do this automatically as mentioned, but there is an
experimental standard library project golang.org/x/net/idna that can
assist. I think that this is the best approach; the use of IDNA is
application dependent and IMO shouldn't be done automatically at such a
low level.

Note that for Python, Python 3.x will correctly resolve as per your
example, but Python 2.x will not. Python 3 also doesn't support
IDNA2008 (see https://bugs.python.org/issue17305), which is slightly
incompatible with IDNA2003. There is a third party library that
supports IDNA2008. As a last resort, client authors should be able to
link to e.g. Libidn2, license permitting.
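
One commonly cited consequence of Python's codec following IDNA2003 is
its treatment of the German sharp s, which is mapped to "ss" before
encoding rather than kept as a distinct character.  The sketch below
shows that behaviour with the standard library only; the function name
idna2003 is illustrative:

```python
def idna2003(hostname: str) -> str:
    """Encode a hostname with Python's built-in (IDNA2003) codec."""
    return hostname.encode("idna").decode("ascii")


if __name__ == "__main__":
    # Under IDNA2003 rules "ß" is folded to "ss", so no "xn--" label
    # is produced for this name.
    print(idna2003("straße.example"))
```

An IDNA2008 implementation keeps the sharp s and produces a different
"xn--" label for the same input, so the two rule sets can end up
looking up different names.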

In my case the problem with implementing IDNA is not in my application.
My client is a browser plugin. The browser (Dillo) doesn't support IDN
and development is pretty slow on their end. My plugin inherits this
limitation.

Even then, I am for option #1 personally. IDN/IRI are presentational
problems which I think should be left to the client. IDN/IRI in
text/gemini for authors can be solved with tooling, but I am not sure
that's desirable. I've attached the source code to a text/gemini
formatter that "un-internationalizes" IRIs in a text/gemini document
passed on stdin anyway...discovered an HTTP-ism in net/url along the
way :)

-- 
Philip
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gmifmt.go
Type: application/octet-stream
Size: 1733 bytes
Desc: not available
URL: <https://lists.orbitalfox.eu/archives/gemini/attachments/20201223/b244c0ac/attachment.obj>


16. bie (bie (a) 202x.moe)

📅 Sent: 2020-12-23 00:41
📧 Message 16 of 109

On Tue, Dec 22, 2020 at 11:21:38PM +0100, Petite Abeille wrote:
> 
> 
> > On Dec 22, 2020, at 23:18, Sean Conner <sean at conman.org> wrote:
> > 
> >  3. Clients and servers take the hit (both have to support URL and IRI)
> 
> This being a very equalitarian commune, I say this sounds fair: 
> everybody "take the hit" for the greater good.

Everyone who stays takes the hit, that is.

My "threshold" for complexity is "can I write a conforming, relatively
strict and safe server only relying on the OpenBSD base system". It's
kind of arbitrary, sure, but what isn't.

Mandatory IRI support (and there's no real way to keep it optional,
considering queries) would push the implementation complexity a step too
far for me.

bie


17. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-23 01:02
📧 Message 17 of 109

> On Dec 23, 2020, at 01:41, bie <bie at 202x.moe> wrote:
> 
> My "threshold" for complexity is "can I write a conforming, relatively
> strict and safe server only relying on the OpenBSD base system". It's
> kind of arbitrary, sure, but what isn't.

Fair enough. But what's the showstopper really?

Not sure what the "base system" contains, nor the level at which you
interact with it, but it lists Perl as one of its components, which could handle IRIs*.

But if strict ASCII is all that the "OpenBSD base system" can do, ever, then so be it.

On the other hand, one can always, you know, write such IRI parser on 
their own. It has been done before. There must be a C compiler somewhere 
in that base system, no?

* https://metacpan.org/pod/IRI


18. bie (bie (a) 202x.moe)

📅 Sent: 2020-12-23 01:37
📧 Message 18 of 109

On Wed, Dec 23, 2020 at 02:02:06AM +0100, Petite Abeille wrote:
> 
> 
> > On Dec 23, 2020, at 01:41, bie <bie at 202x.moe> wrote:
> > 
> > My "threshold" for complexity is "can I write a conforming, relatively
> > strict and safe server only relying on the OpenBSD base system". It's
> > kind of arbitrary, sure, but what isn't.
> 
> Fair enough. But what's the showstopper really?
> 
> Not sure what the "base system" contains, nor the level at which you 
> interact with it, but it lists Perl as one of its component. Which could handle IRIs*.
> 
> But if strict ASCII is all what "OpenBSD base system" can do, ever, then so be it. 
> 
> On the other hand, one can always, you know, write such IRI parser on 
> their own. It has been done before. There must be a C compiler somewhere
> in that base system, no?
> 
> * https://metacpan.org/pod/IRI

Should have specified the language (C), too. I'm not going to be pulling
in perl, and writing a full-fledged IRI parser from scratch in C sounds
profoundly uncomfortable.

In any case, it's not about what's possible, just a purely personal
opinion about where gemini gets too complex to be fun. I'm not expecting
anyone to share my exact preferences, just putting it out there as a
single anecdotal data point (from someone who so far has been serving
mostly non-ascii content over gemini with no real problems or complaints)

bie

Link to individual message.

19. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-23 01:49
📧 Message 19 of 109


> On Dec 23, 2020, at 02:37, bie <bie at 202x.moe> wrote:
> 
> Should have specified the language (C), too. I'm not going to be pulling
> in perl, and writing a full-fledged IRI parser from scratch in C sounds
> profoundly uncomfortable.

Fair enough. And libcurl is of no help either? Or HTParse.c?

> in any case, it's not about what's possible, just a purely personal
> opinion about where gemini gets too complex to be fun.

Ok. Different pain thresholds I guess.

Link to individual message.

20. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-23 01:53
📧 Message 20 of 109



> On Dec 23, 2020, at 02:49, Petite Abeille <petite.abeille at gmail.com> wrote:
> 
> And libcurl is of no help either? Or HTParse.c?

Or https://uriparser.github.io , no use either?

Link to individual message.

21. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-23 02:10
📧 Message 21 of 109

> On Dec 23, 2020, at 02:37, bie <bie at 202x.moe> wrote:
> 
>  (from someone who so far has been serving
> mostly non-ascii content over gemini with no real problems or complaints)

Got to say, I don't get it. 

You have a perfectly functional gemini server running on openbsd, 
handcrafted in C, serving unicode content without a fuss, handling URIs, 
and the whole shebang, but suddenly IRIs push you over the brink?!

This doesn't add up.

But ok. To each their own.

Link to individual message.

22. bie (bie (a) 202x.moe)

📅 Sent: 2020-12-23 02:54
📧 Message 22 of 109

On Wed, Dec 23, 2020 at 03:10:58AM +0100, Petite Abeille wrote:
> 
> 
> > On Dec 23, 2020, at 02:37, bie <bie at 202x.moe> wrote:
> > 
> >  (from someone who so far has been serving
> > mostly non-ascii content over gemini with no real problems or complaints)
> 
> Got to say, I don't get it. 
> 
> You have a perfectly functional gemini server running on openbsd, 
> handcrafted in C, serving unicode content without a fuss, handling URIs, 
> and the whole shebang, but suddenly IRIs push you over the brink?!
> 
> This doesn't add up.

Why doesn't it add up?

My server doesn't have to know anything about unicode to serve a text
file, just like it doesn't have to be able to parse JPEGs to serve
images. IRIs means it *does* have to know something about unicode, which
ucs characters are valid IRI characters, that the "private" UCS are only
valid in the query part etc etc.

bie

Link to individual message.

23. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-23 08:45
📧 Message 23 of 109



> On Dec 23, 2020, at 03:54, bie <bie at 202x.moe> wrote:
> 
> My server doesn't have to know anything about unicode to serve a text
> file, just like it doesn't have to be able to parse JPEGs to serve
> images. IRIs means it *does* have to know something about unicode, which
> ucs characters are valid IRI characters, that the "private" UCS are only
> valid in the query part etc etc.

Ok, so Unicode again. Fair enough. Where is Plan9 when you need it. Sigh.

Link to individual message.

24. Petite Abeille (petite.abeille (a) gmail.com)

Subject Changed! New Subject: Plan9? (was Re: [spec] IRIs, IDNs, and all that international jazz)
📅 Sent: 2020-12-23 09:33
📧 Message 24 of 109

> On Dec 23, 2020, at 09:45, Petite Abeille <petite.abeille at gmail.com> wrote:
> 
> Ok, so Unicode again. Fair enough. Where is Plan9 when you need it. Sigh.

While at it, anyone running gemini on Plan9? 

https://9p.io/plan9/

Or perhaps even using Plan 9 from User Space somehow for gemini?

https://9fans.github.io/plan9port/

Link to individual message.

25. roy niang (roy (a) royniang.com)

Subject Changed! New Subject: Re: Plan9? (was Re: [spec] IRIs, IDNs, and all that international jazz)
📅 Sent: 2020-12-23 09:34
📧 Message 25 of 109

I read that molly brown works on 9front.

Link to individual message.

26. Julien Blanchard (julien (a) typed-hole.org)

Subject Changed! New Subject: Plan9? (was Re: [spec] IRIs, IDNs, and all that international jazz)
📅 Sent: 2020-12-23 09:47
📧 Message 26 of 109


> On 23 Dec 2020, at 10:33, Petite Abeille <petite.abeille at gmail.com> wrote:
> 
> While at it, anyone running gemini on Plan9?
> 
> https://9p.io/plan9/
> 
> Or perhaps even using Plan 9 from User Space somehow for gemini?
> 
> https://9fans.github.io/plan9port/

Yes! gemini://9til.de is powered by Molly Brown on 9front (a plan9 fork).

--
julienxx

Link to individual message.

27. marc (marcx2 (a) welz.org.za)

Subject Changed! New Subject: [spec] IRIs, IDNs, and all that international jazz
📅 Sent: 2020-12-23 10:00
📧 Message 27 of 109

Hi

> My "threshold" for complexity is "can I write a conforming, relatively
> strict and safe server only relying on the OpenBSD base system". It's
> kind of arbitrary, sure, but what isn't.
> 
> Mandatory IRI support (and there's no real way to keep it optional,
> considering queries) would push the implementation complexity a step too
> far for me.

TLDR:

I am with bie on the matter. Option 3 is a bridge
too far for me too.

Wall Of Text:

So I value the decency which wants to include all 
human languages in the gemini ecosystem.

But in an effort to be inclusive in one dimension
one ends up being exclusive in another dimension,
namely in the space of computer languages/host
operating systems.

It is one thing to find full i18n support in a language
such as python (slow batteries included), but what
about minorities such as tcl, lua, m4 or sed?

Protocols define the interactions between
computers. Computers don't speak any human language
at all; they are programmed in computer languages. And so
it strikes me as weird to embed the (combinatorial)
complexity of human languages deep in the protocol
stack, but risk excluding niche computer languages or
operating systems... which in some cases are just
one-man efforts.

While the OSI 7 layer network model has its
deficiencies ("all models are wrong, some are
useful"), it does help us think about a network,
from inconvenienced electrons at the lowest layer to
high level abstractions at the top.

I think internationalisation concerns belong in the
very highest level of a stack. You expect me to
say presentation or application-level, but remember
the OSI model is wrong (for instance, things like HTTP
or gemini are typically lumped into one application
layer, when there are many layers to them). The actual
highest level is the naive computer user who gets told
to "move the mouse over this and then click on this,
like so...". At that level, it might make sense for
a gemini browser to be fully localised, and render an
url in the local language (maybe even left to right,
or top to bottom).

But even at the layer just below that (the competent user
level) this starts leaking. A gemini url starts with
"gemini://" - that is ascii text, and even funnier,
taken from latin. If a non-english user is confused by
english (nay, latin, with no native speakers at all)
words, then surely "gemini://" has to be rewritten as
"tweling://" or "zwilling://" or whatever farsi,
japanese or mongolian use for "twin". If not, then a
full ascii text url should be manageable too... an
url is primarily a computer address.

Long ago I came across a version of (I think it was) Pascal
that had been localised into french with language keywords
like "begin" and "if" replaced. I am sure somebody can
justify this somehow, but I thought this was an impediment
to interoperability, and I view the internationalising
of computer protocols (as opposed to the user interfaces) 
in a similar way.

regards

marc

Link to individual message.

28. Petite Abeille (petite.abeille (a) gmail.com)

Subject Changed! New Subject: Plan9? (was Re: [spec] IRIs, IDNs, and all that international jazz)
📅 Sent: 2020-12-23 10:45
📧 Message 28 of 109



> On Dec 23, 2020, at 10:47, Julien Blanchard <julien at typed-hole.org> wrote:
> 
> Yes! gemini://9til.de is powered by Molly Brown on 9front (a plan9 fork).

Wicked! In a very good way :)

Any client side niftiness?

Link to individual message.

29. Julien Blanchard (julien (a) typed-hole.org)

📅 Sent: 2020-12-23 11:02
📧 Message 29 of 109


> On 23 Dec 2020, at 11:45, Petite Abeille <petite.abeille at gmail.com> wrote:
> 
> Wicked! In a very good way :)
> 
> Any client side niftiness?

Of course, there are (to my knowledge) gemnine 
https://git.sr.ht/~ft/gemnine and my own castor9 https://git.sr.ht/~julienxx/castor9

--
julienxx

Link to individual message.

30. Petite Abeille (petite.abeille (a) gmail.com)

Subject Changed! New Subject: [spec] IRIs, IDNs, and all that international jazz
📅 Sent: 2020-12-23 11:18
📧 Message 30 of 109

> On Dec 23, 2020, at 11:00, marc <marcx2 at welz.org.za> wrote:
> 
> TLDR:
> 
> I am with bie on the matter. Option 3 is a bridge
> too far for me too.
> 
> Wall Of Text:

Actually, I do have sympathy for your position. And, yes, you are 
technically correct. The best kind of correct. No arguments here.

Unicode is a big pill to swallow. This is why we have been stuck with 
ASCII for so long. And yes, technically, everything can be transcoded back 
and forth between ASCII and The World.

Machines talking to machines. 

But, personally, I think this is missing the bigger picture about what Gemini is about.

It's not purely a technical endeavor. 

After all -as people keep pointing out ad nauseam- if you want 
gopher/http/whatnot, you know where to find them.

Gemini has a humanistic stride to it. Some poetry, dare I say. Esthetics 
matters. A human touch matters.

This is why Unicode matters. It's the human face of a technology. This 
counts for something.

As someone used to say: "Technology alone is not enough". This should 
strike a chord with a community rooted in gopher, of all things. Gopher is 
not a "technology", it's a community, in the best sense of the term: 
people talking and sharing with other people. An exchange of ideas.

This is what Gemini cares about: people. Not technology. Even if 
technology is necessary to achieve its humanistic goals.

It's therefore my opinion that technologists like us should make the extra 
effort to make our technology as human friendly as possible. Even if this 
costs us something. We can do it. For the community.

"The details are not the details; they are the product"
-- Charles and Ray Eames

Link to individual message.

31. Petite Abeille (petite.abeille (a) gmail.com)

Subject Changed! New Subject: Plan9? (was Re: [spec] IRIs, IDNs, and all that international jazz)
📅 Sent: 2020-12-23 11:28
📧 Message 31 of 109



> On Dec 23, 2020, at 12:02, Julien Blanchard <julien at typed-hole.org> wrote:
> 
> 
>> On 23 Dec 2020, at 11:45, Petite Abeille <petite.abeille at gmail.com> wrote:
>> 
>> Wicked! In a very good way :)
>> 
>> Any client side niftiness?
> 
> Of course, there are (to my knowledge) gemnine 
> https://git.sr.ht/~ft/gemnine and my own castor9 https://git.sr.ht/~julienxx/castor9
> 

Double wicked! Always had a soft spot for plan9 :)

Link to individual message.

32. Dmitry Bogatov (gemini#lists.orbitalfox.eu#v1 (a) kaction.cc)

Subject Changed! New Subject: [spec] IRIs, IDNs, and all that international jazz
📅 Sent: 2020-12-23 11:54
📧 Message 32 of 109

On Wed, Dec 23, 2020 at 11:54:16AM +0900, bie wrote:
> My server doesn't have to know anything about unicode to serve a text
> file, just like it doesn't have to be able to parse JPEGs to serve
> images. IRIs means it *does* have to know something about unicode,
> which ucs characters are valid IRI characters, that the "private" UCS
> are only valid in the query part etc etc.

Exactly. Do not push the insane complexity of Unicode on everybody,
including those who do not need it. The good thing about the current state of
affairs is that a server can treat unicode as an opaque bytestring, and
a client does not need to be aware of unicode either: to locate links,
plain

	strstr(gmi, "=>")

is enough, and the client can just dump the response to stdout and let the
terminal driver deal with that. Or not deal.

Anything but option #1 is too much complexity in my opinion.

By the way, I really don't understand all this fuss about Unicode links.
Seriously, why? We have had

	$ rm --recursive --no-preserve-root /*

for generations, and nobody bothered to "internationalize" it into something
like

	$ ?? --?????????? --??-?????????-?????? /*

Link to individual message.

33. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-23 12:01
📧 Message 33 of 109

> On Dec 23, 2020, at 12:54, Dmitry Bogatov <gemini#lists.orbitalfox.eu#v1 at kaction.cc> wrote:
> 
> By the way, I really don't understand all this fuss about Unicode links.

That's perfectly fine. As adults, we should be able to hold and comprehend 
two divergent ideas at the same time. We can agree to disagree. 

It's not about who is "right". Both sides are correct. They just have different values.

It's a choice. That's all.

Link to individual message.

34. Shawn Nock (shawn (a) provisoire.ca)

Subject Changed! New Subject: Plan9? (was Re: [spec] IRIs, IDNs, and all that international jazz)
📅 Sent: 2020-12-23 13:38
📧 Message 34 of 109

>> On Dec 23, 2020, at 09:45, Petite Abeille <petite.abeille at gmail.com> wrote:
> While at it, anyone running gemini on Plan9? 

gemini://provisoire.ca/ (quite new, little content) is hosted on Plan9
via rc-gemd and I use castor9 as my primary client.


S
-- 
Shawn Nock <shawn at provisoire.ca>

Link to individual message.

35. mbays (a) sdf.org (mbays (a) sdf.org)

Subject Changed! New Subject: [spec] IRIs, IDNs, and all that international jazz
📅 Sent: 2020-12-23 14:00
📧 Message 35 of 109


	Tuesday, 2020-12-22 at 16:13 +0100 - Solderpunk <solderpunk at posteo.net>:


>What I'd be most interested in hearing, at this point, is client 
>authors letting me know whether the standard library in the language 
>their client is implemented in can straightforwardly:
>
>1. Parse and relativise URLs with non-ASCII characters (so, yes, okay,
>   technically not URLs at all, you know what I mean) in paths and/or
>   domains?
>2. Transform back and forth between URIs and IRIs?
>3. Do DNS lookups of IDNs without them being punycoded first?  You can
>   test this with räksmörgås.josefsson.org.

I've looked into the situation in Haskell. It isn't nearly as good as 
I'd expected. The standard uri library 'network-uri' is strictly 3986. 
There is an 'iri' library, but it isn't widely used and doesn't seem to 
be very actively maintained: I can't even get it to install with recent 
ghc (ghc-8.8.4). It only deals with parsing and rendering, afaict 
there's no normalisation or "absolutising", nor anything on transforming 
between URIs and IRIs.

As for question 3, the answer appears to be no. In ghci:
> :set -package network
package flags have changed, resetting and loading new packages...
> import Network.Socket
> getAddrInfo (Just $ defaultHints {addrSocketType = Stream}) (Just "räksmörgås.josefsson.org") (Just "1965")
*** Exception: Network.Socket.getAddrInfo (called with preferred socket type/protocol: AddrInfo {addrFlags = [], addrFamily = AF_UNSPEC, addrSocketType = Stream, addrProtocol = 0, addrAddress = 0.0.0.0:0, addrCanonName = Nothing}, host name: Just "r\228ksm\246rg\229s.josefsson.org", service name: Just "1965"): does not exist (Name or service not known)

So library support isn't perfect. However: converting between 
utf8-encoded IRIs and URIs seems pretty trivial to implement by hand 
(Step 2 in section 3.1 of the rfc, and its inverse), and there are 
punycode implementations in standard haskell libraries (e.g. in the 
'encoding' package), so I am not at all scared by option 3. I'd just 
convert IRIs to URIs for internal use and manipulation, then convert 
back when displaying, and punycode when making requests. I'm not sure 
I'm not being naive here -- someone please explain the subtleties (or 
tell me to read the existing threads on this more carefully) if so!
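
The hand-rolled conversion described above can be sketched in a few lines of Python. This is an illustrative sketch rather than a reference implementation: the function name `iri_to_uri` and the `SAFE` constant are hypothetical, Python's built-in `idna` codec follows the older IDNA 2003 rules, and any userinfo part is simply dropped.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched when escaping: RFC 3986 reserved characters,
# plus "%" so that existing percent-escapes are not escaped a second time.
SAFE = "%/?:@!$&'()*+,;="


def iri_to_uri(iri: str) -> str:
    """Map a UTF-8 IRI to an ASCII-only URI (hypothetical helper)."""
    parts = urlsplit(iri)
    # Host: convert each label to its punycoded "xn--" form.
    host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # Path, query, fragment: percent-encode the UTF-8 bytes of every
    # non-ASCII character (step 2 of RFC 3987 section 3.1).
    path = quote(parts.path, safe=SAFE)
    query = quote(parts.query, safe=SAFE)
    fragment = quote(parts.fragment, safe=SAFE)
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("gemini://räksmörgås.josefsson.org/🐇/🐰.gmi"))
```

A URL that is already pure ASCII passes through unchanged, so a client could apply this to every link before making a request.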

Link to individual message.

36. Jacob Moody (moody (a) posixcafe.org)

Subject Changed! New Subject: Plan9? (was Re: [spec] IRIs, IDNs, and all that international jazz)
📅 Sent: 2020-12-23 15:16
📧 Message 36 of 109

On Dec 23, 2020, at 09:45, Petite Abeille <petite.abeille at gmail.com> wrote:
> While at it, anyone running gemini on Plan9?

Yes my site[0] is run using my own rc-gemd[1] on 9front. For a client I
generally use gemnine. It should be possible to use rc-gemd from
plan9port but you would need some sort of UNIX tlsserver and aux/listen1.

On 12/23/20 7:38 AM, Shawn Nock wrote:
> gemini://provisoire.ca/ (quite new, little content) is hosted on Plan9
> via rc-gemd and I use castor9 as my primary client.

I am very happy you were able to make use of rc-gemd.

Cheers,
Moody

[0] gemini://posixcafe.org
[1] http://shithub.us/git/moody/rc-gemd/HEAD/info.html

Link to individual message.

37. Jason McBrayer (jmcbray (a) carcosa.net)

Subject Changed! New Subject: [spec] IRIs, IDNs, and all that international jazz
📅 Sent: 2020-12-23 16:05
📧 Message 37 of 109

"Solderpunk" <solderpunk at posteo.net> writes:

Answering for Common Lisp, as best I know (I'm kind of a n00b). Detailed
Common Lisp spam below, skip if you are afraid of parentheses.

> 1. Parse and relativise URLs with non-ASCII characters (so, yes, okay,
>    technically not URLs at all, you know what I mean) in paths and/or
>    domains?

No URI handling in the standard library, but quicklisp has libraries for
it. I'm using quri, which I think is the most used, and it seems to be
fine.

CL-USER> (defparameter *my-iri* (quri:uri "gemini://räksmörgås.josefsson.org/🐇/🐰.gmi"))
*MY-IRI*

CL-USER> *my-iri*
#<QURI.URI:URI gemini://räksmörgås.josefsson.org/🐇/🐰.gmi>

CL-USER> (quri:uri-domain *my-iri*)
"josefsson.org"

CL-USER> (quri:uri-authority *my-iri*)
"räksmörgås.josefsson.org"

CL-USER> (quri:uri-path *my-iri*)
"/🐇/🐰.gmi"

CL-USER> (quri:uri-query *my-iri*)
NIL

CL-USER> (setf (quri:uri-path *my-iri*) "🐇/🐰.gmi")
"🐇/🐰.gmi"

CL-USER> (quri:uri-path *my-iri*)
"🐇/🐰.gmi"

> 2. Transform back and forth between URIs and IRIs?

Using idna package in quicklisp on the hostname:

CL-USER> (idna:to-ascii (quri:uri-authority *my-iri*))
"xn--rksmrgs-5wao1o.josefsson.org"

CL-USER> (idna:to-unicode (idna:to-ascii (quri:uri-authority *my-iri*)))
"räksmörgås.josefsson.org"

And URL-encoding on the path:

CL-USER> (quri:url-encode (quri:uri-path *my-iri*))
"%F0%9F%90%87%2F%F0%9F%90%B0.gmi"

And decoding the path:

CL-USER> (quri:url-decode (quri:url-encode (quri:uri-path *my-iri*)))
"🐇/🐰.gmi"

I will note, however, that (quri:url-decode "🐇/🐰.gmi") produces
garbage, which means on the server I can't use the library to fix up the
space in "?/?%20?.gmi" when getting the filename for the IRI, and
will have to write a unicode-safe function to handle decoding just
IRI reserved characters.

Putting these together into a function and handling edge-cases is
something I'll do if it turns out I have to.
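
A decoder that tolerates raw non-ASCII characters next to `%XX` escapes does not need much code. The Python sketch below is only an illustration of the idea (the name `percent_decode` is hypothetical and it is not the quri API): it works on the UTF-8 bytes, replaces well-formed escapes, and copies everything else through.

```python
HEX_DIGITS = b"0123456789abcdefABCDEF"


def percent_decode(text: str) -> str:
    """Decode %XX escapes in text that may also contain raw non-ASCII."""
    raw = text.encode("utf-8")
    out = bytearray()
    i = 0
    while i < len(raw):
        pair = raw[i + 1:i + 3]
        if raw[i] == 0x25 and len(pair) == 2 and all(b in HEX_DIGITS for b in pair):
            out.append(int(pair, 16))  # "%20" becomes the single byte 0x20
            i += 3
        else:
            out.append(raw[i])  # copy everything else through untouched
            i += 1
    return out.decode("utf-8")
```

Because the loop only ever looks at the ASCII bytes `%` and the hex digits, multi-byte UTF-8 sequences can never be mistaken for an escape.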

> 3. Do DNS lookups of IDNs without them being punycoded first?  You can
>    test this with räksmörgås.josefsson.org.

The CL standard library is actually so old it doesn't have
sockets/gethostbyname. But everyone uses usocket, which is in quicklisp:

CL-USER> (usocket:get-host-by-name (quri:uri-authority *my-iri*))
#(178 174 241 102)

So that works, without punycoding, at least in my environment (sbcl
2.0.1, Linux 5.8.18, Fedora 33). It might be worth someone trying sbcl
on a BSD to see if their resolver behaves differently.

> Getting good data on all three of these questions for a wide range
> of languages is necessary to make a well-informed decision here.

Personally, I would be most gratified if option 3 proved to be workable.

-- 
+-----------------------------------------------------------+
| Jason F. McBrayer                    jmcbray at carcosa.net  |
| A flower falls, even though we love it; and a weed grows, |
| even though we do not love it.            -- Dogen        |

Link to individual message.

38. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-23 16:26
📧 Message 38 of 109

> On Dec 23, 2020, at 17:05, Jason McBrayer <jmcbray at carcosa.net> wrote:
> 
>> Getting good data on all three of these questions for a wide range
>> of languages is necessary to make a well-informed decision here.
> 
> Personally, I would be most gratified if option 3 proved to be workable.

Adding my 2¢... as a Lua aficionado -with its longstanding DIY ethos- I 
see no issues whatsoever.

Also, kudos to Sean Conner for being the standard-bearer for Lua in the Gemini space.

I'm personally always in awe at his mastery of Parsing Expression 
Grammars. A work of true beauty. 

Thanks Sean! :)

Link to individual message.

39. bie (bie (a) 202x.moe)

📅 Sent: 2020-12-23 16:31
📧 Message 39 of 109

On Tue, Dec 22, 2020 at 04:13:06PM +0100, Solderpunk wrote:
> It's true that this would be a breaking change, although of a different
> kind from other breaking changes I've pushed back against in the past.
> It's not as if Geminispace would suddenly become impossible to access,
> or would split into two totally incompatible subspaces based on the
> old and new protocol versions.  Any currently extant Gemini document
> which included ASCII-only links would remain perfectly accessible
> by old and new clients alike.  So, it's a relatively soft break.
> Given the importance of first-class internationalisation support,
> it might be worthwhile.

It's only a soft break if you don't consider the query string.

Even if all the names on my server are ASCII only, a change to IRIs
means I'll be forced to update my server to remain compatible since I
have dynamic scripts that accept content through the query string.

Solution 3 is a "hard no" for me - the increased complexity is not
something I'm willing to take on, and I'll most likely just end up
shutting down my servers.

bie

Link to individual message.

40. Jason McBrayer (jmcbray (a) carcosa.net)

📅 Sent: 2020-12-23 16:34
📧 Message 40 of 109

Sean Conner <sean at conman.org> writes:

>   1. Status quo
>   2. Clients take the hit (have to support both URL and IRI)
>   3. Clients and servers take the hit (both have to support URL and IRI)

Looking at 2, servers still have to take a hit here.

i. They need to de-punycode the hostname to compare it to configured
   virtual host names (unless virtual host names are configured in
   punycode).

ii. They need to url-decode the path in order to find matching file
    names; they have to do this already to handle reserved characters,
    though.

So the support needed for servers is possibly similar between 2 and 3.
Clients are hit a little harder by 2.
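
The two server-side steps listed above might look roughly like this in Python. It is a sketch under assumptions: the helper names are hypothetical, and it normalises both hostnames to punycode rather than de-punycoding, which is equivalent for comparison purposes.

```python
from urllib.parse import unquote


def ascii_host(host: str) -> str:
    """Normalise a hostname to lowercase punycode (hypothetical helper)."""
    return host.encode("idna").decode("ascii").lower()


def same_vhost(requested: str, configured: str) -> bool:
    # Compare in ASCII form so that it does not matter which form
    # the request or the configuration file happens to use.
    return ascii_host(requested) == ascii_host(configured)


def request_path_to_filename(path: str) -> str:
    # Percent-decode as UTF-8; invalid byte sequences raise an error
    # instead of being silently replaced.
    return unquote(path, encoding="utf-8", errors="strict")
```

A real server would still need to reject ".." segments after decoding, exactly as it does for ASCII-only paths.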

-- 
+-----------------------------------------------------------+
| Jason F. McBrayer                    jmcbray at carcosa.net  |
| A flower falls, even though we love it; and a weed grows, |
| even though we do not love it.            -- Dogen        |

Link to individual message.

41. Petite Abeille (petite.abeille (a) gmail.com)

Subject Changed! New Subject: Plan9? (was Re: [spec] IRIs, IDNs, and all that international jazz)
📅 Sent: 2020-12-23 16:43
📧 Message 41 of 109



> On Dec 23, 2020, at 14:38, Shawn Nock <shawn at provisoire.ca> wrote:
> 
> gemini://provisoire.ca/ 

Lovely domain name.

 /provisoire/ adjective: existing or done while awaiting something else, or pending replacement.

Link to individual message.

42. Petite Abeille (petite.abeille (a) gmail.com)

Subject Changed! New Subject: [spec] IRIs, IDNs, and all that international jazz
📅 Sent: 2020-12-23 16:48
📧 Message 42 of 109



> On Dec 23, 2020, at 17:31, bie <bie at 202x.moe> wrote:
> 
> Solution 3 is a "hard no" for me - the increased complexity is not
> something I'm willing to take on, and I'll most likely just end up
> shutting down my servers.

It is what it is.

Link to individual message.

43. Gary Johnson (lambdatronic (a) disroot.org)

📅 Sent: 2020-12-23 20:18
📧 Message 43 of 109

Although my server is written in Clojure, I'm leveraging the Java
standard libraries in Space Age since there is little value in
reinventing the wheel here.

In Java world, URIs can be parsed and generated with java.net.URI. This
class accepts URIs with Unicode characters in the path, query, and
fragment segments. However, it will throw an exception if Unicode
characters are included in the domain name.

Conversion between Unicode and punycode can be done with java.net.IDN.

 ```
Clojure 1.10.1
user=> (import 'java.net.IDN)
java.net.IDN
user=> (IDN/toUnicode "xn--9dbne9b.com")
"????.com"
user=> (IDN/toASCII "????.com")
"xn--9dbne9b.com"
 ```

Easy peasy.

Sadly, there is no java.net.IRI.

So if we went with options 2 or 3, I would need to manually parse the
Gemini request into segments (not particularly challenging, of course).
Then I could use java.net.IDN to perform punycode-to-Unicode or
Unicode-to-punycode encoding (depending on whether we went with option 2
or 3) to perform robust virtual hostname lookups (and presumably SNI
verification as well).

Finally, I'd need to use java.net.URI to combine the punycoded domain
name back with the path, query, and fragment segments into a valid URI
that I could then parse and percent-decode without throwing an
exception.

All of this should be doable with a bit of custom logic wrapped around
the Java standard library, so I think either option 2 or 3 should be
technically feasible from my end (or for anyone else using a language
that compiles to Java bytecode).

Happy hacking,
  Gary

-- 
GPG Key ID: 7BC158ED
Use `gpg --search-keys lambdatronic' to find me
Protect yourself from surveillance: https://emailselfdefense.fsf.org
=======================================================================
()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments

Why is HTML email a security nightmare? See https://useplaintext.email/

Please avoid sending me MS-Office attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html

Link to individual message.

44. cage (cage-dev (a) twistfold.it)

📅 Sent: 2020-12-23 20:26
📧 Message 44 of 109

On Tue, Dec 22, 2020 at 06:23:51PM +0100, Petite Abeille wrote:
>
>
> > On Dec 22, 2020, at 18:18, cage <cage-dev at twistfold.it> wrote:
> >
> >  (should also percent-encode the path?)
>
> Yes.
>
> The individual path segments actually.
>
> So, given /Foo/Bar/Baz, decompose the path into individual segments Foo, 
> Bar, and Baz, encode these, and reconstruct the path. Easy-peasy.

Seems simple, but I can make a mess even with simple things. :)

Thank you! :)
Bye!
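
The segment-by-segment encoding quoted above can be sketched in Python as follows. The function name `encode_path` is hypothetical; the point of the sketch is that splitting first keeps the "/" separators from being escaped as `%2F`.

```python
from urllib.parse import quote


def encode_path(path: str) -> str:
    """Percent-encode each path segment, leaving the "/" separators alone."""
    return "/".join(quote(segment, safe="") for segment in path.split("/"))


print(encode_path("/Foo/Bar/Baz"))
print(encode_path("/🐇/a b.gmi"))
```

Encoding the whole path in one call with an empty safe set would also escape the separators, which is rarely what a client wants.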

Link to individual message.

45. Sean Conner (sean (a) conman.org)

📅 Sent: 2020-12-23 21:59
📧 Message 45 of 109

It was thus said that the Great Solderpunk once stated:
> Feedback welcome, especially if I've overlooked anything, which is
> certainly possible.  What I'd be most interested in hearing, at this
> point, is client authors letting me know whether the standard library
> in the language their client is implemented in can straightforwardly:
> 
> 1. Parse and relativise URLs with non-ASCII characters (so, yes, okay,
>    technically not URLs at all, you know what I mean) in paths and/or
>    domains?
> 2. Transform back and forth between URIs and IRIs?
> 3. Do DNS lookups of IDNs without them being punycoded first?  You can
>    test this with räksmörgås.josefsson.org.

  For C, I'm sure there is code, somewhere, that can parse IRIs, but it's a
matter of finding it.  

  For Lua, the answers are:

	1. Yes.  I had to write some code [1][2], and modify some existing
	   code [3], but Lua now has modules to parse IRI and URIs.

	2. I can do IRI->URL, but not the other way---I have no need of a
	   URL->IRI as of yet.

	3. For my setups (systems I've been able to test), I cannot look up
	   IDNs as is---I *have* to convert to punycode first.

  -spc

[1]	https://github.com/spc476/LPeg-Parsers/blob/master/iri.lua

[2]	https://github.com/spc476/lua-conmanorg/blob/master/src/idn.c

[3]	https://github.com/spc476/GLV-1.12556/blob/master/Lua/GLV-1/url-util.lua

Link to individual message.

46. Sean Conner (sean (a) conman.org)

📅 Sent: 2020-12-23 22:21
📧 Message 46 of 109

It was thus said that the Great bie once stated:
> 
> Should have specified the language (C), too. I'm not going to be pulling
> in perl, and writing a full-fledged IRI parser from scratch in C sounds
> profoundly uncomfortable.

  So what library or code are you using now to parse URIs?

  When I wrote my IRI parser [1] I took my existing URL parser [2], and just
  changed the unreserved rule:

	ASCII:	ALPHA / DIGIT / '-' / '.' / '_' / '~'
	UTF-8:	ALPHA / DIGIT / '-' / '.' / '_' / '~' / utf8

where 'utf8' is any character 128 or higher.  I didn't bother with
restricting the private UCS set to the query because sometimes I think RFC
authors are more concerned with theory [3] than with practice and complicate
things.
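
The one-rule change described above can be illustrated outside LPeg as well. The Python sketch below is an assumption-laden illustration, not the linked parser: the names are hypothetical, and it tests code points where the original rule tests bytes, which gives the same answer for "128 or higher".

```python
import string

# The ASCII-only "unreserved" set from RFC 3986.
ASCII_UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def is_unreserved(ch: str, allow_non_ascii: bool = False) -> bool:
    """URI rule by default; the relaxed IRI rule when allow_non_ascii is set."""
    if ch in ASCII_UNRESERVED:
        return True
    # Relaxed rule: anything at or above 128 also counts as unreserved.
    return allow_non_ascii and ord(ch) >= 128
```

Like the rule it mirrors, this deliberately skips the RFC 3987 restriction of private-use characters to the query part.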

  Now the conversion of a domain name to punycode on the other hand ... I
left that to libidn.

  -spc

[1]	https://github.com/spc476/LPeg-Parsers/blob/master/iri.lua

[2]	https://github.com/spc476/LPeg-Parsers/blob/master/url.lua

[3]	A charitable way of saying "smoking crack."  I mean, RFC-822
	(written in 1982) allows:

		"Look!  I'm smoking some good stuff" (no really its good) @
		berkeley (in California) . edu

	as a valid email address!  (spaces and all) No, really, look it up!

Link to individual message.

47. Sean Conner (sean (a) conman.org)

📅 Sent: 2020-12-23 23:03
📧 Message 47 of 109

It was thus said that the Great marc once stated:
> 
> It is one thing to find full i18n support in a language such as python
> (slow batteries included), but what about minorities such as tcl, lua, m4 or
> sed?

  I have Lua covered.  I can't say for the others (other than, you really
use m4?  You are a better man than I am, Gunga Din).

> I think internationalisation concerns belong in the very highest level of a
> stack. You expect me to say presentation or application-level, but
> remember the OSI model is wrong (for instance, things like HTTP or gemini
> are typically lumped into one application layer, when there are many layers to
> them). The actual highest level is the naive computer user who gets told
> to "move the mouse over this and then click on this, like so...". At that
> level, it might make sense for a gemini browser to be fully localised, and
> render an url in the local language (maybe even left to right, or top to
> bottom).

  I have to deal with the telephony network at work.  It *is* the OSI seven
layer burrito [1] and even *there* there are baked in assumptions relating
to i18n [2].  Text is limited to ASCII.  Yup.  7-bit US-ASCII in all its
glory.  Anything else requires some very nasty hacks.  Even better, there
does exist a way to relate a name to a phone number, but it's restricted to
just 15 bytes of US-ASCII.  So "Rafaella Gabriela Sarsaparilla" gets cut to
"Rafaella Gabrie".  Lovely, isn't it?

> But even at the layer just below that (the competent user level) this starts
> leaking. A gemini url starts with "gemini://" - that is ascii text, and
> even funnier, taken from latin. If a non-english user is confused by
> english (nay, latin, with no native speakers at all) words, then surely
> "gemini://" has to be rewritten as "tweling://" or "zwilling://" or
> whatever farsi, japanese or mongolian use for "twin". If not, then a full
> ascii text url should be manageable too... an url is primarily a computer
> address.

  Sushi comes from Japanese, gesundheit from German, sauna from Finnish,
smorgasbord from Swedish, borscht from Russian and ketchup from China,
what's your point?  All those are perfectly cromulent (from Simpsons) words. 
Modern English sucks up words from all other languages.

  Also, what's the Japanese equivalent of 'https'?  I'm curious.

> Long ago I came across a version of (I think it was) Pascal
> had been localised into french with language keywords
> like "begin" and "if" replaced. 

  It wasn't H?stad [3], was it?  If it was, I made that up to make a point
about LISP.

  But yes, there have been several such localizations in the past for
various languages but they never caught on internationally for some reason. 
One language I heard about, Cornerstone, used a novel method for
identifiers---the visual representation was not part of the code but from a
map---change a variable name in one place, and every place that variable
appeared would change its name.  Pretty cool concept if you ask me.

  -spc

[1]	And a complete pain to work with.  Fortunately, it's becoming less
	and less of an issue as things are transitioning to the Internet,
	but the phone companies are fighting and screaming all the way.

[2]	??t?r??t????l?z?t???

[3]	http://boston.conman.org/2008/01/04.1

Link to individual message.

48. Sean Conner (sean (a) conman.org)

📅 Sent: 2020-12-23 23:05
📧 Message 48 of 109

It was thus said that the Great Petite Abeille once stated:
> 
> Also, kudos to Sean Conner for being the standard-bearer for Lua in the
> Gemini space.
> 
> I'm personally always in awe at his mastery of Parsing Expression
> Grammars. A work of true beauty.
> 
> Thanks Sean! :)

  You're welcome.

  -spc (I still seem to be the only one to have a server in Lua)

Link to individual message.

49. marc (marcx2 (a) welz.org.za)

📅 Sent: 2020-12-24 11:48
📧 Message 49 of 109

Hello

> > It is one thing to find full internationalisation support in a
> > language such as python
> > (slow batteries included), but what about minorities such tcl, lua, m4 or
> > sed ?
> 
>   I have Lua covered.  I can't say for the others (other than, you really
> use m4?  You are a better man than I am, Gunga Din).

So I do use m4 - it can be quite nifty to generate latex fragments, 
but that is because latex doesn't play as nicely with pipes as 
(g)roff where one can just stream things in...

m4 doesn't strike me as that special ? Prolog and postscript
felt far more exotic to me, and web servers have been written 
in the latter...

>   I have to deal with the telephony network at work.  It *is* the OSI seven
> layer burrito [1] and even *there* there are baked in assumptions relating
> to i18n [2].  Text is limited to ASCII.  Yup.  7-bit US-ASCII in all its
> glory.  Anything else requires some very nasty hacks.  

Note how the global telephone system has made it into the furthest
corners of the planet - arguably further than the internet, and did
so without worrying about internationalisation relating to their
URL equivalents (phone numbers)...

> > But even the layer just below that (the competent user level) this starts
> > leaking. A gemini url starts with "gemini://" - that is ascii text, and
> > even funnier, taken from latin. If a non-english user is confused by
> > english (nay, latin, with no native speakers at all) words, then surely
> > "gemini://" has to be rewritten as "tweling://" or "zwilling://" or
> > whatever farsi, japanese or mongolian use for "twin". If not, then an full
> > ascii text url should be manageable too... an url is primarily a computer
> > address.
> 
>   Sushi comes from Japanese, gesundheit from German, sauna from Finnish,
> smorgasbord from Swedish, borscht from Russian and ketchup from China,
> what's your point?

The insinuation was that internationalised URLs are essential
because people who don't speak english at all might not be
able to comprehend or (if their input system is sufficiently
different) generate ascii/latin text.

And my argument is that this doesn't make sense, as 
every gemini url starts with "gemini://" which is ascii text
in a language that nobody speaks anymore. And if people can manage
to type "gemini://" then a bit more ascii in the hostname or
even path should be quite manageable too even for "people who
use scripts like arabic, chinese, devanageri, etc." to quote
another list participant.

A pity that I failed to convey this point properly -
you and I (and bie, and some others) have had a very similar
conversation on the 7th and 9th of this month (under the subject
"IDN with Gemini"), where I tried to explain my position that
I view a language as a communications protocol and not the
property of an ethnicity or nation.

The desire to be inclusive is good, but we are deferential
to pretty recent concept/meme - the monolingual nation state,
which is say 200 or 300 years old. Before that (at least
in europe, but elsewhere too) each little region had pretty strong
regional dialect or even language (limited mobility or literacy
allows for rapid linguistic drift). People who were educated spoke
a second or third language to interact with the clergy or the palaces
far away.

In this regard, having people learn a new language to interact
with the internet isn't that much of an imposition, but a return
to the way things were... just scaled up to the size of the planet. 

> All those are perfectly cromulent (from Simpsons) words. 
> Modern English sucks up words from all other languages.

Older english does too: That's why a dead cow is beef. All
languages do, absent a (religious or state-sponsored) authority
enforcing a level of purity aka stasis. Living languages evolve.

>   But yes, there have been several such localizations in the past for
> various [programming] languages but they never caught on internationally 
> for some reason. 

Isn't that yet another hint ? That the point of a language is to
communicate, not to serve as a barrier, despite the machinations
of nationalists ?

regards

marc

Link to individual message.

50. Omar Polo (op (a) omarpolo.com)

📅 Sent: 2020-12-24 12:39
📧 Message 50 of 109

bie <bie at 202x.moe> writes:

>
> My server doesn't have to know anything about unicode to serve a text
> file, just like it doesn't have to be able to parse JPEGs to serve
> images. IRIs means it *does* have to know something about unicode, which
> ucs characters are valid IRI characters, that the "private" UCS are only
> valid in the query part etc etc.
>
> bie

I think we're in the same boat, as I have written from scratch my server
using only stuff that's in base on OpenBSD too.

Initially I was totally for option #3 (though I had only just finished
skimming through the RFC), but after reading your messages I was a little
scared of the consequences.

Today I did some light testing, and it seems that (IF I'm understanding
everything correctly -- please correct me otherwise) that option #3 is
actually simpler for us.

Current state of affairs: Lagrange (0.13.1), amfora and elpher all
encode "gemini.omarpolo.com/cafè.gmi" as
"gemini.omarpolo.com/caf%C3%A8.gmi".  Obviously open("caf%C3%A8.gmi")
fails, so my server returns 51 because the actual file name is
"cafè.gmi".  I would have to write code that decodes parts of the request
if I want to serve a file named like that (spoiler: I'm not gonna write it).

With IRI: the request becomes "gemini://gemini.omarpolo.com/cafè.gmi",
so open("cafè.gmi") doesn't fail.  I think that we can continue to treat
the request as a bytestring, extract the path and try to open(2) it.
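
The two cases above can be sketched in Python. This is only an illustration, not code from any server mentioned in the thread: the helper name `path_to_filename` is hypothetical, it assumes requests and on-disk file names are both UTF-8, and it does no validation at all (no check for "..", "%2F" or NUL in the decoded name).

```python
from urllib.parse import unquote_to_bytes, urlsplit


def path_to_filename(request: str) -> bytes:
    """Return the bytes a server could pass to open(2) for this request.

    Hypothetical helper: assumes a UTF-8 request and UTF-8 file names.
    """
    path = urlsplit(request).path
    # unquote_to_bytes decodes %XX escapes and encodes any literal
    # non-ASCII characters as UTF-8, so both request styles converge.
    return unquote_to_bytes(path.lstrip("/"))


# Percent-encoded request, as current clients send it.
uri_style = path_to_filename("gemini://gemini.omarpolo.com/caf%C3%A8.gmi")
# IRI-style request carrying the raw UTF-8 character.
iri_style = path_to_filename("gemini://gemini.omarpolo.com/cafè.gmi")
```

Under those assumptions both requests yield the same file name bytes, so a server that decodes percent-escapes handles either form; a server that skips decoding only handles the second.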

I know that what I'm proposing is a really poor man's solution, because
whether we choose option #1, #2 or #3, we can't really treat
the path in the URL/IRI as a bytestring and call it a day.  UNIX file
names are real bytestrings with only two forbidden octets; URLs/IRIs
aren't.

So, if I'm not missing anything, I'm all in for option #3.

Link to individual message.

51. bie (bie (a) 202x.moe)

📅 Sent: 2020-12-24 13:36
📧 Message 51 of 109

On Thu, Dec 24, 2020 at 01:39:16PM +0100, Omar Polo wrote:
> I think we're in the same boat, as I have written from scratch my server
> using only stuff that's in base on OpenBSD too.
> 
> Initially I was totally for option #3 (but I've that I've just finished
> skimming through the RFC), but by reading your messages I was a little
> scared of the consequences.
> 
> Today I did some light testing, and it seems that (IF I'm understanding
> everything correctly -- please correct me otherwise) that option #3 is
> actually simpler for us.
> 
> Current state of the affairs: both Lagrange (0.13.1), amfora and elpher
> will encode "gemini.omarpolo.com/cafè.gmi" as
> "gemini.omarpolo.com/caf%C3%A8.gmi".  Obviously open("caf%C3%A8.gmi")
> fails, so my server return 51 because the actual file name is
> "cafè.gmi".  I have to write code that de-encode parts of the request if
> I want to serve a file named like that (spoiler: I'm not gonna write it).
> 
> With IRI: the request becomes "gemini://gemini.omarpolo.com/cafè.gmi",
> so open("cafè.gmi") doesn't fail.  I think that we can continue to treat
> the request as a bytestring, extract the path and try to open(2) it.
> 
> I know that what I'm proposing is a really poor-man solution, because it
> doesn't matter we choose option #1, #2 or #3 as we can't really treat
> the path in the URL/IRL as a bytestring and call it a day.  UNIX file
> names are real bytestring with only two forbidden octet, URL/IRI
> aren't.
> 
> So, if I'm not missing anything, I'm all in for option #3.

You're kind of correct in the sense that if we just treat the request as
arbitrary bytes and not as an IRI (no validation, no handling at all),
it's simple, but I don't think that's the right way to look at this
issue. Instead, it's about the complexity of proper URI handling vs
proper IRI handling. Not to mention that IRIs can still have
percent-encoded characters!

After thinking about this for a while, the biggest issue for me is that
this is a breaking change. Breaking in the sense that it breaks *every
single compliant server we already have*! If gemini, which has been
surprisingly good at resisting breaking spec changes, accepts this, I
don't see any reason to believe that it won't happen again and again,
for equally silly reasons.

bie

Link to individual message.

52. bie (bie (a) 202x.moe)

📅 Sent: 2020-12-24 13:49
📧 Message 52 of 109

On Thu, Dec 24, 2020 at 12:48:50PM +0100, marc wrote:
> The insinuation was that internationalised URLs are essential
> because people who don't speak english at all might not be
> able to comprehend or (if their input system is sufficiently
> different) generate ascii/latin text.
> 
> And my argument is that this doesn't make sense, as 
> every gemini url starts with "gemini://" which is ascii text
> in a language that nobody speaks anymore. And if people can manage
> to type "gemini://" then a bit more ascii in the hostname or
> even path should be quite manageable too even for "people who
> use scripts like arabic, chinese, devanageri, etc." to quote
> another list participant.
> 
> A pity that I failed to convey this point properly -
> you and I (and bie, and some others) have had a very similar
> conversation on the 7th and 9th of this month (under the subject
> "IDN with Gemini"), where I tried to explain my position that
> I view a language as a communications protocol and not the
> property of an ethnicity or nation.
> 
> The desire to be inclusive is good, but we are deferential
> to pretty recent concept/meme - the monolingual nation state,
> which is say 200 or 300 years old. Before that (at least
> in europe, but elsewhere too) each little region had pretty strong
> regional dialect or even language (limited mobility or literacy
> allows for rapid linguistic drift). People who were educated spoke
> a second or third language to interact with the clergy or the palaces
> far away.
> 
> In this regard, having people learn a new language to interact
> with the internet isn't that much of an imposition, but a return
> to the way things were... just scaled up to the size of the planet. 

Just an anecdote I briefly brought up on IRC...

I briefly experimented with percent-encoded Japanese and Norwegian
addresses on some of my capsules, but quickly gave up and went back to 
pure ASCII. *Not* because typing in percent-encoded names was annoying,
but because I realized how hard it was to verbally convey my Japanese
addresses to my Norwegian friends and vice versa. The de facto
universality of ASCII might be something to embrace, not something to run away
from, if we want to be serious about being inclusive.

(marc - your posts in this thread have been great.. really appreciate
them!)

bie

Link to individual message.

53. Stephane Bortzmeyer (stephane (a) sources.org)

📅 Sent: 2020-12-24 14:29
📧 Message 53 of 109

On Thu, Dec 24, 2020 at 10:36:43PM +0900,
 bie <bie at 202x.moe> wrote 
 a message of 46 lines which said:

> After thinking about this for a while, the biggest issue for me is
> that this is a breaking change. Breaking in the sense that it breaks
> *every single compliant server we already have*! If gemini, which
> has been surprisingly good at resisting breaking spec changes,
> accepts this, I don't see any reason to believe that it won't happen
> again and again,

As I explained in
<gemini://gemi.dev/gemini-mailing-list/messages/004178.gmi>, I do
not think that backward compatibility should be a goal, since Gemini
is still experimental. Once the specification is "officially" "final",
this will be different. AFAIK, it is not the case (otherwise, what
would be the point of the [spec] topic?)

To answer your question: once the spec is "officially" adopted, it
makes sense to resist changes. We are not at this stage yet.

>  for equally silly reasons.

Internationalization is certainly not a silly reason.

Link to individual message.

54. bie (bie (a) 202x.moe)

📅 Sent: 2020-12-24 14:52
📧 Message 54 of 109

On Thu, Dec 24, 2020 at 03:29:48PM +0100, Stephane Bortzmeyer wrote:
> As I explained in
> <gemini://gemi.dev/gemini-mailing-list/messages/004178.gmi>, I do
> not think that backward compatibility should be a goal, since Gemini
> is still experimental. Once the specification is "officially" "final",
> this will be different. AFAIK, it is not the case (otherwise, what
> would be the point of the [spec] topic?)

In that case you should read the first part of the current
specification:

"Although not finalised yet, further changes to the specification are
likely to be relatively small. You can write code to this
pseudo-specification and be confident that it probably won't become
totally non-functional due to massive changes next week, but you are
still urged to keep an eye on ongoing development of the protocol and
make changes as required."

Now you might consider this proposed change to be small enough or
important enough to still make sense. I do not.

> To answer your question: once the spec is "officially" adopted, it
> makes sense to resist changes. We are not at this stage yet.
> 
> >  for equally silly reasons.
> 
> Internationalization is certainly not a silly reason.

You don't need IRIs for internationalization. So yes, it is a silly
reason.

bie

Link to individual message.

55. Stephane Bortzmeyer (stephane (a) sources.org)

📅 Sent: 2020-12-24 15:01
📧 Message 55 of 109

On Wed, Dec 23, 2020 at 11:00:58AM +0100,
 marc <marcx2 at welz.org.za> wrote 
 a message of 79 lines which said:

> So I value the decency which wants to include all 
> human languages in the gemini ecosystem.

Actually, all human *scripts*. In any case, a Gemini client or server
won't have to understand the language. (Mandatory AI in Gemini?)

> But in an effort to be inclusive in one dimension one ends up being
> exclusive in another dimension, namely in the space of computer
> languages/host operating systems.

We already do it with the mandatory TLS: some systems cannot run
Gemini (imagine a Gemini server in assembly language). 

> It is one thing to find full I18N support in a language such as
> python (slow batteries included), but what about minorities such
> tcl, lua, m4 or sed ?

Lua is not a good example since the core language is, by design,
strictly limited. Any real Lua program uses several third-party
libraries.

> And so it strikes me as weird to embed the (combinatorial)
> complexity of human languages deep in the protocol stack,

I agree but nobody suggested to force Gemini software to understand
languages, only scripts.

> But even the layer just below that (the competent user level) this
> starts leaking. A gemini url starts with "gemini://" - that is ascii
> text, and even funnier, taken from latin. If a non-english user is
> confused by english (nay, latin, with no native speakers at all)
> words, then surely "gemini://" has to be rewritten as "tweling://"
> or "zwilling://" or whatever farsi, japanese or mongolian use for
> "twin". If not, then an full ascii text url should be manageable
> too...

The Web solved the problem by making the URI scheme optional. I don't
know of any Gemini clients that complete the URI with "gemini://" if it's
missing, but it is a possible approach.
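
A rough sketch of that approach in Python, for illustration only. The function name `complete_scheme` and the rule it uses (treat anything without "://" as a bare host and path) are assumptions, not the behaviour of any existing Gemini client.

```python
def complete_scheme(typed: str, default: str = "gemini") -> str:
    """Prepend a default scheme when the user typed a bare host/path.

    Hypothetical helper: input containing "://" is returned unchanged.
    """
    if "://" in typed:
        return typed
    return f"{default}://{typed}"


bare = complete_scheme("gemini.circumlunar.space/docs/")
full = complete_scheme("gemini://gemini.circumlunar.space/docs/")
```

A real client would also need to decide what to do with input such as "mailto:someone", which has a scheme but no "://".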

> an url is primarily a computer address.

This is clearly false. URI are both a technical identifier (like an IP
address or an address in memory) *and* a text seen by humans and
displayed in TV ads, business cards, spoken over the phone,
etc. Unlike addresses, they have to be internationalized. (Nobody
would use the Web if HTTP URIs were really addresses.)

> Long ago I came across a version of (I think it was) Pascal had been
> localised into french with language keywords like "begin" and "if"
> replaced. I am sure somebody can justify this somehow, but I thought
> this was an impediment to interoperability, and view the
> internationalising of computer protocols (as opposed to the user
> interfaces) in a similar way.

The idea is to have many more users than page authors and many more
page authors than programmers. Internationalizing programming
languages is a different issue, since programmers are a smaller group,
of professionals.

Link to individual message.

56. John Cowan (cowan (a) ccil.org)

📅 Sent: 2020-12-25 00:08
📧 Message 56 of 109

On Thu, Dec 24, 2020 at 6:49 AM marc <marcx2 at welz.org.za> wrote:

> Note how the global telephone system has made it into the furthest
> corners of the planet - arguably further than the internet, and did
> so without worrying about internationalisation relating to their
> URL equivalents (phone numbers)...
>

As someone who grew up actually rotating a dial to enter 7, 10, or (for
international calls) 15 digits, and looking them up in a paper booklet when
I hadn't memorized them, the user experience *sucked*.  Rectangular dials
are quicker, but otherwise not that much easier to use.  You could get a
name-to-number mapping by voice if you had enough details (typically a
postal address), but that is increasingly useless except for reaching a
business.  So what we have now is a system where numbers are universal and
the associated names are purely local.

> The insinuation was that internationalised URLs are essential
> because people who don't speak english at all might not be
> able to comprehend or (if their input system is sufficiently
> different) generate ascii/latin text.
>

I think that is not the point at all.  In general, anglophones don't want
URLs that are completely meaningless: domain names generally have meaning
and so do path names and file names (consider gemini://
gemini.circumlunar.space/docs/companion/robots.gmi, for example, which
tells you a lot about the document it identifies).  But if they are in the
wrong script, ??? ?? ??? ?? ?? ??? ?? ?????????? ?? ???? ?? ?????????.  In
addition, ?? k?nv?n??nz ?v tr?nzl?t?re???n ?r n?t n?s?s?rili k?ns?st?nt
bitwin pip?l or k?ntriz.

> The desire to be inclusive is good, but we are deferential
> to pretty recent concept/meme - the monolingual nation state,
> which is say 200 or 300 years old.

Nid yw pob gwlad yn defnyddio un iaith yn unig.  (Not all countries use
only one language).

> In this regard, having people learn a new language to interact
> with the internet isn't that much of an imposition, but a return
> to the way things were... just scaled up to the size of the planet.
>

In imperio Romanorum, facilis est negotiator Romanus quam Gallus sive
Germanus, because the Roman grew up knowing the language of trade.
Likewise the anglophone today.

> Isn't that yet another hint ? That the point of a language is to
> communicate, not to serve as a barrier, despite the machinations
> of nationalists ?
>

> 'Málin eru höfuðeinkenni þjóðanna' - Languages are the chief distinguishing
> marks of peoples. No people in fact comes into being until it speaks a
> language of its own; let the languages perish and the peoples perish too,
> or become different peoples. But that never happens except as the result of
> oppression and distress.'
> These are the words of a little-known Icelander of the early nineteenth
> century, Sjéra Tomas Sæmundsson. He had, of course, primarily in mind the
> part played by the cultivated Icelandic language, in spite of poverty, lack
> of power, and insignificant numbers, in keeping the Icelanders in being in
> desperate times. But the words might as well apply to the Welsh of Wales,
> who have also loved and cultivated their language for its own sake (not as
> an aspirant for the ruinous honour of becoming the lingua franca of the
> world), and who by it and with it maintain their identity.

--J.R.R. Tolkien, who was the furthest thing possible from either a
nationalist or an imperialist.  This is less true in Wales than it was when
Tolkien wrote it, but the point is the same.

Link to individual message.

57. spinner (gemini (a) stillspinning.cc)

📅 Sent: 2020-12-25 09:02
📧 Message 57 of 109

> I briefly experimented with percent-encoded Japanese and Norwegian
> addresses on some of my capsules, but quickly gave up and went back to
> pure ASCII. *Not* because typing in percent-encoded names was annoying,
> but because I realized how hard it was to verbally convey my Japanese
> addresses to my Norwegian friends and vice versa. The de facto
> universality of ASCII might be something to embrace, not something to run
> away from, if we want to be serious about being inclusive.

Verbally conveying addresses doesn't seem like a situation to optimize for;
doesn't seem to happen so often, at least in my life as a Japanese-speaking
internet user. Even among such occasions among future gemininauts, I
conjecture that, most of the time, both parties will speak Japanese and the
address can be quickly spelled out in Japanese.

For end-users, reading, following and writing links probably will be the
most likely ways you interact with URLs.

1. Read/follow links with a user-friendly name/title: If the URL is
non-ascii: Encoding of the URL may not matter much, since it will be
hidden. If the client is capable of showing the URL upon focus or
something, showing it in unicode is far more accessible than
percent-encoding
2. Read/follow links with bare URL: If the URL is non-ascii: more
accessible to be able to read the URL in its non-ascii form
3. Write links to URLs that I control: More inclusive and convenient to be
able to use and write URLs using the script that I'm used to.
4. Write links to URLs that I don't control: It'll be more
accessible/convenient to be able to write the URL in non-ascii characters.
Copying a non-ascii URL off of a web browser's address bar will probably
percent-encode it (just tried it on desktop Chrome), but I shouldn't have
to rely on such tools.

While embracing ASCII may work when we have control over URLs we read and
write, it falls short in terms of accessibility when linking to, say,
Wikipedia, which uses non-ascii page names.

If the aim is to support i18n/inclusivity as a principle/ideal/a 100%
thing, adopting standards such as IRI/IDN(/ASCII) may make sense; if the
motivation is out of practical concerns (whether people will find
themselves reading and writing non-ascii URLs a lot and we want to make
their lives easier in that case), having clients percent-encode path
components before sending requests may suffice for now..?

From my standpoint, chances/expectations of a particular component of a URL
having non-ascii characters:

- protocol: none
- domain: 2% of the time (8.3 million IDNs [1] / total domain names 370.7
million [2]) - but, for me, nearly none in practice. I suppose it depends
on the person
- path/query/fragment: fairly often, since I use (Japanese) Wikipedia a lot

[1] https://idnworldreport.eu/ (2020 Q1)
[2] https://www.verisign.com/en_US/domain-names/dnib/index.xhtml (2020 Q3)

Link to individual message.

58. bie (bie (a) 202x.moe)

📅 Sent: 2020-12-25 12:07
📧 Message 58 of 109

On Fri, Dec 25, 2020 at 01:02:32AM -0800, spinner wrote:
> Verbally conveying addresses doesn't seem like a situation to optimize for;
> doesn't seem to happen so often, at least in my life as a Japanese-speaking
> internet user. Even among such occasions among future gemininauts, I
> conjecture that, most of the time, both parties will speak Japanese and the
> address can be quickly spelled out in Japanese.

Verbally conveying URLs and usernames is a situation I find myself in
at least monthly... even more often before COVID.
When both parties speak the same language, sure, it's more or less fine,
but trying to explain an address with a character the other party has no
idea how to input is an exercise in frustration.

> For end-users, reading, following and writing links probably will be the
> most likely ways you interact with URLs.
> 
> 1. Read/follow links with a user-friendly name/title: If the URL is
> non-ascii: Encoding of the URL may not matter much, since it will be
> hidden. If the client is capable of showing the URL upon focus or
> something, showing it in unicode is far more accessible than
> percent-encoding

Agreed. This can be handled *today* by clients with no change in the
protocol.

> 2. Read/follow links with bare URL: If the URL is non-ascii: more
> accessible to be able to read the URL in its non-ascii form

Agreed. Again, this can be handled *today* by clients with no change to
the protocol.

> 3. Write links to URLs that I control: More inclusive and convenient to be
> able to use and write URLs using the script that I'm used to.
> 4. Write links to URLs that I don't control: It'll be more
> accessible/convenient to be able to write the URL in non-ascii characters.

I'd actually say it's just *slightly* more convenient. In most cases
you'll be copying and pasting the URL. If the gemini community feels that a
breaking change that increases the complexity of implementing servers
*and* clients is worth it for this slight change in convenience... well,
like I mentioned earlier, that doesn't bode well for the future.

As for inclusivity/accessibility... I just don't buy it. Completely
non-technical people are going to be using tooling for writing gemtext
anyway - the rest are perfectly capable of percent-encoding if they
*really really* want to use non-ascii characters.

All that said, I'll make another attempt at leaving this discussion (and
the mailing list) again... hopefully a final decision will be made
soon ;)

bie

Link to individual message.

59. Stephane Bortzmeyer (stephane (a) sources.org)

📅 Sent: 2020-12-25 16:50
📧 Message 59 of 109

On Thu, Dec 24, 2020 at 12:48:50PM +0100,
 marc <marcx2 at welz.org.za> wrote 
 a message of 91 lines which said:

> In this regard, having people learn a new language to interact
> with the internet isn't that much of an imposition,

Especially if it is *my* script. Imagine a Chinese person asking that
all URIs be in Chinese characters because people can learn a new
script, after all. I bet that many proponents of ASCII URIs would not
be so happy.

Link to individual message.

60. Stephane Bortzmeyer (stephane (a) sources.org)

📅 Sent: 2020-12-25 16:53
📧 Message 60 of 109

On Thu, Dec 24, 2020 at 10:49:21PM +0900,
 bie <bie at 202x.moe> wrote 
 a message of 48 lines which said:

> but because I realized how hard it was to verbally convey my
> Japanese addresses to my Norwegian friends and vice versa.

I don't see the point, anyway. If the address (the URI) uses the
Japanese writing, it is probably because the content is in Japanese
and/or is interesting only for people who are in Japan. Therefore
either your Norwegian friend is in one of these two cases, or you
wouldn't tell him/her the address, anyway.

> The de facto universality of ASCII

No, the latin script (and even more the ASCII character set) is not
universal (even if it would be simpler for me).

Link to individual message.

61. Stephane Bortzmeyer (stephane (a) sources.org)

📅 Sent: 2020-12-25 16:57
📧 Message 61 of 109

On Wed, Dec 23, 2020 at 09:26:24PM +0100,
 cage <cage-dev at twistfold.it> wrote 
 a message of 17 lines which said:

> > The individual path segments actually.
> >
> > So, given /Foo/Bar/Baz, decompose the path into individual
> > segments Foo, Bar, and Baz, encode these, and reconstruct the
> > path. Easy-peasy.

I don't think that percent-encoding has to be done per path segment. I
don't find anything in RFC 3986 that makes your algorithm
mandatory. "/" is a safe character, anyway so it seems to me that you
can percent-encode the entire path in one operation.
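
As a quick check of that claim, here is a small Python sketch comparing the two strategies with the standard `urllib.parse.quote`. The sample path is made up for illustration, and the comparison only holds when "/" is kept in the safe set for the whole-path call.

```python
from urllib.parse import quote

# Hypothetical path with non-ASCII characters in two segments.
path = "/caffè/menù.gmi"

# Encode the whole path in one operation, leaving "/" untouched.
whole = quote(path, safe="/")

# Encode each segment separately, then rebuild the path.
per_segment = "/".join(quote(segment, safe="") for segment in path.split("/"))
```

Both variables end up holding "/caff%C3%A8/men%C3%B9.gmi". The two strategies would only diverge if "/" were dropped from the safe set in the one-shot call, since the separators themselves would then be encoded as "%2F".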

Link to individual message.

62. cage (cage-dev (a) twistfold.it)

📅 Sent: 2020-12-25 19:55
📧 Message 62 of 109

On Fri, Dec 25, 2020 at 05:57:56PM +0100, Stephane Bortzmeyer wrote:

Hi!

> On Wed, Dec 23, 2020 at 09:26:24PM +0100,
>  cage <cage-dev at twistfold.it> wrote
>  a message of 17 lines which said:
>
> > > The individual path segments actually.
> > >
> > > So, given /Foo/Bar/Baz, decompose the path into individual
> > > segments Foo, Bar, and Baz, encode these, and reconstruct the
> > > path. Easy-peasy.
>
> I don't think that percent-encoding has to be done per path segment. I
> don't find anything in RFC 3986 that makes your algorithm
> mandatory. "/" is a safe character, anyway so it seems to me that you
> can percent-encode the entire path in one operation.

Please correct me if I am wrong, but does this mean that, given a path like:

"/è/à/c"

it is safe to send to the server

"%2F%C3%A8%2F%C3%A0%2Fc"

instead of

"/%C3%A8/%C3%A0/c"

I can see that percent-decoding both strings above returns the
same result: the first path. Could this be the reason why no
splitting is needed?
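
A small Python sketch of the comparison, offered as an illustration only (the helper name `segments` is hypothetical). It shows that the two strings decode to the same text, but stop matching once a receiver splits on literal "/" before decoding. RFC 3986 treats "/" as a reserved delimiter, and a reserved character and its percent-encoded form are not equivalent.

```python
from urllib.parse import unquote

fully_encoded = "%2F%C3%A8%2F%C3%A0%2Fc"
segment_encoded = "/%C3%A8/%C3%A0/c"

# Decoding each string as a whole hides the difference.
same_text = unquote(fully_encoded) == unquote(segment_encoded)


def segments(path: str) -> list:
    """Split on literal "/" first, then percent-decode each segment."""
    return [unquote(part) for part in path.split("/")]


one_segment = segments(fully_encoded)
four_segments = segments(segment_encoded)
```

With this split-then-decode order, the first string is a single segment whose name happens to contain slashes, while the second is an empty root segment followed by three segments. A server that routes on path segments could therefore treat them differently.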

Bye!
C.

Link to individual message.

63. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-25 21:54
📧 Message 63 of 109

> On Dec 25, 2020, at 13:07, bie <bie at 202x.moe> wrote:
> 
> All that said, I'll make another attempt at leaving this discussion (and
> the mailing list) again...

Still don't get it. 

You have a perfectly functional gemini server, written in C, using but the 
OpenBSD base system. Fabulous.

Furthermore, you sail through your English-Japanese-Norwegian workflow by 
the simple expedient of transliterating  all identifiers to US-ASCII, ala 
Unidecode!. Terrific.

To top it all, you can dictate the resulting identifiers, in plain 
English, over a rotary phone line to your trilingual Japanese-Norwegian 
friends. Much excellent.

All in all, everything is covered. Nothing to add. Nothing to take away. All set.

If tomorrow, Gemini adopts IRIs, nothing changes for you. Your setup is 
fully upward compatible. You do not have to lift a finger to keep going. 

All stays exactly the same for you.

Of course, no one on your setup can use IRIs. Only URIs. But they don't 
want IRIs anyway. No loss.

Arguably, your setup may not be fully compliant with the letter of the 
spec. No big deal. 99% there. No one is going to sue you. Just a hobby. 

But it's working. Today. For your needs.

On the other hand, it doesn't work for me. 

I do not like transliteration. I want native.

I want my Kabuki file to be named 👹.gmi, and not kabuki.gmi, nor
xn--7q8h.gmi, nor %F0%9F%91%B9.gmi. Nor any other weird encodings. 👹.gmi it is.

I do not want to type 👹.gmi. I want to copy & paste. I do not type
identifiers by hand, nor do I dictate them over the phone. Ever. It's
error prone. And annoying.

But that's me.

I do not want to be dragged to the lowest of the lowest common denominator 
just because you cannot be bothered to support Unicode.

But that's just me.

What I want is Unicode. Because I like to name my file 👹.gmi. It's 2020.
And it's important to me.

Moving to IRIs allows me to use Unicode file names. While not breaking 
anything on your side.

Staying with URIs prevents me from doing what I want. While not changing 
anything for you.

Why do you want to prevent me from using the names I want? 

I do not tell you how to name your files.

Why do you want to tell me?

This could be construed as rude.

	Unidecode! -- plain ASCII transliterations of Unicode text, Sean M. Burke, 2001

Link to individual message.

64. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-25 22:03
📧 Message 64 of 109

> On Dec 25, 2020, at 22:10, Stephane Bortzmeyer <stephane at sources.org> wrote:
> 
> On Fri, Dec 25, 2020 at 10:07:37PM +0100,
> Petite Abeille <petite.abeille at gmail.com> wrote 
> a message of 8 lines which said:
> 
>>> I don't think that percent-encoding has to be done per path segment.
>> 
>> Reserved Characters gen-delims "/"
> 
> So?

You are not meant to encode the path separator if you would like to 
preserve the path semantic. If you do, you turn the entire path into one 
segment. Which is certainly not the desired effect most of the time.

Link to individual message.

65. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-25 22:16
📧 Message 65 of 109

> On Dec 25, 2020, at 20:55, cage <cage-dev at twistfold.it> wrote:
> "/è/à/c"
> 
> it is safe to send to the server
> 
> "%2F%C3%A8%2F%C3%A0%2Fc"
> 
> instead of
> 
> "/%C3%A8/%C3%A0/c"

Those are two different paths. 

The first one has one segment, with encoded separators. The second one has 
3 segments, properly encoded. Which matches the semantic of your original 
path, which sports 3 segments: è, à, and c.
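
To make the difference concrete, here is a small Python sketch. It uses the standard library's urllib.parse, which is my choice for illustration and not something anyone in the thread used. It encodes the same path both ways and counts the segments a receiver would see if it splits on "/" before decoding.

```python
from urllib.parse import quote, unquote

path = "/\u00e8/\u00e0/c"  # the path "/è/à/c" from the question

# Encode everything, including the "/" separators.
whole = quote(path, safe="")
# Encode each segment on its own and keep "/" as the separator.
per_segment = "/".join(quote(seg, safe="") for seg in path.split("/"))

print(whole)        # %2F%C3%A8%2F%C3%A0%2Fc
print(per_segment)  # /%C3%A8/%C3%A0/c

# Both strings decode to the same text...
print(unquote(whole) == unquote(per_segment))  # True

# ...but splitting on "/" before decoding gives different structures.
print(len(whole.split("/")))        # 1
print(len(per_segment.split("/")))  # 4 (a leading empty string, then 3 segments)
```

Whether the two forms end up naming the same resource therefore depends on whether the receiver splits before or after decoding.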

Link to individual message.

66. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-25 22:30
📧 Message 66 of 109

> On Dec 24, 2020, at 12:48, marc <marcx2 at welz.org.za> wrote:
> 
> deferential to pretty recent concept/meme - the monolingual nation state

( wat? )

If I wish, out of juvenile impertinence, to name my file 🖕.gmi then I 
should be able to do so without further ado. 

I would find it patronizing to be forced to type %F0%9F%96%95.gmi .

Link to individual message.

67. spinner (gemini (a) stillspinning.cc)

📅 Sent: 2020-12-25 22:49
📧 Message 67 of 109

> > 4. Write links to URLs that I don't control: It'll be more
> > accessible/convenient to be able to write the URL in non-ascii
> > characters.
>
> I'd actually say it's just *slightly* more convenient. In most cases
> you'll be copying and pasting the URL.

I realized the original list missed one distinction: reading links using a
client as a reader vs reading links using an editor as a content author. So
the initial authoring may be done through copy-pasting, but revisiting that
piece afterwards can leave you unsure about exactly which URL a link is
pointing to, if it's all percent-encoded.

Link to individual message.

68. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-25 23:09
📧 Message 68 of 109



> On Dec 22, 2020, at 16:13, Solderpunk <solderpunk at posteo.net> wrote:
> 
> Okay, I'm finally getting involved in this discussion.

Thought exercise:

Each time you see URI, replace it with MORSE CODE.
Each time you see IRI, replace it with ASCII.

Debriefing at noon.

Link to individual message.

69. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-25 23:20
📧 Message 69 of 109

> On Dec 23, 2020, at 03:54, bie <bie at 202x.moe> wrote:
> 
> doesn't have to be able to parse JPEGs to serve images.

( wat? )

Presently text/gemini mandates 3 different encodings in a link:

=> gemini://punycode/url-encoded utf8

3 different encodings in one line.

3 in 1.

Moving to IRIs cleans this up to 1 encoding, utf8.

1 in 1.

Link to individual message.

70. Leo (list (a) gkbrk.com)

📅 Sent: 2020-12-25 23:29
📧 Message 70 of 109

Just want to let you know that your email client does not properly send
an In-Reply-To header and breaks threading.

--
Leo

Link to individual message.

71. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-25 23:33
📧 Message 71 of 109

> On Dec 26, 2020, at 00:29, Leo <list at gkbrk.com> wrote:
> 
> Just want to let you know that your email client does not properly send
> an In-Reply-To header and breaks threading.

This is not my experience, but I would be happy to be proven wrong, and 
fix it if necessary.

Could you be more specific? What does break, exactly?

Link to individual message.

72. Omar Polo (op (a) omarpolo.com)

📅 Sent: 2020-12-26 00:28
📧 Message 72 of 109

Petite Abeille <petite.abeille at gmail.com> writes:

>> On Dec 23, 2020, at 03:54, bie <bie at 202x.moe> wrote:
>> 
>> doesn't have to be able to parse JPEGs to serve images.
>
> ( wat? )
>
> Presently text/gemini mandates 3 different encoding in a link:
>
> => gemini://punicode/url-encoded utf8
>
> 3 different encodings in one line.
>
> 3 in 1.
>
> Moving to IRI clean this up to 1 encoding, utf8.
>
> 1 in 1.

Not really.  I don't know basically anything about punycode so I can't
comment on that, but IRI allows percent encoding too.

Link to individual message.

73. Omar Polo (op (a) omarpolo.com)

📅 Sent: 2020-12-26 00:32
📧 Message 73 of 109

bie <bie at 202x.moe> writes:
>
> You're kind of correct in the sense that if we just treat the request as
> arbitrary bytes and not as an IRI (no validation, no handling at all),
> it's simple, but I don't think that's the right way to look at this
> issue. Instead, it's about the complexity of proper URI handling vs
> proper IRI handling. Not to mention that IRIs can still have
> percent-encoded characters!

Sorry if it took long for the reply, but I took some time to fix up my
server and now here I am :)

Originally, when I wrote my server I did a really simple routine to
extract the path from a url and that's it.  (plus minor checking)  This
wasn't good, of course.

In the last two days I took the time to write first a proper URL
parser[0], and then to extend it to support IRIs[1].  Turns out, once
you have a URL parser (not hard to do at all), you almost have a
complete IRI parser.  As Sean wrote, you basically have to replace the
unreserved rule to allow other utf8 characters and you're done.  And
even if you're uncomfortable doing this, the RFC lists the valid ranges,
so adding a couple of checks isn't the end of the world (if you want to
be 100% compliant, whatever that means).

(And all of this comes from someone who has never, ever, implemented an
IRI/URI parser before, who read RFC 3986 for the first time
while writing the code and has successfully -- I believe -- implemented
a full IRI parser in less than 500 lines of C, with comments and
everything, without using anything other than the standard library.
Heck, the parser doesn't even allocate memory.)
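
As a rough illustration of "replace the unreserved rule" (this is not the code from the commits linked below, just a Python sketch), the only new piece is a range check against the ucschar ranges that RFC 3987 lists. The function and constant names here are made up for the example.

```python
import string

# RFC 3986: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

# RFC 3987: ucschar ranges (the non-ASCII characters an IRI may carry as-is).
UCSCHAR_RANGES = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    # Planes 1 to 13: %x10000-1FFFD up to %xD0000-DFFFD.
    + [(plane << 16, (plane << 16) + 0xFFFD) for plane in range(1, 14)]
    + [(0xE1000, 0xEFFFD)]
)

def is_unreserved(ch: str) -> bool:
    """URI rule: ASCII letters, digits and a few marks."""
    return ch in UNRESERVED

def is_iunreserved(ch: str) -> bool:
    """IRI rule: the URI rule plus the ucschar ranges."""
    cp = ord(ch)
    return is_unreserved(ch) or any(lo <= cp <= hi for lo, hi in UCSCHAR_RANGES)

print(is_unreserved("\u00e8"), is_iunreserved("\u00e8"))  # False True
```

The rest of a URI parser (scheme, authority, path splitting) can stay as it is under this approach; only the character class widens.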

> After thinking about this for a while, the biggest issue for me is that
> this is a breaking change. Breaking in the sense that it breaks *every
> single compliant server we already have*! If gemini, which has been
> surprisingly good at resisting breaking spec changes, accepts this, I
> don't see any reason to believe that it won't happen again and again,
> for equally silly reasons.
>
> bie

I don't buy this argument.  It's not like tomorrow we won't be able to
browse gemini unless we update clients/servers.  Valid URI are also
valid IRI, so it's not an armageddon.  The whole thing started (IIRC)
because the spec says "UTF8 URI".  Furthermore, the spec isn't finalised
yet (see for instance the change regarding full url vs relative ones in
the requests).

If you wrote your server for you, you probably won't need to change
anything: from what you wrote, I assume you're serving only files whose
names are ASCII only, so unless you want to host things with funny
names, you're probably good.

Anyway, sorry for the long reply, I didn't want to drag this discussion
too much, really.  Let's see what will be decided :)

Cheers!

[0]:
https://github.com/omar-polo/gmid/commit/33d32d1fd66a577f22f3f33f238e8dac44ec9995
[1]: https://github.com/omar-polo/gmid/commit/df6ca41da36c3f617cbbf3302ab120721ebfcfd2

Link to individual message.

74. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-26 00:32
📧 Message 74 of 109



> On Dec 26, 2020, at 01:28, Omar Polo <op at omarpolo.com> wrote:
> 
> Not really.  I don't know basically anything about punycode so I can't
> comment on that, but IRI allows percent encoding too.

Surprise me. Here is my IRI: 

gemini://🎭/👹.gmi 

Show me your URI, and then justify why it is a good thing for me.

Link to individual message.

75. Omar Polo (op (a) omarpolo.com)

📅 Sent: 2020-12-26 00:41
📧 Message 75 of 109

Petite Abeille <petite.abeille at gmail.com> writes:

>> On Dec 26, 2020, at 01:28, Omar Polo <op at omarpolo.com> wrote:
>> 
>> Not really.  I don't know basically anything about punycode so I can't
>> comment on that, but IRI allows percent encoding too.
>
> Surprise me. Here is my IRI: 
>
> gemini://🎭/👹.gmi 
>
> Show me your URI, and then, justify why it a good thing for me.

Sorry, I wasn't saying that gemini://🎭/%F0%9F%91%B9.gmi [0] is better
than gemini://🎭/👹.gmi (it is not).  Rather, that even with
IRIs you don't want to delete the percent-decoding code in your parser.

[0]     idn2 refuses to punycode 🎭 :/

Link to individual message.

76. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-26 00:47
📧 Message 76 of 109



> On Dec 26, 2020, at 01:41, Omar Polo <op at omarpolo.com> wrote:
> 
> Sorry

Let's recap:

IRI gemini://🎭/👹.gmi 
URI gemini://xn--el8h/%F0%9F%91%B9.gmi

? reserved characters which always needs to be encoded, irrespectively of 
any other consideration.
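
For anyone who wants to reproduce that mapping, here is a minimal Python sketch. It is an illustration under assumptions, not a compliant converter: `iri_to_uri` is a name made up for this example, it applies raw Punycode to each non-ASCII host label without any IDNA validation (which is presumably why it accepts a label that idn2 refuses), and it assumes the input path contains no percent-escapes already.

```python
from urllib.parse import quote

def iri_to_uri(iri: str) -> str:
    """Naive IRI-to-URI mapping: punycode the host, percent-encode the path."""
    scheme, rest = iri.split("://", 1)
    host, slash, path = rest.partition("/")
    labels = []
    for label in host.split("."):
        if label.isascii():
            labels.append(label)
        else:
            # Raw Punycode plus the "xn--" prefix; no IDNA checks are done.
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    # Keep "/" as the separator; everything non-ASCII becomes %XX (UTF-8).
    return scheme + "://" + ".".join(labels) + slash + quote(path, safe="/")

# U+1F3AD is the host, U+1F479 the file name, as in the recap above.
print(iri_to_uri("gemini://\U0001F3AD/\U0001F479.gmi"))
# gemini://xn--el8h/%F0%9F%91%B9.gmi
```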

Link to individual message.

77. Jason McBrayer (jmcbray (a) carcosa.net)

📅 Sent: 2020-12-26 02:33
📧 Message 77 of 109

bie <bie at 202x.moe> writes:

> After thinking about this for a while, the biggest issue for me is
> that this is a breaking change. Breaking in the sense that it breaks
> *every single compliant server we already have*!

I think that's a little dramatic. Looking at my server, I need to make a
change in exactly one place: when mapping IRI paths to file paths, I
can no longer use the url decoding library I was using to decode URI
paths, because it mangles Unicode characters. But since I can now be
sure that only IRI reserved characters are encoded, I can just do a
simple substring substitution. It's also a change that is
backward-compatible with old clients.

-- 
Jason McBrayer      | “Strange is the night where black stars rise,
jmcbray at carcosa.net | and strange moons circle through the skies,
                    | but stranger still is lost Carcosa.”
                    | -- Robert W. Chambers, The King in Yellow

Link to individual message.

78. Steve Phillips (steve (a) tryingtobeawesome.com)

📅 Sent: 2020-12-26 02:54
📧 Message 78 of 109

> gemini://?/?.gmi

Must we permit question marks in Gemini domains at all?

Link to individual message.

79. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-26 11:01
📧 Message 79 of 109



> On Dec 26, 2020, at 03:54, Steve Phillips <steve at tryingtobeawesome.com> wrote:
> 
> Must we permit question marks in Gemini domains at all?


-------------- next part --------------
A non-text attachment was scrubbed...
Name: nq050816.gif
Type: image/gif
Size: 29260 bytes
Desc: not available
URL: <https://lists.orbitalfox.eu/archives/gemini/attachments/20201226/0285
5a15/attachment-0001.gif>

Link to individual message.

80. Solderpunk (solderpunk (a) posteo.net)

📅 Sent: 2020-12-26 11:08
📧 Message 80 of 109

On Sat Dec 26, 2020 at 1:32 AM CET, Omar Polo wrote:

> In the last two days I took the time to write first a proper URL
> parser[0], and than extending it to support IRIs[1]. Turns out, once
> you have a URL parser (not hard to do at all), you almost have a
> complete IRI parser. As Sean wrote, you basically have to replace the
> unreserved rule to allow other utf8 characters and you're done. And
> even if you're uncomfortable doing this, the RFC lists the valid ranges,
> so adding a couple of checks isn't the end of the world (if you want to
> be 100% compliant, whatever that means).
>
> (And all of this comes from one that has never, ever, implemented a
> IRI/URI parser before, that has read for the first time the rfc3986
> while writing the code and has successfully -- I believe -- implemented
> a full IRI parser in less than 500 lines of C, with comments and
> everything, without using anything other than the standard library.
> Heck, the parser doesn't even allocates memory.)

This is, more and more, how I'm conceptualising things.
Parsing/validating IRIs is not actually remotely difficult at all.
Algorithmically it's an extremely minor change to parsing/validating
URIs.  The apparent pain exists only because the world has apparently
been very slow about packaging code up for this into major
libraries/languages, probably because HTTP's ASCII-only nature reduces
demand.  If we adopt IRIs, I would actually encourage Gemini software
authors who find their language lacking tools for this not to write
custom code for it that lives only in their software, but to actually
try to get the functionality accepted upstream into standard libraries,
or widely used third-party libraries.  This is generally useful
functionality that's in no way Gemini-specific, and having easy support
for it everywhere makes the world a better place regardless of whether
Gemini thrives or declines.

I don't really think the alleged difficulty of handling IRIs is a good
argument against accepting them.  I'm now more interested in
learning/thinking about normalisation issues, which have been relatively
under discussed so far.  It's possible this is where the real trouble
lies.  Breaking a UTF-8 IRI up into (scheme, authority, path) is not a
substantial hurdle.

Cheers,
Solderpunk

Link to individual message.

81. Solderpunk (solderpunk (a) posteo.net)

📅 Sent: 2020-12-26 11:19
📧 Message 81 of 109

On Thu Dec 24, 2020 at 12:48 PM CET, marc wrote:

> Note how the global telephone system has made it into the furthest
> corners of the planet - arguably further than the internet, and did
> so without worrying about internationalisation relating to their
> URL equivalents (phone numbers)...

This is not really a compelling comparison at all.  Even if different
languages and cultures use different words and symbols for numbers, the
overwhelming majority of them use base 10, meaning there is a
straightforward and unambiguous mapping between them all.  Almost
anybody can read/write and say/hear a phone number in their native
language, making it much easier to memorise and transmit them.  I don't
know if it was ever done (I wouldn't be at all surprised if it was), but
it would be no technical problem at all to manufacture either a DTMF or
rotary phoneset which had ?, ?, ?, etc. printed on it instead of 1,
2, 3 and have it work correctly anywhere on Earth.  Even if this couldn't
be done and people had to learn a foreign system of numeric symbols,
the fact that there's only 10 of them and that they map directly to
native equivalents makes them much easier to learn.  And, of course, the
Arabic numeral system was already widely used across many languages and
cultures before the phone system arrived, and people in all cultures
already had practice reading, writing and memorising numeric values for
many other reasons (calendars and money are ancient technology).

Cheers,
Solderpunk

Link to individual message.

82. cage (cage-dev (a) twistfold.it)

📅 Sent: 2020-12-26 12:10
📧 Message 82 of 109

On Fri, Dec 25, 2020 at 11:16:08PM +0100, Petite Abeille wrote:

Hi!

>
> > On Dec 25, 2020, at 20:55, cage <cage-dev at twistfold.it> wrote:
> > "/è/à/c"
> >
> > it is safe to send to the server
> >
> > "%2F%C3%A8%2F%C3%A0%2Fc"
> >
> > instead of
> >
> > "/%C3%A8/%C3%A0/c"
>
> Those are two different paths.
>  The first one has one  segment, with encoded separators. The second
> one has 3 segments, properly  encoded. Which matches the semantic of
> your original path, which sport 3 segments è, à, and c.

This makes sense to me!

Thanks!
C.

Link to individual message.

83. Solderpunk (solderpunk (a) posteo.net)

📅 Sent: 2020-12-26 12:12
📧 Message 83 of 109

On Thu Dec 24, 2020 at 3:29 PM CET, Stephane Bortzmeyer wrote:

> Once the specification is "officially" "final",
> this will be different. AFAIK, it is not the case (otherwise, what
> would be the point of the [spec] topic?)

Anybody could be forgiven for not inferring it from actually looking at
what gets posted to [spec], but the main reason there's a venue for
discussing spec finalisation at all is that there are still lots of
things to be done at the level of "crossing t's and dotting i's".  For
example, somebody recently reminded me off-list that the spec is still
silent on the question of whether or not servers need to use TLS's
close_notify mechanism once they're done sending a response, or whether
it's okay to simply close the TCP connection.  Or see also the fragment
related stuff that people have been posting about.  Stuff like this, that
is to say small but important technical details and edge cases, is
actually what I consider the most important task of the [spec] topic.
This stuff ought to be finalised before a formal RFC-style specification
can be written up and potentially submitted to IETF.

The possible change to using IRIs is *by far* the most major change I
have considered making in probably a year.  I do not expect to ever
consider anything this large again, ever (meaning that the fear of
adopting IRIs being a slippery slope to more drastic changes in the
future is unfounded).  I'm taking my time on it because it's a major
change and because internationalisation is IMHO a very important issue,
but make no mistake - I cannot *wait* for it to be done so we can focus
on the smaller, hopefully much less contentious, details and get the
whole thing finalised.

I really consider Gemini 100% complete in terms of scope/capabilities.
People are doing wonderful things with it as is, it's basically
everything I ever dreamed of.  I am very ready to transition to spending
10 x more time and energy reading and writing Gemini content than
managing the protocol.

Cheers,
Solderpunk

Link to individual message.

84. bie (bie (a) 202x.moe)

📅 Sent: 2020-12-26 15:12
📧 Message 84 of 109

> This is, more and more, how I'm conceptualising things.
> Parsing/validating IRIs is not actually remotely difficult at all.
> Algorithmically it's an extremely minor change to parsing/validating
> URIs.  The apparent pain exists only because the world has apparently
> been very slow about packaging code up for this into major
> libraries/languages, probably because HTTP's ASCII-only nature reduces
> demand.  If we adopt IRIs, I would actually encourage Gemini software
> authors who find their language lacking tools for this not to write
> custom code for it that lives only in their software, but to actually
> try to get the functionality accepted upstream into standard libraries,
> or widely used third-party libraries.  This is generally useful
> functionality that's in no way Gemini-specific, and having easy support
> for it everywhere makes the world a better place regardless of whether
> Gemini thrives or declines.
> 
> I don't really think the alleged difficulty of handling IRIs is a good
> argument against accepting them.  I'm now more interested in
> learning/thinking about normalisation issues, which have been relatively
> under discussed so far.  It's possible this is where the real trouble
> lies.  Breaking a UTF-8 IRI up into (scheme, authority, path) is not a
> substantial hurdle.

This is enough of a decision for me, so I'm out. I'm not one to stand in
the way of "progress", however misguided, so I've taken down my 4 gemini
servers.

bie

Link to individual message.

85. Solderpunk (solderpunk (a) posteo.net)

📅 Sent: 2020-12-26 15:20
📧 Message 85 of 109

On Sat Dec 26, 2020 at 4:12 PM CET, bie wrote:

> This is enough of a decision for me, so I'm out. I'm not one to stand in
> the way of "progress", however misguided, so I've taken down my 4 gemini
> servers.

I'm, genuinely and sincerely, sorry to hear this.  Thanks for having run
them for the time you did.  The final decision is still to be made and
in the event that I end up backtracking on this, I hope you'll
reconsider.

Cheers,
Solderpunk

Link to individual message.

86. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-26 15:56
📧 Message 86 of 109



> On Dec 26, 2020, at 16:12, bie <bie at 202x.moe> wrote:
> 
> so I've taken down my 4 gemini servers.

This is your prerogative. Sounds like a tantrum though. Disappointing both 
ways. C'est la vie.

Link to individual message.

87. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-26 16:10
📧 Message 87 of 109



> On Dec 26, 2020, at 16:56, Petite Abeille <petite.abeille at gmail.com> wrote:
> 
> tantrum

rage-quit from 6 years old on:

rage-quit, verb, INFORMAL, US, angrily abandon an activity or pursuit that 
has become frustrating, especially the playing of a video game.

Now we know.

Link to individual message.

88. Solderpunk (solderpunk (a) posteo.net)

📅 Sent: 2020-12-26 16:12
📧 Message 88 of 109

On Sat Dec 26, 2020 at 5:10 PM CET, Petite Abeille wrote:

> > On Dec 26, 2020, at 16:56, Petite Abeille <petite.abeille at gmail.com> wrote:
> > 
> > tantrum
>
> rage-quit from 6 years old on:
>
> rage-quit, verb, INFORMAL, US, angrily abandon an activity or pursuit
> that has become frustrating, especially the playing of a video game.

People are free to shut down their servers whenever they want, whyever
they want.  There's no need to taunt them for it.  Please just let it
go.

Cheers,
Solderpunk

Link to individual message.

89. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-26 16:18
📧 Message 89 of 109

> On Dec 26, 2020, at 17:12, Solderpunk <solderpunk at posteo.net> wrote:
> 
> People are free to shutdown their servers whenever they want, whyever
> they want.  

Agree.

> There's no need to taunt them for it.

There is a qualitative difference in publicly "threatening" to do so: one 
can always move on quietly.

>  Please just let it go.

Water under the bridge.

Link to individual message.

90. Côme Chilliet (come (a) chilliet.eu)

📅 Sent: 2020-12-26 17:19
📧 Message 90 of 109

I'm pretty sure this is not true. In all cases (uri/iri), percent encoding 
is allowed for any character, so the server has to percent-decode path 
segments before using them to match a file.

Unless gemini goes for a more severe specification which only allows 
percent-encoding for reserved characters. But this may break a lot of client code.

Link to individual message.

91. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-26 19:06
📧 Message 91 of 109



> On Dec 26, 2020, at 18:19, Côme Chilliet <come at chilliet.eu> wrote:
> 
> percent-decode paths segment

Do try it and then report back your findings.

Link to individual message.

92. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-26 19:30
📧 Message 92 of 109



> On Dec 26, 2020, at 20:06, Petite Abeille <petite.abeille at gmail.com> wrote:
> 
>> On Dec 26, 2020, at 18:19, Côme Chilliet <come at chilliet.eu> wrote:
>> 
>> percent-decode paths segment
> 
> Do try it and then report back your findings.

For example, do try to roundtrip one path segment: "A/B Testing". 

Do show what happens in each case.

Link to individual message.

93. colecmac (a) protonmail.com (colecmac (a) protonmail.com)

📅 Sent: 2020-12-26 20:25
📧 Message 93 of 109

> Feedback welcome, especially if I've overlooked anything, which is
> certainly possible.  What I'd be most interested in hearing, at this
> point, is client authors letting me know whether the standard library
> in the language their client is implemented in can straightforwardly:
>
> 1. Parse and relativise URLs with non-ASCII characters (so, yes, okay,
>    technically not URLs at all, you know what I mean) in paths and/or
>    domains?
> 2. Transform back and forth between URIs and IRIs?
> 3. Do DNS lookups of IDNs without them being punycoded first?  You can
>    test this with räksmörgås.josefsson.org.

The main language I use for Gemini software is Go. My clients, Amfora and
gemget, are both programmed using Go, and they use Go's built-in URL
library, called "net/url".

This library cannot properly handle 1, 2, or 3. This is likely because the Go
stdlib is high quality, and appears to be coded to follow RFCs very strictly,
and the library was only designed to support URLs, and not IRIs.

For example, it will accept invalid characters in the path when parsing the
URL, but when converting it back into a string, it will percent-encode the
invalid characters. This does not happen with the query string, though.

The fact that paths and query strings are treated differently makes converting
IRIs to URIs not straightforward. And doing the reverse would require taking
the bits of the parsed URL and then decoding them compliantly, and then
stitching them together manually.

As for #3, the Go stdlib looks up the domain in the URL as-is, and will not
punycode anything. I have had to do it myself, which was annoying but not
super difficult. Amfora and gemget both have support for IDNs.
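
For comparison with another language (this is not how Amfora or gemget do it), the manual punycoding step looks roughly like this in Python. The sketch relies on Python's built-in "idna" codec, which implements the older IDNA 2003 rules, so treat the result as illustrative only. No DNS lookup is performed here; the ASCII form is what one would pass on to a resolver.

```python
# The IDN test host from the question, written with escapes for portability.
host = "r\u00e4ksm\u00f6rg\u00e5s.josefsson.org"

# Convert each non-ASCII label to its "xn--" form (IDNA 2003 in Python's codec).
ascii_host = host.encode("idna").decode("ascii")
print(ascii_host)

# The conversion is reversible, which is handy for showing the name to users.
print(ascii_host.encode("ascii").decode("idna") == host)  # True
```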

See the link below for how IDN support was added, if it's of interest.

https://github.com/makeworld-the-better-one/go-gemini/compare/a557676343c51dabbc7d5a112d38bb8095db94d7...2f79af7688e88942d0d51d6ed65617b68a91a733


I believe these difficulties have implications on whether or not IRIs should
be added to the spec, but I'd rather let this email and the facts of the matter
stand on their own.


makeworld

Link to individual message.

94. Dmitry Bogatov (gemini#lists.orbitalfox.eu#v1 (a) kaction.cc)

📅 Sent: 2020-12-26 20:38
📧 Message 94 of 109

On Fri, Dec 25, 2020 at 10:54:17PM +0100, Petite Abeille wrote:
> If tomorrow, Gemini adopts IRIs, nothing changes for you. Your setup is
> fully upward compatible. You do not have to lift a finger to keep going.
>
> All stays exactly the same for you.
>
> Of course, no one on your setup can use IRIs. Only URIs. But they don't
> want IRIs anyway. No loss.
>
> Arguably, your setup may not be fully compliant with the letter of the
> spec. No big deal. 99% there. No one is going to sue you. Just a hobby.

Been there, done that. Best viewed with browser %s.

A few more such "improvements", and a client that was a first-class citizen
becomes something like w3m or lynx on the modern web, and another dream is
ruined in the pursuit of aesthetics.

Link to individual message.

95. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-26 20:49
📧 Message 95 of 109

> On Dec 26, 2020, at 21:38, Dmitry Bogatov <gemini#lists.orbitalfox.eu#v1 at kaction.cc> wrote:
> 
> Been there, done that. Best viewed with browser %s.

Fair point. But this concerns a server. Not a client. The server in 
question will never handle Unicode identifiers. Nor does it need to. Zero 
practical impact.

Plus, really, we are already in the "best viewed with xyz" age. Compare 
and contrast, say, Lagrange [1] and Amfora [2]. They are both great. In their 
own different ways.

Are you suggesting a normative user experience?

[1] https://github.com/skyjake/lagrange
[2] https://github.com/makeworld-the-better-one/amfora

Link to individual message.

96. John Cowan (cowan (a) ccil.org)

📅 Sent: 2020-12-26 21:10
📧 Message 96 of 109

On Fri, Dec 25, 2020 at 11:58 AM Stephane Bortzmeyer <stephane at sources.org>
wrote:


> I don't think that percent-encoding has to be done per path segment. I
> don't find anything in RFC 3986 that makes your algorithm
> mandatory. "/" is a safe character, anyway so it seems to me that you
> can percent-encode the entire path in one operation.
>

Once any necessary punycoding has been done (look for // on the left and
either / or end-of-string on the right), the whole URI can have its
non-ASCII characters %-encoded all at once.

Link to individual message.

97. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-26 21:33
📧 Message 97 of 109

> On Dec 26, 2020, at 22:10, John Cowan <cowan at ccil.org> wrote:
> 
> the whole URI can have its non-ASCII characters %-encoded all at once

Right. But that was not Stephane's issue, which was how to encode the 
reserved gen-delims character "/" in a path.

Consider the following 3 path segments: "Research", "A/B Testing", "Results".

Stephane asserts the following encodings are equivalent:

Research%2FA%2FB%20Testing%2FResults

vs.
Research/A%2FB%20Testing/Results

They are clearly not. The first variant will result in one path segment, 
with data loss. While the second one will preserve the original semantic, 
with 3 segments, individually encoded, and intact.

They are not equivalent paths. Try it in your favorite library.
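
One way to try it, sketched in Python with the standard urllib.parse (the helper names `encode_path` and `decode_path` are made up for this example): encode each segment separately, then compare splitting before decoding with decoding before splitting.

```python
from urllib.parse import quote, unquote

def encode_path(segments):
    """Percent-encode each segment on its own, then join with "/"."""
    return "/".join(quote(seg, safe="") for seg in segments)

def decode_path(raw):
    """Split on the literal "/" first, then decode each segment."""
    return [unquote(seg) for seg in raw.split("/")]

segments = ["Research", "A/B Testing", "Results"]

raw = encode_path(segments)
print(raw)                           # Research/A%2FB%20Testing/Results
print(decode_path(raw) == segments)  # True: the round trip is lossless

# Decoding before splitting loses the distinction: 4 segments instead of 3.
print(unquote(raw).split("/"))       # ['Research', 'A', 'B Testing', 'Results']

# The fully encoded variant is a single segment.
print(decode_path("Research%2FA%2FB%20Testing%2FResults"))
# ['Research/A/B Testing/Results']
```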

Link to individual message.

98. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-26 22:05
📧 Message 98 of 109



> On Dec 26, 2020, at 21:25, colecmac at protonmail.com wrote:
> 
> This is likely because the Go stdlib is high quality, and appears to be
> coded to follow RFCs very strictly, and the library was only designed to
> support URLs, and not IRIs.

Would a strict ANTLR grammar for IRI help?

https://pkg.go.dev/bramp.net/antlr4/iri
https://github.com/antlr/grammars-v4/blob/master/iri/IRI.g4

Link to individual message.

99. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-26 22:26
📧 Message 99 of 109

> On Dec 26, 2020, at 21:25, colecmac at protonmail.com wrote:
> 
> appears to be coded to follow RFCs very strictly

Looking at url/url.go, the implementation seems more pragmatic than dogmatic:

https://github.com/golang/go/blob/master/src/net/url/url.go

A fascinating read. But a strict RFC grammar it's not. 

It has gone through a long and tumultuous history:

https://github.com/golang/go/commits/49a210eb87da6b7ac960cac990337ef4dc113b0d/src/net/url/url.go

Either way, would preprocessing an IRI into a URL help? Similarly to the Java situation.

Link to individual message.

100. Sean Conner (sean (a) conman.org)

📅 Sent: 2020-12-26 23:22
📧 Message 100 of 109

It was thus said that the Great Petite Abeille once stated:
> > On Dec 26, 2020, at 22:10, John Cowan <cowan at ccil.org> wrote:
> > 
> > the whole URI can have its non-ASCII characters %-encoded all at once
> 
> Right. But that was not Stephane's problem, which was related to how to
> encode the reserved gen-delims character "/" in a path.
> 
> Consider the following 3 path segments: "Research", "A/B Testing",
> "Results".
> 
> Stephane asserts the following encodings are equivalent:
> 
> Research%2FA%2FB%20Testing%2FResults
> 
> vs.
> Research/A%2FB%20Testing/Results
> 
> They are clearly not. The first variant will result in one path segment,
> with data loss, while the second one will preserve the original semantics,
> with 3 segments, individually encoded, and intact.
> 
> They are not equivalent paths. Try it in your favorite library.

  It was interesting to see the Go URL library you linked to.  For your two
examples, it will return the following structures:

	{
	  Path    = "Research/A/B Testing/Results",
	  RawPath = "Research%2FA%2FB%20Testing%2FResults",
	}

	{
	  Path    = "Research/A/B Testing/Results",
	  RawPath = "Research/A%2FB%20Testing/Results",
	}

and it's up to the client to check RawPath if it's *really* necessary to
make the distinction (meaning---the client *still* has to parse the path).

  A more normal example like "Research/ABTesting/Results" will result in:

	{
	  Path    = "Research/ABTesting/Results",
	  RawPath = "",
	}

so it's not like RawPath will always have the path.

  For the record, my own URL parsing library will just return 

	Research/A/BTesting/Results

for both samples.  I found it easier to work with that than what I was doing
at the time (pedantically correct, painfully hard to use in practice). You
would be hard pressed to actually create a file named "A/B Testing" on any
file system I know of (and not have it be "B Testing" in the "A" directory). 
If there *is* a file system that allows slashes in a filename (and not just
as a separator between directories) then I might revisit my decision, but until
then ...

  -spc


101. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-26 23:27
📧 Message 101 of 109



> On Dec 27, 2020, at 00:22, Sean Conner <sean at conman.org> wrote:
> 
> You would be hard pressed to actually create a file named "A/B Testing"
> on any file system I know of

There is more to life than a file system, a database for example. 

Let's not conflate the limitations of the two.


102. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-26 23:28
📧 Message 102 of 109



> On Dec 27, 2020, at 00:22, Sean Conner <sean at conman.org> wrote:
> 
>  For the record, my own URL parsing library will just return 
> 
> 	Research/A/BTesting/Results

Tragic. I take back my assessment of your LPEG grammar. It's clearly wrong. Oh well.


103. Sean Conner (sean (a) conman.org)

📅 Sent: 2020-12-26 23:52
📧 Message 103 of 109

It was thus said that the Great Petite Abeille once stated:
> 
> 
> > On Dec 27, 2020, at 00:22, Sean Conner <sean at conman.org> wrote:
> > 
> >  For the record, my own URL parsing library will just return 
> > 
> > 	Research/A/BTesting/Results
> 
> Tragic. I take back my assessment of your LPEG grammar. It's clearly
> wrong. Oh well.

  Okay, given your two examples:

	Research%2FA%2FB%20Testing%2FResults
	Research/A%2FB%20Testing/Results

what should a "proper" URL parser return?  And how should client code handle
such a construct?  Perhaps even attempt to write a URL (or IRI) parser
yourself?

  At one point, my URL parser would return the following for these:

	{
	  path =
	  {
	    "Research/A/B Testing/Results",
	  }
	}

	{
	  path = 
	  {
	    "Research",
	    "A/B Testing",
	    "Results",
	  }
	}

but I found working with such paths to be painful.  First off, how to
distinguish between

	Research/A%2FB%20Testing/Results

and

	/Research/A%2FB%20Testing/Results

  How would I specify that any URL with a path starting with "/foo" be
redirected to a path starting with "/bar"?

		/foo/this	-> /bar/this
		/foobar		-> /barbar

  And how would I deal with this in the code?

  Yes, you can say I ruined the purity of my URL parser with an ugly
pragmatic approach (keep the path as a decoded string and ignore the
semantics of encoded delims), but there's also the saying, "Perfect is the
enemy of good." [1]
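
  A small Python sketch of why the plain-string representation is convenient for the redirect rule above. The function name redirect_prefix is hypothetical and this is not the parser's actual code; it only assumes the path has already been decoded into a single string.

```python
def redirect_prefix(path, old="/foo", new="/bar"):
    """Rewrite a decoded path string that starts with `old`."""
    if path.startswith(old):
        return new + path[len(old):]
    return path

for p in ("/foo/this", "/foobar", "/other"):
    print(p, "->", redirect_prefix(p))
```

  With a list of segments, "/foobar" would not match a segment named "foo", so the same rule would need per-segment prefix logic instead of a single string comparison.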

  -spc

[1]	https://en.wikipedia.org/wiki/Perfect_is_the_enemy_of_good


104. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-27 00:08
📧 Message 104 of 109

> On Dec 27, 2020, at 00:52, Sean Conner <sean at conman.org> wrote:
> 
> "Perfect is the enemy of good."

Agree. My own parsers are on the pragmatic side of the spectrum (even
though path segments are preserved, as I tend to use databases rather than
file systems). I was hoping you were a better person than I am, to borrow your own line.

I suspect I should stop hoping for a full-fledged LPEG grammar for MIME
emerging from Conman's lab :/

Oh well. We are all flawed. Skynet will just crash and segfault. 

No one cares. Even on a mailing list dedicated to designing a protocol, 
one ends up being "pedantic". 

I now feel the same rage-quit as bie.

On the plus side, next time someone dares to mention any RFCs, just punch
them in the face. Life is too short.

Let's stop pretending.


105. Sean Conner (sean (a) conman.org)

📅 Sent: 2020-12-27 00:12
📧 Message 105 of 109

It was thus said that the Great marc once stated:
> >   I have to deal with the telephony network at work.  It *is* the OSI seven
> > layer burrito [1] and even *there* there are baked in assumptions relating
> > to i18n [2].  Text is limited to ASCII.  Yup.  7-bit US-ASCII in all its
> > glory.  Anything else requires some very nasty hacks.  
> 
> Note how the global telephone system has made it into the furthest
> corners of the planet - arguably further than the internet, and did
> so without worrying about internationalisation relating to their
> URL equivalents (phone numbers)...

  Phone numbers are their own special Hell [1].

  The point I was making is that yes, the SS7 protocol, used by telephone
companies around the world, isn't i18n clean.  And it's not like SS7 was
developed in the 1920s ... that's all I'm saying here.  I work for a company
that translates phone numbers (like 800-555-1212) to human readable names
(like "The ACME Company") for delivery to the cell phone receiving a phone
call (so intead of getting "800-555-1212" you get "The ACME Company").  It
was a tremendous amount of engineer to work around the SS7 limitations of 15
US-ASCII characters (and it's a hack really).

  But hey, it's US-ASCII only, so it's "simple" ...

  -spc

[1]	I have to deal with phone numbers as given to us by the Oligarchic
	Cell Phone Companies.  You would think that we would be given valid
	phone numbers as defined by them, but you would be wrong.  We get
	complete trash along with good.  And then my manager's manager wants
	us to pass along all invalid NANP [2] numbers along with the valid
	NANP numbers [3], while excluding all valid international numbers
	...

[2]	North America Numbering Plan, which includes the US, Canada and the
	Caribbean, but excludes Mexico and countries south of it.

[3]	Our product is only designed for the US.  This makes it interesting
	because Canada and the Caribbean aren't the US, but are part of the
	NANP, which means some "area codes" are actually "country codes" in
	disguise, but I digress ...


106. Sean Conner (sean (a) conman.org)

📅 Sent: 2020-12-27 00:26
📧 Message 106 of 109

It was thus said that the Great Petite Abeille once stated:
> > On Dec 27, 2020, at 00:52, Sean Conner <sean at conman.org> wrote:
> > 
> > "Perfect is the enemy of good."
> 
> Agree. My own parsers are on the pragmatic side of the spectrum (even
> though path segments are preserved, as I tend to use databases rather than
> file systems). 

  How do you preserve them?  As the encoded "%2F"?  Do you convert the
encoded values to uppercase?  Lowercase?  Keep them the same?

> I was hoping you were a better person than I am, to borrow
> your own line.

  But you said it yourself, you fall on the pragmatic side.  

> I suspect I should stop hoping for a full-fledged LPEG grammar for MIME
> emerging from Conman's lab :/

  Well, I do have one [1], although I'm not sure how "full-fledged" it is. 
I also lowercase the actual MIME type (so "TEXT/PLAIN" will become
"text/plain") to make it easier to use the results.

  I even have one for email [2], which can even parse RFC-822 style email
addresses [3], but I'm rethinking how I parse Internet messages as I'm not
entirely happy with my current approach.

> Oh well. We are all flawed. Skynet will just crash and segfault.
> 
> No one cares. Even on a mailing list dedicated to designing a protocol,
> one ends up being "pedantic".
> 
> I now feel the same rage-quit as bie.
> 
> On the plus side, next time someone dares to mention any RFCs, just punch
> them in the face. Life is too short.

  Life is too short to follow the WhatWG "standard" [4], so I guess it's a
"pick your poison" type situation.

> Let's stop pretending.

  Yeah, let's roll our own crypto and addressing scheme!  What can possibly
go wrong?

  -spc

[1]	https://github.com/spc476/LPeg-Parsers/blob/master/mimetype.lua

[2]	https://github.com/spc476/LPeg-Parsers/blob/master/email.lua

[3]	Muhammed.(I am  the greatest) Ali @(the)Vegas.WBA

[4]	https://url.spec.whatwg.org/#concept-url-parser


107. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-27 00:46
📧 Message 107 of 109

> On Dec 27, 2020, at 01:26, Sean Conner <sean at conman.org> wrote:
> 
> How do you preserve them?  As the encoded "%2F"?  Do you convert the
> encoded values to uppercase?  Lowercase?  Keep them the same?

The hex codes themselves? I tend to normalize them to uppercase. Tradition
or something. But perhaps we are talking about different things? Or?
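
A sketch of that normalization in Python, assuming a hypothetical helper named uppercase_escapes. Only the two hex digits of each escape are touched; the rest of the string is left as it is.

```python
import re

# Matches one percent-escape: "%" followed by exactly two hex digits.
_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")

def uppercase_escapes(s):
    """Uppercase the hex digits of every percent-escape in s."""
    return _ESCAPE.sub(lambda m: m.group(0).upper(), s)

print(uppercase_escapes("Research/A%2fB%20Testing/Results"))
# Research/A%2FB%20Testing/Results
```

RFC 3986 treats the hex digits of an escape as case-insensitive and recommends uppercase when normalizing, so "%2f" and "%2F" denote the same octet.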

>  But you said it yourself, you fall on the pragmatic side.  

Yes. Different pragmatism I guess. Strictly speaking it also depends on the
context. In general, fail-fast systems have better survivability odds. But
at times, one would rather extract as much as one can from imperfect data.
It depends on the problem at hand.

> Well, I do have one [1], although I'm not sure how "full-fledged" it is. 

Yes, but this only concerns itself with the content-type header. I mean 
MIME multipart constructs.

>  Life is too short to follow the WhatWG "standard" [4], so I guess it's a
> "pick your poison" type situation.

Fair enough.

> Yeah, let's roll our own crypto and addressing scheme!  What can possibly
> go wrong?

Now, that would be crazy :)


108. Sean Conner (sean (a) conman.org)

📅 Sent: 2020-12-27 01:19
📧 Message 108 of 109

It was thus said that the Great Petite Abeille once stated:
> 
> 
> > On Dec 27, 2020, at 01:26, Sean Conner <sean at conman.org> wrote:
> > 
> > How do you preserve them?  As the encoded "%2F"?  Do you convert the
> > encoded values to uppercase?  Lowercase?  Keep them the same?
> 
> The hex codes themselves? I tend to normalize them to uppercase. Tradition
> or something. But perhaps we are talking about different things? Or?

  I meant:  If I gave your URL parsers the string

	Research/A%2fB%20Testing/Results

what would I, as a user, get back?  Would I get a string back?  An array of
segments?  An actual example would be nice.

> > Well, I do have one [1], although I'm not sure how "full-fledged" it is. 
> 
> Yes, but this only concerns itself with the content-type header. I mean
> MIME multipart constructs.

  Ah.  See, I haven't needed that much functionality yet (and I suspect I
could use my email parsers for that if I really needed it).

  -spc

[1]	Missing footnote.


109. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-27 01:40
📧 Message 109 of 109

> On Dec 27, 2020, at 02:19, Sean Conner <sean at conman.org> wrote:
> 
>  I meant:  If I gave your URL parsers the string
> 
> 	Research/A%2fB%20Testing/Results
> 
> what would I, as a user, get back?  Would I get a string back?  An array of
> segments?  An actual example would be nice.

Ultimately, a list of path segments, yes. Similar to your first (correct) 
example. With both an absolute and directory indicator.

In the case above, 3 segments. Not absolute, not a directory. These
segments are then decoded to whatever string they represent, i.e.
segment[ 2 ] would contain the string "A/B Testing". As originally
provided. The URL can always round trip. If not, something is very wrong.

The same problem applies to, say, representing a URL within a URL.

So, given the following path segments, "cache", "gemini://host/path", and 
"content", the resulting path should be:

cache/gemini%3A%2F%2Fhost%2Fpath/content

And not:

cache%2Fgemini%3A%2F%2Fhost%2Fpath%2Fcontent

Which is clearly nonsensical.
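
A sketch of this representation in Python. The names Path, parse_path and format_path are hypothetical and only illustrate the idea of decoded segments plus absolute and directory flags, with each segment encoded on its own when the path is written back out. It assumes the standard library's urllib.parse.

```python
from dataclasses import dataclass
from typing import List
from urllib.parse import quote, unquote

@dataclass
class Path:
    segments: List[str]      # decoded segments; may contain "/"
    absolute: bool = False   # path started with "/"
    directory: bool = False  # path ended with "/"

def parse_path(raw: str) -> Path:
    absolute = raw.startswith("/")
    body = raw[1:] if absolute else raw
    directory = body.endswith("/")
    if directory:
        body = body[:-1]
    # Split on literal "/" first, then decode each segment on its own.
    segments = [unquote(s) for s in body.split("/")] if body else []
    return Path(segments, absolute, directory)

def format_path(p: Path) -> str:
    # safe="" makes quote() encode "/" and ":" inside a segment as well.
    body = "/".join(quote(s, safe="") for s in p.segments)
    return ("/" if p.absolute else "") + body + ("/" if p.directory else "")

p = parse_path("Research/A%2fB%20Testing/Results")
print(p.segments)      # ['Research', 'A/B Testing', 'Results']
print(format_path(p))  # Research/A%2FB%20Testing/Results
print(format_path(Path(["cache", "gemini://host/path", "content"])))
# cache/gemini%3A%2F%2Fhost%2Fpath/content
```

In this sketch the round trip holds up to the case of the hex digits, which come back in uppercase.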

> Ah.  See, I haven't needed that much functionality yet (and I suspect I
> could use my email parsers for that if I really needed it).

Yes, email.lua would handle a message/rfc822 part. It's a start :)

https://github.com/spc476/LPeg-Parsers/blob/master/email.lua

