Archived View for gemi.dev › gemini-mailing-list › 000557.gmi captured on 2024-06-16 at 13:42:37. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
Hi folks,

Okay, I'm finally getting involved in this discussion. Sorry it took me a while, and thanks for your patience. Here's a characteristically long email detailing how my thinking on this front has evolved in just the past few days, starting a new thread with the [spec] topic tag.

My a priori thoughts when it became clear that this discussion was turning into a major issue, but before I had delved into any details, were something like this:

"Good support for arbitrary languages in Gemini is *important* and worth putting up with a little bit of pain for. This is the reason the `lang` parameter was defined for the text/gemini media type, because a text encoding alone is not sufficient for a client to know to do what native speakers of some languages expect (like render text right to left). As weird and foreign as this stuff might seem to a lot of people, only one (English) of the ten most widely spoken languages in the world (and this doesn't change whether you count only native speakers or all speakers) can be properly represented in ASCII, so bailing on unicode support when it seems too hard is very hard to justify and we should try hard to do the right thing. That said, there obviously has to be an upper limit on complexity. Hopefully we can strike a good balance..."

At this point, I'll also add that it was obviously my intention from the very early days that internationalised URLs "just work" in Gemini. The clue to this is that the spec defines Gemini requests in terms of "UTF-8 encoded URLs". Now that I'm a little wiser about these things I realise that URIs (and hence URLs) by definition contain only characters which are encoded identically in UTF-8 and ASCII, so that "UTF-8 encoded URL", while not a contradiction of any sort, is not a particularly powerful concept and does nothing to achieve i18n. But I was certainly naively hoping that it did. In my ideal world, something like an IRI would absolutely work in Gemini with a minimum of fuss.
Anyhow, the other night I read RFC 3987. Not word for word, mind you, but more than a casual skim. At which point my thoughts became:

"Why on Earth is everybody on the ML banging on about punycode this and normalisation that? None of that would be relevant for Gemini. That complexity is only required to transform IRIs into URIs, which is a workaround for legacy software, document formats and protocols which can't handle IRIs directly. Gemini isn't legacy - if we did a `s/URL/IRL/g` on the spec, we could just pass around UTF-8 encoded IRLs without any of this hassle and things would just work. The spec already [somewhat mistakenly: see above] makes it clear that UTF-8 is to be expected in requests. This is a trivial change, not breaking at all, let's just do it. Of course, conversion of IDNs to punycode for the sake of DNS lookups would still be required because we can't change the reality of deployed DNS infrastructure, but it's insane to think this is the responsibility of every individual client author, it's up to operating systems and standard libraries to abstract this away. Surely they already do this? Let's check...

Python 3.7.3 (default, Apr 3 2019, 05:39:12)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> socket.getaddrinfo("räksmörgås.josefsson.org", 1965)
[(<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('2001:9b1:8633::102', 1965, 0, 0)),
 (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_DGRAM: 2>, 17, '', ('2001:9b1:8633::102', 1965, 0, 0)),
 (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_RAW: 3>, 0, '', ('2001:9b1:8633::102', 1965, 0, 0)),
 (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('178.174.241.102', 1965)),
 (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_DGRAM: 2>, 17, '', ('178.174.241.102', 1965)),
 (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_RAW: 3>, 0, '', ('178.174.241.102', 1965))]

Yep, great, wonderful, Python does the punycode stuff for me invisibly, this adds no extra complexity at all!"

At this point I was, in my private thoughts, a pretty hardcore IRI advocate, and didn't understand why anybody wouldn't be.

Then I did a little more experimenting and realised that DNS lookups in Go don't transparently handle the punycoding like Python does, and I was quite disappointed in Go for that.

Then I started reading through all the mailing list posts, and realised that people weren't even upset so much about punycoding IDNs as they were about processing IRIs to e.g. absolutise relative IRIs or add queries. This was considered to require complex third party libraries in most languages. I was kind of baffled by this because doing this kind of operation with IRIs is not substantially different from doing it with URIs (as Sean has shown by actually implementing it) and I couldn't believe that something so trivial wouldn't be well handled by standard libraries in 2020 (and, actually, based on some people's posts to the ML it seems like it often is). At this point my attitude became:

"Wow, the uptake of these standardised i18n tools in major programming languages is nothing short of embarrassing. I would be in favour of defining Gemini as using IRLs not URLs, but when e.g.
clients written in Go fail to "just work" with these, we do not blame the client authors and ask them to move mountains to work around the deficiencies of their standard libraries, but blame the language implementers. Over time, surely, the existing DNS and URI libraries will all be updated to follow the new standards, and those "broken" clients will suddenly become "working" clients without their authors even having to do anything. It's unfortunate that there will be a transitional period where the Gemini spec is somewhat "aspirational" and some clients necessarily fall short due to the failings of others, but that's better than leaving things as is and having Gemini be forever broken with regards to internationalisation."

Then I followed the mailing list threads yet deeper, and reached the point where Jason pointed out that RFC 3987 is only a proposed standard, that it has effectively been abandoned by the IETF, and that now the W3C has its own alternative standardisation of "URL" under active development, which is "extremely WWW-centric" (I'm taking Jason's word for this, I haven't actually looked into the details of this yet). This completely undermines my attitude above, because it makes it much less likely that standard libraries will ever be uniformly upgraded to handle IRIs correctly, and it means we can't take the simple moral high ground of saying that the Gemini spec is based on IETF standards and it's not our fault if standard libraries still need to lift their game to reflect those standards.

Now I honestly don't know what to think. It has always been a core tenet of Gemini's design that it is made by joining together mature, widely-implemented IETF standards in simple ways, so that no heavy lifting is required to build Gemini software in almost any language on almost any platform because all the parts are "radically familiar". I'm very reluctant to move away from that ideal, it's one of our core strengths.
But I also think localisation is important and, within reason, I buy the argument that there's a moral obligation to at least seriously try to fix this, and the fact that other technology stacks like the web have not is no excuse for us to do the same when we have the opportunity to make a fresh start.

But these two principles are in hard conflict. There apparently *are* no mature, widely-implemented IETF standards to handle non-ASCII URLs. This sucks, and I really wish it were otherwise, but I (and we, the Gemini community) are, realistically, absolutely powerless to change this, no matter how much we might like to.

But, *something* has to be decided. All we can really do is be pragmatic: consider how much pain is required to get some support for internationalised addressing into Gemini, and consider who has to bear that pain. Ideally, we try to minimise the total amount of pain, and preferentially inflict more pain on software authors than on content authors (who are not necessarily developers or even "power users"), and more pain on server authors than on client authors (it's of more benefit to more people for it to be easy to roll your own client than for it to be easy to roll your own server).

The options, then, would appear to be:

1. Nothing changes in the spec (except we remove the language about "UTF-8 encoded URLs" because this is, frankly, a recipe for misunderstanding). Gemini runs entirely on URLs using only a subset of ASCII. Clients and servers are permitted to be highly "dumb" in this regard, and no existing software breaks. Ultimate responsibility for internationalised links falls to content authors, who are obligated to fully punycode and percent-encode all their links so they are valid URIs, and if they do this wrong their links don't work and it's nobody's fault but their own, and if they don't understand what any of that even *means* they are forced to use ASCII URLs instead.
Client authors who want to be i18n friendly can visually present these links as IRIs if they're up to it, and accept IRIs in the address bar (or equivalent) and encode them before doing name lookups or sending requests. This voluntary extra complexity requires being able to do punycoding and percent-encoding in both the forward and backward directions.

2. We stick to ASCII-only URLs in Gemini requests, but allow IRIs in text/gemini and require all clients to be able to suitably encode IRIs before doing name lookups or sending requests, and to accept IRIs in the address bar. Content authors just write their content in their ordinary editor in an ordinary human-readable way without knowing what punycode or percent-encoding are. All client authors need to be able to do punycoding and percent-encoding in the forward direction only. If no standard library support for this is available, these operations need to be done from scratch.

3. We treat RFC 3987 as a first-class entity in our world, even if the IETF has abandoned it. IRIs are used everywhere, in text/gemini documents and in requests. Nobody ever has to do percent encoding in any direction (beyond what is already required for standard URIs). The forward punycoding requirement remains as per 2. above. However, instead of having to do forward percent encoding, clients now need to be able to do things like absolutise relative IRIs. If no standard library support for this is available, this needs to be done from scratch - although, note that if standard library support for percent encoding forward and backwards is present, then the standard library support for absolutising ASCII URLs, which we are basically already assuming is present everywhere, is sufficient to build this up, so this is not anywhere near as scary as it seems. There's an additional wrinkle here in that unicode normalisation needs to be consistent between e.g. the client and server's idea of the domain name.
This could, I think, be made entirely the server's responsibility, by requiring servers to normalise requests in a particular way.

Obviously, option 1. is preferable from the point of view of a spec author or a software implementer, but it has to be acknowledged that it throws international content authors under the bus (it's true there are such authors on the ML who are happily doing exactly what this option requires, but we need to acknowledge that people who can converse in technical English about protocol design on a mailing list are not a representative sample!). From the point of view of international content authors, 2. and 3. are equivalent. It's true that this problem could be minimised by the availability of servers which transform text/gemini content on the fly, and it's true that historically I've been happiest dumping extra complexity on server authors, but I'm not sure this is ideal - users might move their content between hosts and suddenly have their links break, which will seem mysterious to them.

Regarding options 2. and 3., from a strictly conceptual/aesthetic perspective, 3. is clearly preferable. It's much nicer not to have to map back and forth between a user's perspective of what addresses look like and a machine's perspective, but to use the same representation for both. The less client-side munging of what's in a link line, the better. And following an *absolute* IRI is actually easier under option 3. than under option 2, because it preserves the beautifully simple idea that to follow a link, you just send the corresponding server exactly what you find in the document, not some transformation of it or some subpart of it. A text/gemini link line is, in fact, a ready-to-use request with a label on it!

But we need to consider the implementation burden. Both 2. and 3.
require exactly the same punycoding before DNS lookup (and I still hope this will become more and more transparently handled by standard libraries over time), so it comes down to what's more widely supported and what's easiest to implement in the absence of support: percent encoding an IRI to a URI so it can be parsed, possibly absolutised and then sent over the wire as a purely ASCII request, or parsing and possibly absolutising an IRI as-is before sending it as UTF-8? It seems to have been a big point of concern on the ML that IRI parsing is rarely supported in standard libraries and difficult to implement from scratch, and that this totally sinks something like option 3. But it seems to be the case that IRIs can in fact be processed with standard tools in Python and Go, and sort-of-kinda in Java. Of course, that's not everywhere, but the capability doesn't exactly seem rare. And in any environment where option 2. is easy, it seems to me that 3. could be achieved roughly as easily just by transforming an IRI to a URI, parsing that and doing absolutisation with the standard URI tools that we assume exist everywhere, and then translating back to an absolute IRI in the end before sending the request. The basic idea is that transformation from IRI to URI happens as a last resort, only when necessary, and the transformation is reversed as early as possible. The extent and kind of transformation required is directly proportional to how stubbornly ASCII-only the environment is. There might be some environments (seemingly Python could be one) where transformation
On Tue, Dec 22, 2020 at 04:13:06PM +0100, Solderpunk wrote: > Hi folks, Hi! [...] > > Feedback welcome, especially if I've overlooked anything, which is > certainly possible. What I'd be most interested in hearing, at this > point, is client authors letting me know whether the standard library > in the language their client is implemented in can straightforwardly: > > 1. Parse and relativise URLs with non-ASCII characters (so, yes, okay, > technically not URLs at all, you know what I mean) in paths and/or > domains? Before i can answer i need help here, i do not know what "relativise" means, can someone explain (maybe in simple terms ;-)). Bye! C.
On Tue Dec 22, 2020 at 5:23 PM CET, cage wrote: > Before i can answer i need help here, i do not know what "relativise" > means, can someone explain (maybe in simple terms ;-)). Whoops! In fact, I meant "absolutise" - i.e. convert a relative URL into an absolute URL, by using the URL where the relative URL is seen to fill in the scheme, hostname and possibly part of the path. Sorry for the slip up. Cheers, Solderpunk
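The "absolutise" operation described above can be sketched with Python's standard library. This is only an illustration, not code from the thread: the helper name `absolutise` and the example.org URLs are made up. It relies on the fact that `urllib.parse.urljoin` only resolves relative references for schemes listed in its module-level tables, which is why `gemini` is registered first.

```python
from urllib.parse import urljoin, uses_netloc, uses_relative

# urljoin() only resolves relative references for schemes it knows about,
# so register "gemini" once before using it.
for table in (uses_relative, uses_netloc):
    if "gemini" not in table:
        table.append("gemini")


def absolutise(base: str, reference: str) -> str:
    """Resolve a (possibly relative) link against the URL it was seen at.

    `absolutise` is a hypothetical helper name used for this sketch.
    """
    return urljoin(base, reference)


# Hypothetical page and links; non-ASCII characters pass through untouched.
base = "gemini://example.org/docs/café/index.gmi"
print(absolutise(base, "../thé.gmi"))
print(absolutise(base, "/räksmörgås"))
print(absolutise(base, "gemini://other.example/x"))
```

Nothing here percent-encodes or punycodes anything; the sketch only fills in the scheme, host and leading path from the base.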
> On Dec 22, 2020, at 16:13, Solderpunk <solderpunk at posteo.net> wrote: > > Okay, I'm finally getting involved in this discussion. Thanks for the, hmm, textwall :) Glad you are back. To summarize: #1: Make ASCII Great Again. And again. #2: Transcribe between 1 & 3 #3: Take the IRI mantel Hopefully a fair transliteration of intent. Clearly #3 has the most appeal. But yes, nothing comes for free, and pragmatism may drag us back to #1. I don't quite see what #2 is for. Midway compromise? Either way, let's ruminate this.
On Tue, Dec 22, 2020 at 05:34:16PM +0100, Solderpunk wrote: Hi! [...] > > Whoops! In fact, I meant "absolutise" - i.e. convert a relative URL > into an absolute URL, by using the URL where the relative URL is seen to > fill in the scheme, hostname and possibly part of the path. Sorry for > the slip up. No problem, i think I am able to answer now! :) Bye! C.
On Tue Dec 22, 2020 at 5:54 PM CET, Petite Abeille wrote: > To summarize: > > #1: Make ASCII Great Again. And again. > #2: Transcribe between 1 & 3 > #3: Take the IRI mantel > > Hopefully a fair transliteration of intent. Yep, that's about right. :) > Clearly #3 has the most appeal. > > But yes, nothing comes for free, and pragmatism may drag us back to #1. > I don't quite see what #2 is for. Midway compromise? Option 2. doesn't appeal much to me either, but it seems, from my read through of most of the ML posts in the three threads you helpfully linked to, to be quite popular in the community, and it's also apparently more or less what the web does, so it seemed worth listing. Having it explicitly spelled out also makes it easy to compare exactly how much extra work is involved in option 3. compared to this. Cheers, Solderpunk
On Tue, Dec 22, 2020 at 04:13:06PM +0100, Solderpunk wrote:
> Hi folks,

Hi!

[...]

> Feedback welcome, especially if I've overlooked anything, which is
> certainly possible. What I'd be most interested in hearing, at this
> point, is client authors letting me know whether the standard library
> in the language their client is implemented in can straightforwardly:

The language I wrote my client in is Common Lisp.

> 1. Parse and relativise [absolutize] URLs with non-ASCII characters (so, yes, okay,
> technically not URLs at all, you know what I mean) in paths and/or
> domains?

The language has no concept of URI, IRI or even URL in the standard library. I am aware of two free/libre libraries but in my experience both have problems. I ended up writing my own parser for URI and IRI, which is probably broken as well. ;-)

> 2. Transform back and forth between URIs and IRIs?

Before making a request I punycode the domain and percent-encode the query and fragment (should also percent-encode the path?). Anyway, there is a third-party free library to do percent-encoding and decoding.

> 3. Do DNS lookups of IDNs without them being punycoded first? You can
> test this with räksmörgås.josefsson.org.

There is a library in CL that does punycoding; I wrapped a C library (libidn2) to do the same instead. I can resolve the domain above! :)

Bye!
C.
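For comparison with the "punycode the domain before the request" step described above, here is a minimal Python sketch of the same idea. It uses Python's built-in `idna` codec rather than libidn2, so it is an analogue and not the Common Lisp code under discussion; the helper names `to_ascii_host` and `to_unicode_host` are invented for the example.

```python
def to_ascii_host(host: str) -> str:
    """Convert an internationalised host name to its ASCII ("xn--") form."""
    # Python's built-in "idna" codec encodes each dot-separated label.
    return host.encode("idna").decode("ascii")


def to_unicode_host(host: str) -> str:
    """Convert an ASCII ("xn--") host name back to its Unicode form."""
    return host.encode("ascii").decode("idna")


ascii_host = to_ascii_host("räksmörgås.josefsson.org")
print(ascii_host)                   # ASCII-only name, suitable for a DNS lookup
print(to_unicode_host(ascii_host))  # back to the form a user would type
```

The ASCII form is what would be handed to a resolver that does not do this conversion on its own.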
> On Dec 22, 2020, at 18:18, cage <cage-dev at twistfold.it> wrote: > > (should also percent-encode the path?) Yes. The individual path segments actually. So, given /Foo/Bar/Baz, decompose the path into individual segments Foo, Bar, and Baz, encode these, and reconstruct the path. Easy-peasy.
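The segment-by-segment recipe above could look roughly like this in Python. It is a sketch under one loud assumption: the input path is not already percent-encoded (a literal "%" would be encoded again as "%25"). The function name `encode_path` is made up for the example.

```python
from urllib.parse import quote


def encode_path(path: str) -> str:
    """Percent-encode each path segment separately, then rejoin with "/".

    Assumes `path` is NOT already percent-encoded.
    """
    segments = path.split("/")
    # safe="" means everything except unreserved characters is encoded,
    # using UTF-8 bytes for non-ASCII characters.
    return "/".join(quote(segment, safe="") for segment in segments)


print(encode_path("/Foo/Bar/Baz"))  # plain ASCII segments are unchanged
print(encode_path("/Foo/Bär/Baz"))  # "ä" becomes its UTF-8 bytes, %C3%A4
```

Splitting first keeps the "/" separators out of the encoder, so only the contents of each segment are touched.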
On Tue, Dec 22, 2020 at 04:13:06PM +0100, Solderpunk <solderpunk at posteo.net> wrote a message of 278 lines which said: > pointed out that RFC 3987 is only a proposed standard, This specific point is probably irrelevant, since few people care about the difference between "proposed standard" and "standard". HTTP is "proposed standard", too. See RFC 7127 for this classification.
Glad to read all this, it makes a lot of sense. I'm in full support of option 3. In PHP from my experience parse_url can eat up any unicode I throw at it. I did not have to do any DNS lookup as I implemented a server and not a client. Percent decoding is also easy.
On Tue, Dec 22, 2020 at 04:13:06PM +0100, Solderpunk <solderpunk at posteo.net> wrote a message of 278 lines which said:

> What I'd be most interested in hearing, at this point, is client
> authors letting me know whether the standard library in the language
> their client is implemented in can straightforwardly:

Tests with Python. All of this is now implemented in the Agunua tool <https://framagit.org/bortzmeyer/agunua>.

% agunua gemini://gémeaux.bortzmeyer.org/café.gmi
# Du café
Si vous voyez cela, c'est que votre client Gemini gère les IRI.

> 1. Parse and relativise URLs with non-ASCII characters (so, yes, okay,
> technically not URLs at all, you know what I mean) in paths and/or
> domains?

No problem, standard library urllib.parse.urlparse parses IRI.

> 2. Transform back and forth between URIs and IRIs?

Not directly in the standard library, but the code is simple (attached).

> 3. Do DNS lookups of IDNs without them being punycoded first? You can
> test this with räksmörgås.josefsson.org.

Yes, punycoding is handled by the standard library socket.getaddrinfo. (May be a violation of RFC 6055 but I did not search further.)

There is also this third-party package which I did not test <https://pypi.org/project/rfc3987/>.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: convert-iri-uri.py
Type: text/x-python
Size: 1143 bytes
Desc: not available
URL: <https://lists.orbitalfox.eu/archives/gemini/attachments/20201222/77f2451d/attachment.py>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: convert-uri-iri.py
Type: text/x-python
Size: 659 bytes
Desc: not available
URL: <https://lists.orbitalfox.eu/archives/gemini/attachments/20201222/77f2451d/attachment-0001.py>
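Since the attachments were scrubbed from the archive, here is an independent sketch of what an IRI/URI conversion in Python might look like. It is not a reconstruction of the attached scripts: the function names are invented, it assumes an absolute IRI with a host and no userinfo, and it uses the built-in `idna` codec and `urllib.parse` quoting.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def iri_to_uri(iri: str) -> str:
    """Punycode the host and percent-encode the rest (hypothetical helper)."""
    parts = urlsplit(iri)
    host = parts.hostname.encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((
        parts.scheme,
        netloc,
        quote(parts.path, safe="/%"),     # keep separators and existing escapes
        quote(parts.query, safe="=&%"),
        quote(parts.fragment, safe="%"),
    ))


def uri_to_iri(uri: str) -> str:
    """Reverse direction, for display.

    Caveat: unquote() also decodes escapes such as %2F, so this is lossy
    for URIs that rely on escaped reserved characters.
    """
    parts = urlsplit(uri)
    host = parts.hostname.encode("ascii").decode("idna")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((
        parts.scheme,
        netloc,
        unquote(parts.path),
        unquote(parts.query),
        unquote(parts.fragment),
    ))


uri = iri_to_uri("gemini://gémeaux.bortzmeyer.org/café.gmi")
print(uri)
print(uri_to_iri(uri))
```

The forward direction is what options 1 and 2 ask of clients; the backward direction only matters for presenting links to users.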
> On Dec 22, 2020, at 17:54, Petite Abeille <petite.abeille at gmail.com> wrote: > > #3: Take the IRI mantel If this ever goes through, we should consider increasing the maximum request size to 4,096 bytes* to keep the number of characters constant.
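The arithmetic behind that figure can be sketched as follows. The Gemini spec currently caps the request URL at 1024 bytes, and UTF-8 uses at most 4 bytes per code point, so a 4,096-byte cap would be needed to guarantee room for 1024 characters in the worst case. The example URL and the helper name `request_size` are made up.

```python
MAX_REQUEST_BYTES = 1024   # URL length limit in the current Gemini spec
MAX_UTF8_BYTES_PER_CHAR = 4


def request_size(url: str) -> tuple:
    """Return (number of characters, number of UTF-8 bytes) for a URL."""
    return len(url), len(url.encode("utf-8"))


# Hypothetical IRI: 21 ASCII characters followed by ten two-byte characters.
chars, size = request_size("gemini://example.org/" + "é" * 10)
print(chars, size)

# Worst case: every character needs four bytes.
print(MAX_REQUEST_BYTES * MAX_UTF8_BYTES_PER_CHAR)
```

Most scripts need two or three bytes per character rather than four, so the worst case is rarely hit in practice.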
It was thus said that the Great Petite Abeille once stated: > > > > On Dec 22, 2020, at 16:13, Solderpunk <solderpunk at posteo.net> wrote: > > > > Okay, I'm finally getting involved in this discussion. > > Thanks for the, hmm, textwall :) Glad you are back. > > To summarize: > > #1: Make ASCII Great Again. And again. > #2: Transcribe between 1 & 3 > #3: Take the IRI mantel > > Hopefully a fair transliteration of intent. > > Clearly #3 has the most appeal. > > But yes, nothing comes for free, and pragmatism may drag us back to #1. > > I don't quite see what #2 is for. Midway compromise? 1. Status quo 2. Clients take the hit (have to support both URL and IRI) 3. Clients and servers take the hit (both have to support URL and IRI) -spc
> On Dec 22, 2020, at 23:18, Sean Conner <sean at conman.org> wrote: > > 3. Clients and servers take the hit (both have to support URL and IRI) This being a very equalitarian commune, I say this sounds fair: everybody "take the hit" for the greater good.
On Tue, 22 Dec 2020 16:13:06 +0100 "Solderpunk" <solderpunk at posteo.net> wrote:

> 1. Parse and relativise URLs with non-ASCII characters (so, yes, okay,
> technically not URLs at all, you know what I mean) in paths and/or
> domains?
> 2. Transform back and forth between URIs and IRIs?

I am using Go, which will do these things as you mentioned. Output from net/url:

gemini://räksmörgås.example.com:3131/åäöüÿ/hej/hopp??=?#ççç
Scheme: gemini
Path: /åäöüÿ/hej/hopp
EscapedPath: /%C3%A5%C3%A4%C3%B6%C3%BC%C3%BF/hej/hopp
RawQuery: ?=?
Hostname: räksmörgås.example.com
Port: 3131
RawFragment: ççç
EscapedFragment: %C3%A7%C3%A7%C3%A7

> 3. Do DNS lookups of IDNs without them being punycoded first? You can
> test this with räksmörgås.josefsson.org.

Go won't do this automatically as mentioned, but there is an experimental standard library project golang.org/x/net/idna that can assist. I think that this is the best approach; the use of IDNA is application dependent and IMO shouldn't be done automatically at such a low level.

Note that for Python, Python 3.x will correctly resolve as per your example, but Python 2.x will not. Python 3 also doesn't support IDNA2008 (see https://bugs.python.org/issue17305), which is slightly incompatible with IDNA2003. There is a third party library that supports IDNA2008. As a last resort, client authors should be able to link to e.g. Libidn2, license permitting.

In my case the problem with implementing IDNA is not in my application. My client is a browser plugin. The browser (Dillo) doesn't support IDN and development is pretty slow on their end. My plugin inherits this limitation.

Even then, I am for option #1 personally. IDN/IRI are presentational problems which I think should be left to the client. IDN/IRI in text/gemini for authors can be solved with tooling, but I am not sure that's desirable.
I've attached the source code to a text/gemini formatter that "un-internationalizes" IRIs in a text/gemini document passed on stdin anyway...discovered an HTTP-ism in net/url along the way :)

--
Philip

-------------- next part --------------
A non-text attachment was scrubbed...
Name: gmifmt.go
Type: application/octet-stream
Size: 1733 bytes
Desc: not available
URL: <https://lists.orbitalfox.eu/archives/gemini/attachments/20201223/b244c0ac/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 488 bytes
Desc: not available
URL: <https://lists.orbitalfox.eu/archives/gemini/attachments/20201223/b244c0ac/attachment.sig>
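One concrete case of the IDNA2003/IDNA2008 incompatibility mentioned in the message above is the German sharp s. Python's built-in `idna` codec follows IDNA2003, which maps "ß" to "ss" before encoding, whereas IDNA2008 treats "ß" as a valid character in its own right. The sketch below only demonstrates the IDNA2003 side with the standard library; the host name is a made-up example.

```python
def idna2003_host(host: str) -> str:
    """Encode a host name with Python's built-in (IDNA2003) codec."""
    return host.encode("idna").decode("ascii")


# Under IDNA2003 the sharp s is mapped to "ss", so no "xn--" label appears
# and the name collapses onto the plain ASCII spelling.
print(idna2003_host("straße.example"))
print(idna2003_host("strasse.example"))
```

A client using an IDNA2008 library would look up a different name for the first host than this one does, which is why the choice of IDNA version is visible to users.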
On Tue, Dec 22, 2020 at 11:21:38PM +0100, Petite Abeille wrote: > > > > On Dec 22, 2020, at 23:18, Sean Conner <sean at conman.org> wrote: > > > > 3. Clients and servers take the hit (both have to support URL and IRI) > > This being a very equalitarian commune, I say this sounds fair: everybody "take the hit" for the greater good. Everyone who stays takes the hit, that is. My "threshold" for complexity is "can I write a conforming, relatively strict and safe server only relying on the OpenBSD base system". It's kind of arbitrary, sure, but what isn't. Mandatory IRI support (and there's no real way to keep it optional, considering queries) would push the implementation complexity a step too far for me. bie
> On Dec 23, 2020, at 01:41, bie <bie at 202x.moe> wrote: > > My "threshold" for complexity is "can I write a conforming, relatively > strict and safe server only relying on the OpenBSD base system". It's > kind of arbitrary, sure, but what isn't. Fair enough. But what's the showstopper really? Not sure what the "base system" contains, nor the level at which you interact with it, but it lists Perl as one of its component. Which could handle IRIs*. But if strict ASCII is all what "OpenBSD base system" can do, ever, then so be it. On the other hand, one can always, you know, write such IRI parser on their own. It has been done before. There must be a C compiler somewhere in that base system, no?
On Wed, Dec 23, 2020 at 02:02:06AM +0100, Petite Abeille wrote: > > > > On Dec 23, 2020, at 01:41, bie <bie at 202x.moe> wrote: > > > > My "threshold" for complexity is "can I write a conforming, relatively > > strict and safe server only relying on the OpenBSD base system". It's > > kind of arbitrary, sure, but what isn't. > > Fair enough. But what's the showstopper really? > > Not sure what the "base system" contains, nor the level at which you interact with it, but it lists Perl as one of its component. Which could handle IRIs*. > > But if strict ASCII is all what "OpenBSD base system" can do, ever, then so be it. > > On the other hand, one can always, you know, write such IRI parser on their own. It has been done before. There must be a C compiler somewhere in that base system, no? > > * https://metacpan.org/pod/IRI Should have specified the language (C), too. I'm not going to be pulling in perl, and writing a full-fledged IRI parser from scratch in C sounds profoundly uncomfortable. In any case, it's not about what's possible, just a purely personal opinion about where gemini gets too complex to be fun. I'm not expecting anyone to share my exact preferences, just putting it out there as a single anecdotal data point (from someone who so far has been serving mostly non-ascii content over gemini with no real problems or complaints) bie
> On Dec 23, 2020, at 02:37, bie <bie at 202x.moe> wrote: > > Should have specified the language (C), too. I'm not going to be pulling > in perl, and writing a full-fledged IRI parser from scratch in C sounds > profoundly uncomfortable. Fair enough. And libcurl is of no help either? Or HTParse.c? > in any case, it's not about what's possible, just a purely personal > opinion about where gemini gets too complex to be fun. Ok. Different pain thresholds I guess.
> On Dec 23, 2020, at 02:49, Petite Abeille <petite.abeille at gmail.com> wrote: > > And libcurl is of no help either? Or HTParse.c? Or https://uriparser.github.io , no use either?
> On Dec 23, 2020, at 02:37, bie <bie at 202x.moe> wrote: > > (from someone who so far has been serving > mostly non-ascii content over gemini with no real problems or complaints) Got to say, I don't get it. You have a perfectly functional gemini server running on openbsd, handcrafted in C, serving unicode content without a fuss, handling URIs, and the whole shebang, but suddenly IRIs push you over the brink?! This doesn't add up. But ok. To each their own.
On Wed, Dec 23, 2020 at 03:10:58AM +0100, Petite Abeille wrote: > > > > On Dec 23, 2020, at 02:37, bie <bie at 202x.moe> wrote: > > > > (from someone who so far has been serving > > mostly non-ascii content over gemini with no real problems or complaints) > > Got to say, I don't get it. > > You have a perfectly functional gemini server running on openbsd, handcrafted in C, serving unicode content without a fuss, handling URIs, and the whole shebang, but suddenly IRIs push you over the brink?! > > This doesn't add up. Why doesn't it add up? My server doesn't have to know anything about unicode to serve a text file, just like it doesn't have to be able to parse JPEGs to serve images. IRIs means it *does* have to know something about unicode, which ucs characters are valid IRI characters, that the "private" UCS are only valid in the query part etc etc. bie
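To give a sense of the size of the check described above, here is a sketch of the two character classes in Python (used here only for illustration; the server in question is written in C). The ranges are copied from the `ucschar` and `iprivate` rules of RFC 3987; the function names are made up.

```python
# Code point ranges from the "ucschar" rule of RFC 3987.
UCSCHAR_RANGES = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
# Planes 1 to 13 each contribute xx0000-xxFFFD; plane 14 starts at E1000.
UCSCHAR_RANGES += [(p << 16, (p << 16) | 0xFFFD) for p in range(1, 14)]
UCSCHAR_RANGES.append((0xE1000, 0xEFFFD))

# Code point ranges from the "iprivate" rule (private use areas).
IPRIVATE_RANGES = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]


def _in_ranges(ch: str, ranges) -> bool:
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def allowed_non_ascii(ch: str, in_query: bool) -> bool:
    """True if a non-ASCII character may appear at this point in an IRI.

    Private-use characters are only accepted inside the query component.
    """
    if _in_ranges(ch, UCSCHAR_RANGES):
        return True
    return in_query and _in_ranges(ch, IPRIVATE_RANGES)


print(allowed_non_ascii("é", in_query=False))
print(allowed_non_ascii("\ue000", in_query=False))
print(allowed_non_ascii("\ue000", in_query=True))
```

This covers only the character classes; a full RFC 3987 validator also has to apply the usual URI grammar around them.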
> On Dec 23, 2020, at 03:54, bie <bie at 202x.moe> wrote: > > My server doesn't have to know anything about unicode to serve a text > file, just like it doesn't have to be able to parse JPEGs to serve > images. IRIs means it *does* have to know something about unicode, which > ucs characters are valid IRI characters, that the "private" UCS are only > valid in the query part etc etc. Ok, so Unicode again. Fair enough. Where is Plan9 when you need it. Sigh.
> On Dec 23, 2020, at 09:45, Petite Abeille <petite.abeille at gmail.com> wrote: > > Ok, so Unicode again. Fair enough. Where is Plan9 when you need it. Sigh. While at it, anyone running gemini on Plan9? https://9p.io/plan9/ Or perhaps even using Plan 9 from User Space somehow for gemini? https://9fans.github.io/plan9port/
I read that molly brown works on 9front.
> On 23 Dec 2020, at 10:33, Petite Abeille <petite.abeille at gmail.com> wrote:
>
> While at it, anyone running gemini on Plan9?
>
> https://9p.io/plan9/
>
> Or perhaps even using Plan 9 from User Space somehow for gemini?
>
> https://9fans.github.io/plan9port/

Yes! gemini://9til.de is powered by Molly Brown on 9front (a plan9 fork).

-- julienxx
Hi

> My "threshold" for complexity is "can I write a conforming, relatively
> strict and safe server only relying on the OpenBSD base system". It's
> kind of arbitrary, sure, but what isn't.
>
> Mandatory IRI support (and there's no real way to keep it optional,
> considering queries) would push the implementation complexity a step too
> far for me.

TLDR:

I am with bie on the matter. Option 3 is a bridge too far for me too.

Wall Of Text:

So I value the decency which wants to include all human languages in the gemini ecosystem. But in an effort to be inclusive in one dimension one ends up being exclusive in another dimension, namely in the space of computer languages/host operating systems. It is one thing to find full I18N support in a language such as python (slow batteries included), but what about minorities such as tcl, lua, m4 or sed?

Protocols define the interactions between computers. Computers don't speak any human language at all, they are programmed in computer languages. And so it strikes me as weird to embed the (combinatorial) complexity of human languages deep in the protocol stack, but risk excluding niche computer languages or operating systems... which in some cases are just one-man efforts.

While the OSI 7 layer network model has its deficiencies ("all models are wrong, some are useful"), it does help us think about a network, from inconvenienced electrons at the lowest layer to high level abstractions at the top.

I think internationalisation concerns belong in the very highest level of a stack. You expect me to say presentation or application-level, but remember the OSI model is wrong (for instance, things like HTTP or gemini are typically lumped into one application layer, when there are many layers to them). The actual highest level is the naive computer user who gets told to "move the mouse over this and then click on this, like so...".
At that level, it might make sense for a gemini browser to be fully localised, and render an url in the local language (maybe even left to right, or top to bottom). But even the layer just below that (the competent user level) this starts leaking. A gemini url starts with "gemini://" - that is ascii text, and even funnier, taken from latin. If a non-english user is confused by english (nay, latin, with no native speakers at all) words, then surely "gemini://" has to be rewritten as "tweling://" or "zwilling://" or whatever farsi, japanese or mongolian use for "twin". If not, then an full ascii text url should be manageable too... an url is primarily a computer address. Long ago I came across a version of (I think it was) Pascal had been localised into french with language keywords like "begin" and "if" replaced. I am sure somebody can justify this somehow, but I thought this was an impediment to interoperability, and view the internationalising of computer protocols (as opposed to the user interfaces) in a similar way. regards marc
> On Dec 23, 2020, at 10:47, Julien Blanchard <julien at typed-hole.org> wrote: > > Yes! gemini://9til.de is powered by Molly Brown on 9front (a plan9 fork). Wicked! In a very good way :) Any client side niftiness?
> On Dec 23, 2020, at 11:45, Petite Abeille <petite.abeille at gmail.com> wrote:
>
> Wicked! In a very good way :)
>
> Any client side niftiness?

Of course, there are (to my knowledge) gemnine https://git.sr.ht/~ft/gemnine and my own castor9 https://git.sr.ht/~julienxx/castor9

-- julienxx
> On Dec 23, 2020, at 11:00, marc <marcx2 at welz.org.za> wrote:
>
> TLDR:
>
> I am with bie on the matter. Option 3 is a bridge
> too far for me too.
>
> Wall Of Text:

Actually, I do have sympathy for your position.

And, yes, you are technically correct. The best kind of correct. No arguments here.

Unicode is a big pill to swallow. This is why we have been stuck with ASCII for so long.

And yes, technically, everything can be transcoded back and forth between ASCII and The World. Machines talking to machines.

But, personally, I think this is missing the bigger picture about what Gemini is about. It's not purely a technical endeavor. After all -as people keep pointing out ad nauseam- if you want gopher/http/whatnot, you know where to find them.

Gemini has a humanistic stride to it. Some poetry, dare I say. Aesthetics matter. A human touch matters.

This is why Unicode matters. It's the human face of a technology. This counts for something.

As someone used to say: "Technology alone is not enough".

This should strike a chord with a community rooted in gopher, of all things. Gopher is not a "technology", it's a community, in the best sense of the term: people talking and sharing with other people. An exchange of ideas.

This is what Gemini cares about: people. Not technology. Even if technology is necessary to achieve its humanistic goals.

It's therefore my opinion that technologists like us should make the extra effort to make our technology as human friendly as possible. Even if this costs us something. We can do it. For the community.

"The details are not the details; they are the product" -- Charles and Ray Eames
> On Dec 23, 2020, at 12:02, Julien Blanchard <julien at typed-hole.org> wrote:
>
>> On Dec 23, 2020, at 11:45, Petite Abeille <petite.abeille at gmail.com> wrote:
>>
>> Wicked! In a very good way :)
>>
>> Any client side niftiness?
>
> Of course, there are (to my knowledge) gemnine https://git.sr.ht/~ft/gemnine and my own castor9 https://git.sr.ht/~julienxx/castor9

Double wicked! Always had a soft spot for plan9 :)
On Wed, Dec 23, 2020 at 11:54:16AM +0900, bie wrote:
> My server doesn't have to know anything about unicode to serve a text
> file, just like it doesn't have to be able to parse JPEGs to serve
> images. IRIs means it *does* have to know something about unicode,
> which ucs characters are valid IRI characters, that the "private" UCS
> are only valid in the query part etc etc.

Exactly. Do not push the insane complexity of Unicode on everybody, including those who do not need it.

The good thing about the current state of affairs is that a server can treat unicode as an opaque bytestring, and a client does not need to be aware of unicode either: to locate links a plain strstr(gmi, "=>") is enough, and the client can just dump the response to stdout and let the terminal driver deal with that. Or not deal.

Anything but option #1 is too much complexity in my opinion.

By the way, I really don't understand all this fuss about Unicode links. Seriously, why? We have had

$ rm --recursive --no-preserve-root /*

for generations, and nobody bothered to "internationalize" it into something like

$ ?? --?????????? --??-?????????-?????? /*
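To make the "links can be found without knowing anything about Unicode" point concrete, here is a minimal Python sketch of a link scanner. It is only an illustration under stated assumptions: the function name `extract_links` is hypothetical, and the code looks for the "=>" prefix at the start of a line (as text/gemini link lines are written) rather than calling strstr over the whole body. Targets and labels are passed through untouched, whatever characters they contain.

```python
FENCE = "`" * 3  # marker that toggles preformatted mode in text/gemini


def extract_links(gemtext):
    """Return (target, label) pairs for every link line in a text/gemini body.

    Hypothetical helper: it never inspects or validates the characters in
    the target or label, it only splits on whitespace.
    """
    links = []
    preformatted = False
    for line in gemtext.splitlines():
        if line.startswith(FENCE):
            # Lines between fence markers are not links, even if they
            # happen to start with "=>".
            preformatted = not preformatted
            continue
        if preformatted or not line.startswith("=>"):
            continue
        fields = line[2:].split(maxsplit=1)
        if not fields:
            continue  # a bare "=>" with no target
        target = fields[0]
        label = fields[1].strip() if len(fields) > 1 else ""
        links.append((target, label))
    return links
```

Whether the target is then usable as a request is a separate question, and that is where the URI versus IRI debate comes in.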
> On Dec 23, 2020, at 12:54, Dmitry Bogatov <gemini#lists.orbitalfox.eu#v1 at kaction.cc> wrote:
>
> By the way, I really don't understand all this fuss about Unicode links.

That's perfectly fine. As adults, we should be able to hold and comprehend two divergent ideas at the same time.

We can agree to disagree. It's not about who is "right". Both sides are correct. They just have different values. It's a choice. That's all.
>> On Dec 23, 2020, at 09:45, Petite Abeille <petite.abeille at gmail.com> wrote: > While at it, anyone running gemini on Plan9? gemini://provisoire.ca/ (quite new, little content) is hosted on Plan9 via rc-gemd and I use castor9 as my primary client. S -- Shawn Nock <shawn at provisoire.ca>
On Dec 23, 2020, at 09:45, Petite Abeille <petite.abeille at gmail.com> wrote: > While at it, anyone running gemini on Plan9? Yes my site[0] is run using my own rc-gemd[1] on 9front. For a client I generally use gemnine. It should be possible to use rc-gemd from plan9port but you would need some sort of UNIX tlsserver and aux/listen1. On 12/23/20 7:38 AM, Shawn Nock wrote: > gemini://provisoire.ca/ (quite new, little content) is hosted on Plan9 > via rc-gemd and I use castor9 as my primary client. I am very happy you were able to make use of rc-gemd ? Cheers, Moody [0] gemini://posixcafe.org [1] http://shithub.us/git/moody/rc-gemd/HEAD/info.html
"Solderpunk" <solderpunk at posteo.net> writes: Answering for Common Lisp, as best I know (I'm kind of a n00b). Detailed Common Lisp spam below, skip if you are afraid of parentheses. > 1. Parse and relativise URLs with non-ASCII characters (so, yes, okay, > technically not URLs at all, you know what I mean) in paths and/or > domains? No URI handling in the standard library, but quicklisp has libraries for it. I'm using quri, which I think is the most used, and it seems to be fine. CL-USER> (defparameter *my-iri* (quri:uri "gemini://r?ksm?rg?s.josefsson.org/?/?.gmi"))
> On Dec 23, 2020, at 17:05, Jason McBrayer <jmcbray at carcosa.net> wrote:
>
>> Getting good data on all three of these questions for a wide range
>> of languages is necessary to make a well-informed decision here.
>
> Personally, I would be most gratified if option 3 proved to be workable.

Adding my 2¢... as a Lua aficionado -with its longstanding DIY ethos- I see no issues whatsoever.

Also, kudos to Sean Conner for being the standard-bearer for Lua in the Gemini space.

I'm personally always in awe at his mastery of Parsing Expression Grammars. A work of true beauty.

Thanks Sean! :)
On Tue, Dec 22, 2020 at 04:13:06PM +0100, Solderpunk wrote: > It's true that this would be a breaking change, although of a different > kind from other breaking changes I've pushed back against in the past. > It's not as if Geminispace would suddenly become impossible to access, > or would split into two totally incompatible subspaces based on the > old and new protocol versions. Any currently extant Gemini document > which included ASCII-only links would remain perfectly accessible > by old and new clients alike. So, it's a relatively soft break. > Given the importance of first-class internationalisation support, > it might be worthwhile. It's only a soft break if you don't consider the query string. Even if all the names on my server are ASCII only, a change to IRIs means I'll be forced to update my server to remain compatible since I have dynamic scripts that accept content through the query string. Solution 3 is a "hard no" for me - the increased complexity is not something I'm willing to take on, and I'll most likely just end up shutting down my servers. bie
Sean Conner <sean at conman.org> writes:

> 1. Status quo
> 2. Clients take the hit (have to support both URL and IRI)
> 3. Clients and servers take the hit (both have to support URL and IRI)

Looking at 2, servers still have to take a hit here.

i. They need to de-punycode the hostname to compare it to configured virtual host names (unless virtual host names are configured in punycode).

ii. They need to url-decode the path in order to find matching file names; they have to do this already to handle reserved characters, though.

So the support needed for servers is possibly similar between 2 and 3. Clients are hit a little harder by 2.

--
Jason F. McBrayer jmcbray at carcosa.net
A flower falls, even though we love it; and a weed grows,
even though we do not love it. -- Dogen
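Point (i) above can be sketched in a few lines of Python. This is a hedged illustration, not any particular server's code: the names `host_to_ascii` and `match_vhost` are hypothetical, and it leans on Python's built-in "idna" codec (IDNA 2003) instead of de-punycoding, by normalising both the requested and the configured names to their ASCII form before comparing.

```python
from typing import Optional


def host_to_ascii(host: str) -> str:
    """Return the lowercased ASCII (punycode) form of a hostname.

    Pure-ASCII names pass through unchanged; labels with non-ASCII
    characters come back in "xn--" form.
    """
    return host.encode("idna").decode("ascii").lower()


def match_vhost(request_host: str, configured_hosts: list) -> Optional[str]:
    """Find the configured virtual host matching the requested hostname.

    Both sides are normalised, so the configuration may list names in
    Unicode or in punycode, and the request may use either form.
    """
    wanted = host_to_ascii(request_host)
    for name in configured_hosts:
        if host_to_ascii(name) == wanted:
            return name
    return None
```

Comparing in the ASCII form means the server never has to convert punycode back to Unicode, which is the cheaper direction if virtual hosts are configured with Unicode names.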
> On Dec 23, 2020, at 14:38, Shawn Nock <shawn at provisoire.ca> wrote:
>
> gemini://provisoire.ca/

Lovely domain name.

/provisoire/ adjectif: Qui existe, se fait en attendant autre chose, ou d'être remplacé. (Adjective: existing or done while waiting for something else, or until it is replaced.)
> On Dec 23, 2020, at 17:31, bie <bie at 202x.moe> wrote: > > Solution 3 is a "hard no" for me - the increased complexity is not > something I'm willing to take on, and I'll most likely just end up > shutting down my servers. It is what it is.
Although my server is written in Clojure, I'm leveraging the Java standard libraries in Space Age since there is little value in reinventing the wheel here.

In Java world, URIs can be parsed and generated with java.net.URI. This class accepts URIs with Unicode characters in the path, query, and fragment segments. However, it will throw an exception if Unicode characters are included in the domain name.

Conversion between Unicode and punycode can be done with java.net.IDN.

```
Clojure 1.10.1
user=> (import 'java.net.IDN)
java.net.IDN
user=> (IDN/toUnicode "xn--9dbne9b.com")
"????.com"
user=> (IDN/toASCII "????.com")
"xn--9dbne9b.com"
```

Easy peasy.

Sadly, there is no java.net.IRI. So if we went with options 2 or 3, I would need to manually parse the Gemini request into segments (not particularly challenging, of course). Then I could use java.net.IDN to perform punycode-to-Unicode or Unicode-to-punycode encoding (depending on whether we went with option 2 or 3) to perform robust virtual hostname lookups (and presumably SNI verification as well). Finally, I'd need to use java.net.URI to combine the punycoded domain name back with the path, query, and fragment segments into a valid URI that I could then parse and percent-decode without throwing an exception.

All of this should be doable with a bit of custom logic wrapped around the Java standard library, so I think either option 2 or 3 should be technically feasible from my end (or for anyone else using a language that compiles to Java bytecode).

Happy hacking,
Gary

--
GPG Key ID: 7BC158ED
Use `gpg --search-keys lambdatronic' to find me
Protect yourself from surveillance: https://emailselfdefense.fsf.org
=======================================================================
() ascii ribbon campaign - against html e-mail
/\ www.asciiribbon.org - against proprietary attachments

Why is HTML email a security nightmare? See https://useplaintext.email/

Please avoid sending me MS-Office attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html
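The split / punycode / recombine pipeline described in the message above can be sketched in Python as well, again using only the standard library. This is a rough illustration and not the Space Age implementation: `iri_to_uri` is a hypothetical name, the "idna" codec is IDNA 2003, and userinfo and IPv6 literals are deliberately ignored to keep the sketch short.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri: str) -> str:
    """Convert an IRI into an ASCII-only URI (sketch, see assumptions above)."""
    parts = urlsplit(iri)

    # Host: Unicode labels become punycode ("xn--..."), ASCII passes through.
    host = parts.hostname or ""
    netloc = host.encode("idna").decode("ascii")
    if parts.port is not None:
        netloc += ":%d" % parts.port

    # Path, query, fragment: percent-encode the UTF-8 bytes of anything
    # non-ASCII. "%" is kept safe so existing escapes are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="%&=+")
    fragment = quote(parts.fragment, safe="%")

    return urlunsplit((parts.scheme, netloc, path, query, fragment))
```

For example, `iri_to_uri("gemini://example.org/cafè.gmi")` gives `gemini://example.org/caf%C3%A8.gmi`, while a request that is already a plain URI comes back unchanged.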
On Tue, Dec 22, 2020 at 06:23:51PM +0100, Petite Abeille wrote: > > > > On Dec 22, 2020, at 18:18, cage <cage-dev at twistfold.it> wrote: > > > > (should also percent-encode the path?) > > Yes. > > The individual path segments actually. > > So, given /Foo/Bar/Baz, decompose the path into individual segments Foo, Bar, and Baz, encode these, and reconstruct the path. Easy-peasy. Seems simple, but i can make mess even with simple things. :) Thank you! :) Bye!
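The segment-by-segment approach described in the quoted reply can be sketched as follows. This is a minimal Python illustration with a hypothetical function name, `encode_path`, and it assumes the input path is not already percent-encoded (a literal "%" would be encoded again as "%25").

```python
from urllib.parse import quote


def encode_path(path: str) -> str:
    """Percent-encode each path segment separately, then rejoin with "/".

    Encoding per segment means a reserved character inside a segment
    (such as "?" or "#") is escaped, while the "/" separators survive.
    """
    segments = path.split("/")
    return "/".join(quote(segment, safe="") for segment in segments)
```

With this, `/Foo/Bar/Baz` is returned unchanged, while a segment such as `a b.gmi` becomes `a%20b.gmi`.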
It was thus said that the Great Solderpunk once stated:
> Feedback welcome, especially if I've overlooked anything, which is
> certainly possible. What I'd be most interested in hearing, at this
> point, is client authors letting me know whether the standard library
> in the language their client is implemented in can straightforwardly:
>
> 1. Parse and relativise URLs with non-ASCII characters (so, yes, okay,
> technically not URLs at all, you know what I mean) in paths and/or
> domains?
> 2. Transform back and forth between URIs and IRIs?
> 3. Do DNS lookups of IDNs without them being punycoded first? You can
> test this with räksmörgås.josefsson.org.

For C, I'm sure there is code, somewhere, that can parse IRIs, but it's a matter of finding it.

For Lua, the answers are:

1. Yes. I had to write some code [1][2], and modify some existing code [3], but Lua now has modules to parse IRIs and URIs.

2. I can do IRI->URL, but not the other way---I have no need of a URL->IRI as of yet.

3. For my setups (systems I've been able to test), I cannot look up IDNs as is---I *have* to convert to punycode first.

-spc

[1] https://github.com/spc476/LPeg-Parsers/blob/master/iri.lua
[2] https://github.com/spc476/lua-conmanorg/blob/master/src/idn.c
[3] https://github.com/spc476/GLV-1.12556/blob/master/Lua/GLV-1/url-util.lua
It was thus said that the Great bie once stated:
>
> Should have specified the language (C), too. I'm not going to be pulling
> in perl, and writing a full-fledged IRI parser from scratch in C sounds
> profoundly uncomfortable.

So what library or code are you using now to parse URIs? When I wrote my IRI parser [1] I took my existing URL parser [2], and just changed the unreserved rule:

ASCII: ALPHA / DIGIT / '-' / '.' / '_' / '~'
UTF-8: ALPHA / DIGIT / '-' / '.' / '_' / '~' / utf8

where 'utf8' is any character 128 or higher. I didn't bother with restricting the private UCS set to the query because sometimes I think RFC authors are more concerned with theory [3] than with practice and complicate things.

Now the conversion of a domain name to punycode on the other hand ... I left that to libidn.

-spc

[1] https://github.com/spc476/LPeg-Parsers/blob/master/iri.lua
[2] https://github.com/spc476/LPeg-Parsers/blob/master/url.lua
[3] A charitable way of saying "smoking crack." I mean, RFC-822 (written in 1982) allows:

"Look! I'm smoking some good stuff" (no really its good) @ berkeley (in California) . edu

as a valid email address! (spaces and all) No, really, look it up!
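The one-rule change described above (URI "unreserved" plus anything at or above 128) can be illustrated with a small Python predicate. This is a sketch only, not a translation of the LPeg grammars: `is_unreserved` and its `allow_non_ascii` flag are hypothetical names, and it works on decoded code points rather than on UTF-8 bytes, which accepts the same set of characters for well-formed input.

```python
def is_unreserved(ch: str, allow_non_ascii: bool = False) -> bool:
    """Check one character against the 'unreserved' rule.

    allow_non_ascii=False: ALPHA / DIGIT / '-' / '.' / '_' / '~'  (URI)
    allow_non_ascii=True:  the same, plus any code point >= 128   (relaxed IRI)
    """
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    if ord(ch) < 128:
        # For ASCII input, isalnum() is true exactly for A-Z, a-z and 0-9.
        return ch.isalnum() or ch in "-._~"
    return allow_non_ascii
```

Like the parser it mirrors, this deliberately skips the finer distinctions in RFC 3987 (for example, private-use characters being allowed only in the query).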
It was thus said that the Great marc once stated: > > It is one thing to find full I8N support in a language such as python > (slow batteries included), but what about minorities such tcl, lua, m4 or > sed ? I have Lua covered. I can't say for the others (other than, you really use m4? You are a better man than I am, Gunga Din). > I think internationalisation concern belong in the very highest level of a > stack. You expect me to say presentation or application-level, but > remember the OSI model is wrong (For instance, things like HTTP or gemini > are typically lumped into one application layer, when there many layers to > them). The actual highest level is the naive computer uses who gets told > to "move the mouse over this and then click on this, like so...". At that > level, it might make sense for a gemini browser to be fully localised, and > render an url in the local language (maybe even left to right, or top to > bottom). I have to deal with the telephony network at work. It *is* the OSI seven layer burrito [1] and even *there* there are baked in assumptions relating to i18n [2]. Text is limited to ASCII. Yup. 7-bit US-ASCII it all its glory. Anything else requires some very nasty hacks. Even better, there does exist a way to relate a name to a phone number, but it's restricted to just 15 bytes of US-ASCII. So "Rafaella Gabriela Sarsaparilla" gets cut to "Rafaella Gabrie". Lovely, isn't it? > But even the layer just below that (the competent user level) this starts > leaking. A gemini url starts with "gemini://" - that is ascii text, and > even funnier, taken from latin. If a non-english user is confused by > english (nay, latin, with no native speakers at all) words, then surely > "gemini://" has to be rewritten as "tweling://" or "zwilling://" or > whatever farsi, japanese or mongolian use for "twin". If not, then an full > ascii text url should be manageable too... an url is primarily a computer > address. 
Sushi comes from Japanese, gesundheit from German, sauna from Finnish, smorgasbord from Swedish, borscht from Russian and ketchup from China, what's your point? All those are perfectly cromulent (from Simpsons) words. Modern English sucks up words from all other languages. Also, what's the Japanese equivalent of 'https'? I'm curious. > Long ago I came across a version of (I think it was) Pascal > had been localised into french with language keywords > like "begin" and "if" replaced. It wasn't H?stad [3], was it? If it was, I made that up to make a point about LISP. But yes, there have been several such localizations in the past for various languages but they never caught on internationally for some reason. One language I heard about, Cornerstone, used a novel method for identifiers---the visual representation was not part of the code but from a map---change a variable name in one place, and every place that variable appeared would change its name. Pretty cool concept if you ask me. -spc [1] And a complete pain to work with. Fortunately, it's becoming less and less of an issue as things are transitioning to the Internet, but the phone companies are fighting and screaming all the way. [2] ??t?r??t????l?z?t??? [3] http://boston.conman.org/2008/01/04.1
It was thus said that the Great Petite Abeille once stated: > > Also, kudos to Sean Conner for being the standard-bearer for Lua in the > Gemini space. > > I'm personally always in awe at his mastery of Parsing Expression > Grammars. A work of true beauty. > > Thanks Sean! :) You're welcome. -spc (I still seem to be the only one to have a server in Lua)
Hello > > It is one thing to find full internationalisation support in a language such as python > > (slow batteries included), but what about minorities such tcl, lua, m4 or > > sed ? > > I have Lua covered. I can't say for the others (other than, you really > use m4? You are a better man than I am, Gunga Din). So I do use m4 - it can be quite nifty to generate latex fragments, but that is because latex doesn't play as nicely with pipes as (g)roff where one can just stream things in... m4 doesn't strike me as that special ? Prolog and postscript felt far more exotic to me, and web servers have been written in the latter... > I have to deal with the telephony network at work. It *is* the OSI seven > layer burrito [1] and even *there* there are baked in assumptions relating > to i18n [2]. Text is limited to ASCII. Yup. 7-bit US-ASCII it all its > glory. Anything else requires some very nasty hacks. Note how the global telephone system has made it into the furthest corners of the planet - arguably further than the internet, and did so without worrying about internationalisation relating to their URL equivalents (phone numbers)... > > But even the layer just below that (the competent user level) this starts > > leaking. A gemini url starts with "gemini://" - that is ascii text, and > > even funnier, taken from latin. If a non-english user is confused by > > english (nay, latin, with no native speakers at all) words, then surely > > "gemini://" has to be rewritten as "tweling://" or "zwilling://" or > > whatever farsi, japanese or mongolian use for "twin". If not, then an full > > ascii text url should be manageable too... an url is primarily a computer > > address. > > Sushi comes from Japanese, gesundheit from German, sauna from Finnish, > smorgasbord from Swedish, borscht from Russian and ketchup from China, > what's your point? 
The insinuation was that internationalised URLs are essential because people who don't speak english at all might not be able to comprehend or (if their input system is sufficiently different) generate ascii/latin text.

And my argument is that this doesn't make sense, as every gemini url starts with "gemini://" which is ascii text in a language that nobody speaks anymore. And if people can manage to type "gemini://" then a bit more ascii in the hostname or even path should be quite manageable too, even for "people who use scripts like arabic, chinese, devanageri, etc." to quote another list participant.

A pity that I failed to convey this point properly - you and I (and bie, and some others) have had a very similar conversation on the 7th and 9th of this month (under the subject "IDN with Gemini"), where I tried to explain my position that I view a language as a communications protocol and not the property of an ethnicity or nation.

The desire to be inclusive is good, but we are deferential to a pretty recent concept/meme - the monolingual nation state, which is say 200 or 300 years old. Before that (at least in europe, but elsewhere too) each little region had a pretty strong regional dialect or even language (limited mobility or literacy allows for rapid linguistic drift). People who were educated spoke a second or third language to interact with the clergy or the palaces far away.

In this regard having people learn a new language to interact with the internet isn't that much of an imposition, but a return to the way things were... just scaled up to the size of the planet.

> All those are perfectly cromulent (from Simpsons) words.
> Modern English sucks up words from all other languages.

Older english does too: That's why a dead cow is beef. All languages do, absent a (religious or state-sponsored) authority enforcing a level of purity aka stasis. Living languages evolve.

> But yes, there have been several such localizations in the past for
> various [programming] languages but they never caught on internationally
> for some reason.

Isn't that yet another hint ? That the point of a language is to communicate, not to serve as a barrier, despite the machinations of nationalists ?

regards

marc
bie <bie at 202x.moe> writes:
>
> My server doesn't have to know anything about unicode to serve a text
> file, just like it doesn't have to be able to parse JPEGs to serve
> images. IRIs means it *does* have to know something about unicode, which
> ucs characters are valid IRI characters, that the "private" UCS are only
> valid in the query part etc etc.
>
> bie

I think we're in the same boat, as I have written my server from scratch using only stuff that's in base on OpenBSD too.

Initially I was totally for option #3 (though I've only just finished skimming through the RFC), but reading your messages left me a little scared of the consequences.

Today I did some light testing, and it seems (IF I'm understanding everything correctly -- please correct me otherwise) that option #3 is actually simpler for us.

Current state of affairs: Lagrange (0.13.1), amfora and elpher all encode "gemini.omarpolo.com/cafè.gmi" as "gemini.omarpolo.com/caf%C3%A8.gmi". Obviously open("caf%C3%A8.gmi") fails, so my server returns 51 because the actual file name is "cafè.gmi". I have to write code that decodes parts of the request if I want to serve a file named like that (spoiler: I'm not gonna write it).

With IRI: the request becomes "gemini://gemini.omarpolo.com/cafè.gmi", so open("cafè.gmi") doesn't fail. I think that we can continue to treat the request as a bytestring, extract the path and try to open(2) it.

I know that what I'm proposing is really a poor man's solution, because whether we choose option #1, #2 or #3, we can't really treat the path in the URL/IRI as a bytestring and call it a day. UNIX file names are real bytestrings with only two forbidden octets; URLs/IRIs aren't.

So, if I'm not missing anything, I'm all in for option #3.
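For comparison, the decoding step that the message above would rather not write is small in a language with a URL library. The following Python sketch is an assumption-laden illustration, not anyone's server code: `request_to_relpath` is a hypothetical name, and the safety checks are a bare minimum. It maps both the percent-encoded request and the raw IRI form to the same file name bytes.

```python
from urllib.parse import unquote_to_bytes, urlsplit


def request_to_relpath(request_url: str) -> bytes:
    """Turn a Gemini request URL (or IRI) into a relative file path in bytes.

    Percent-escapes are decoded and literal non-ASCII characters are taken
    as their UTF-8 bytes, so both spellings of a name reach the same file.
    Note that an escaped "/" (%2F) ends up treated as a separator here.
    """
    raw = unquote_to_bytes(urlsplit(request_url).path)
    segments = [s for s in raw.split(b"/") if s not in (b"", b".")]
    # The two octets a UNIX file name cannot contain are NUL and "/";
    # ".." is refused as well so a request cannot climb out of the docroot.
    if b"\0" in raw or b".." in segments:
        raise ValueError("unsafe path in request")
    return b"/".join(segments)
```

The result is meant to be joined onto a document root and passed to open(); refusing ".." only after decoding matters, since "%2e%2e" would otherwise slip through.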
On Thu, Dec 24, 2020 at 01:39:16PM +0100, Omar Polo wrote: > I think we're in the same boat, as I have written from scratch my server > using only stuff that's in base on OpenBSD too. > > Initially I was totally for option #3 (but I've that I've just finished > skimming through the RFC), but by reading your messages I was a little > scared of the consequences. > > Today I did some light testing, and it seems that (IF I'm understanding > everything correctly -- please correct me otherwise) that option #3 is > actually simpler for us. > > Current state of the affairs: both Lagrange (0.13.1), amfora and elpher > will encode "gemini.omarpolo.com/caf?.gmi" as > "gemini.omarpolo.com/caf%C3%A8.gmi". Obviously open("caf%C3%A8.gmi") > fails, so my server return 51 because the actual file name is > "caf?.gmi". I have to write code that de-encode parts of the request if > I want to serve a file named like that (spoiler: I'm not gonna write it). > > With IRI: the request becomes "gemini://gemini.omarpolo.com/caf?.gmi", > so open("caf?.gmi") doesn't fail. I think that we can continue to treat > the request as a bytestring, extract the path and try to open(2) it. > > I know that what I'm proposing is a really poor-man solution, because it > doesn't matter we choose option #1, #2 or #3 as we can't really treat > the path in the URL/IRL as a bytestring and call it a day. UNIX file > names are real bytestring with only two forbidden octet, URL/IRI > aren't. > > So, if I'm not missing anything, I'm all in for option #3. You're kind of correct in the sense that if we just treat the request as arbitrary bytes and not as an IRI (no validation, no handling at all), it's simple, but I don't think that's the right way to look at this issue. Instead, it's about the complexity of proper URI handling vs proper IRI handling. Not to mention that IRIs can still have percent-encoded characters! After thinking about this for a while, the biggest issue for me is that this is a breaking change. 
Breaking in the sense that it breaks *every single compliant server we already have*! If gemini, which has been surprisingly good at resisting breaking spec changes, accepts this, I don't see any reason to believe that it won't happen again and again, for equally silly reasons. bie
On Thu, Dec 24, 2020 at 12:48:50PM +0100, marc wrote: > The insinuation was that internationalised URLs are essential > because people who don't speak english at all might not be > able to comprehend or (if their input system is sufficiently > different) generate ascii/latin text. > > And my argument is that this doesn't make sense, as > every gemini url starts with "gemini://" which is ascii text > in a language that nobody speaks anymore. And if people can manage > to type "gemini://" then a bit more ascii in the hostname or > even path should be quite manageable too even for "people who > use scripts like arabic, chinese, devanageri, etc." to quote > another list participant. > > A pity that I failed to convey this point properly - > you and I (and bie, and some others) have had a very similar > conversation on the 7th and 9th of this month (under the subject > "IDN with Gemini"), where I tried to explain my position that > I view as a language as a communications protocol and not the > property of an ethnicity or nation. > > The desire to be inclusive is good, but we are deferential > to pretty recent concept/meme - the monolingual nation state, > which is say 200 or 300 years old. Before that (at least > in europe, but elsewhere too) each little region had pretty strong > regional dialect or even language (limited mobility or literacy > allows for rapid linguistic drift). People who were educated spoke > a second or third language to interact with the clergy or the palaces > far away. > > In this regard having people know learn a new language to interact > with the internet isn't that much of an imposition, but a return > to the way things were... just scaled up to the size of the planet. Just an anecdote I briefly brought up on IRC... I briefly experimented with percent-encoded Japanese and Norwegian addresses on some of my capsules, but quickly gave up and went back to pure ASCII. 
*Not* because typing in percent-encoded names was annoying, but because I realized how hard it was to verbally convey my Japanese addresses to my Norwegian friends and vice versa.

The de facto universality of ASCII might be something to embrace, not something to run away from, if we want to be serious about being inclusive.

(marc - your posts in this thread have been great.. really appreciate them!)

bie
On Thu, Dec 24, 2020 at 10:36:43PM +0900, bie <bie at 202x.moe> wrote a message of 46 lines which said: > After thinking about this for a while, the biggest issue for me is > that this is a breaking change. Breaking in the sense that it breaks > *every single compliant server we already have*! If gemini, which > has been surprisingly good at resisting breaking spec changes, > accepts this, I don't see any reason to believe that it won't happen > again and again, As I explained in <gemini://gemi.dev/gemini-mailing-list/messages/004178.gmi>, I do not think that backward compatibility should be a goal, since Gemini is still experimental. Once the specification is "officially" "final", this will be different. AFAIK, it is not the case (otherwise, what would be the point of the [spec] topic?) To answer your question: once the spec is "officially" adopted, it makes sense to resist changes. We are not at this stage yet. > for equally silly reasons. Internationalization is certainly not a silly reason.
On Thu, Dec 24, 2020 at 03:29:48PM +0100, Stephane Bortzmeyer wrote: > As I explained in > <gemini://gemi.dev/gemini-mailing-list/messages/004178.gmi>, I do > not think that backward compatibility should be a goal, since Gemini > is still experimental. Once the specification is "officially" "final", > this will be different. AFAIK, it is not the case (otherwise, what > would be the point of the [spec] topic?) In that case you should read the first part of the current specification: "Although not finalised yet, further changes to the specification are likely to be relatively small. You can write code to this pseudo-specification and be confident that it probably won't become totally non-functional due to massive changes next week, but you are still urged to keep an eye on ongoing development of the protocol and make changes as required." Now you might consider this proposed to change to be small enough or important enough to still make sense. I do not. > To answer your question: once the spec is "officially" adopted, it > makes sense to resist changes. We are not at this stage yet. > > > for equally silly reasons. > > Internationalization is certainly not a silly reason. You don't need IRIs for internationalization. So yes, it is a silly reason. bie
On Wed, Dec 23, 2020 at 11:00:58AM +0100, marc <marcx2 at welz.org.za> wrote a message of 79 lines which said:

> So I value the decency which wants to include all
> human languages in the gemini ecosystem.

Actually, all human *scripts*. In any case, a Gemini client or server won't have to understand the language. (Mandatory AI in Gemini?)

> But in an effort to be inclusive in one dimension one ends up being
> exclusive in another dimension, namely in the space of computer
> languages/host operating systems.

We already do it with the mandatory TLS: some systems cannot run Gemini (imagine a Gemini server in assembly language).

> It is one thing to find full I8N support in a language such as
> python (slow batteries included), but what about minorities such
> tcl, lua, m4 or sed ?

Lua is not a good example since the core language is, by design, strictly limited. Any real Lua program uses several third-party libraries.

> And so it strikes me as weird to embed the (combinatorial)
> complexity of human languages deep in the protocol stack,

I agree, but nobody suggested forcing Gemini software to understand languages, only scripts.

> But even the layer just below that (the competent user level) this
> starts leaking. A gemini url starts with "gemini://" - that is ascii
> text, and even funnier, taken from latin. If a non-english user is
> confused by english (nay, latin, with no native speakers at all)
> words, then surely "gemini://" has to be rewritten as "tweling://"
> or "zwilling://" or whatever farsi, japanese or mongolian use for
> "twin". If not, then an full ascii text url should be manageable
> too...

The Web solved the problem by making the URI scheme optional. I don't know of any Gemini clients that complete the URI with "gemini://" if it's missing, but it is a possible approach.

> an url is primarily a computer address.

This is clearly false. URIs are both a technical identifier (like an IP address or an address in memory) *and* a text seen by humans and displayed in TV ads, on business cards, spoken over the phone, etc. Unlike addresses, they have to be internationalized. (Nobody would use the Web if HTTP URIs were really addresses.)

> Long ago I came across a version of (I think it was) Pascal had been
> localised into french with language keywords like "begin" and "if"
> replaced. I am sure somebody can justify this somehow, but I thought
> this was an impediment to interoperability, and view the
> internationalising of computer protocols (as opposed to the user
> interfaces) in a similar way.

The idea is to have many more users than page authors and many more page authors than programmers. Internationalizing programming languages is a different issue, since programmers are a smaller group, of professionals.
On Thu, Dec 24, 2020 at 6:49 AM marc <marcx2 at welz.org.za> wrote: > Note how the global telephone system has made it into the furthest > corners of the planet - arguably further than the internet, and did > so without worrying about internationalisation relating to their > URL equivalents (phone numbers)... As someone who grew up actually rotating a dial to enter 7, 10, or (for international calls) 15 digits, and looking them up in a paper booklet when I hadn't memorized them, the user experience *sucked*. Rectangular dials are quicker, but otherwise not that much easier to use. You could get a name-to-number mapping by voice if you had enough details (typically a postal address), but that is increasingly useless except for reaching a business. So what we have now is a system where numbers are universal and the associated names are purely local. > The insinuation was that internationalised URLs are essential > because people who don't speak english at all might not be > able to comprehend or (if their input system is sufficiently > different) generate ascii/latin text. I think that is not the point at all. In general, anglophones don't want URLs that are completely meaningless: domain names generally have meaning and so do path names and file names (consider gemini://gemini.circumlunar.space/docs/companion/robots.gmi, for example, which tells you a lot about the document it identifies). But if they are in the wrong script, ??? ?? ??? ?? ?? ??? ?? ?????????? ?? ???? ?? ?????????. In addition, ?? k?nv?n??nz ?v tr?nzl?t?re???n ?r n?t n?s?s?rili k?ns?st?nt bitwin pip?l or k?ntriz. > The desire to be inclusive is good, but we are deferential > to pretty recent concept/meme - the monolingual nation state, > which is say 200 or 300 years old. Nid yw pob gwlad yn defnyddio un iaith yn unig. (Not all countries use only one language). 
> In this regard having people learn a new language to interact > with the internet isn't that much of an imposition, but a return > to the way things were... just scaled up to the size of the planet. In imperio Romanorum, facilis est negotiator Romanus quam Gallus sive Germanus (in the Roman Empire, it is easier to be a Roman trader than a Gaulish or German one), because the Roman grew up knowing the language of trade. Likewise the anglophone today. > Isn't that yet another hint ? That the point of a language is to > communicate, not to serve as a barrier, despite the machinations > of nationalists ? > 'Málin eru höfuðeinkenni þjóðanna' - 'Languages are the chief distinguishing > marks of peoples. No people in fact comes into being until it speaks a > language of its own; let the languages perish and the peoples perish too, > or become different peoples. But that never happens except as the result of > oppression and distress.' > These are the words of a little-known Icelander of the early nineteenth > century, Sjéra Tomas Sæmundsson. He had, of course, primarily in mind the > part played by the cultivated Icelandic language, in spite of poverty, lack > of power, and insignificant numbers, in keeping the Icelanders in being in > desperate times. But the words might as well apply to the Welsh of Wales, > who have also loved and cultivated their language for its own sake (not as > an aspirant for the ruinous honour of becoming the lingua franca of the > world), and who by it and with it maintain their identity. --J.R.R. Tolkien, who was the furthest thing possible from either a nationalist or an imperialist. This is less true in Wales than it was when Tolkien wrote it, but the point is the same.
> I briefly experimented with percent-encoded Japanese and Norwegian addresses on some of my capsules, but quickly gave up and went back to pure ASCII. > *Not* because typing in percent-encoded names was annoying, but because I realized how hard it was to verbally convey my Japanese addresses to my Norwegian friends and vice versa. The de facto universality of ASCII might be something to embrace, not something to run away from, if we want to be serious about being inclusive. Verbally conveying addresses doesn't seem like a situation to optimize for; it doesn't seem to happen so often, at least in my life as a Japanese-speaking internet user. Even on such occasions among future gemininauts, I conjecture that, most of the time, both parties will speak Japanese and the address can be quickly spelled out in Japanese. For end-users, reading, following and writing links will probably be the most likely ways to interact with URLs. 1. Read/follow links with a user-friendly name/title: If the URL is non-ascii: Encoding of the URL may not matter much, since it will be hidden. If the client is capable of showing the URL upon focus or something, showing it in unicode is far more accessible than percent-encoding 2. Read/follow links with bare URL: If the URL is non-ascii: more accessible to be able to read the URL in its non-ascii form 3. Write links to URLs that I control: More inclusive and convenient to be able to use and write URLs using the script that I'm used to. 4. Write links to URLs that I don't control: It'll be more accessible/convenient to be able to write the URL in non-ascii characters. Copying a non-ascii URL off of a web browser's address bar will probably percent-encode it (just tried it on desktop Chrome), but I shouldn't have to rely on such tools. While embracing ASCII may work when we have control over URLs we read and write, it falls short in terms of accessibility when linking to, say, Wikipedia, which uses non-ascii page names. 
If the aim is to support i18n/inclusivity as a principle/ideal/a 100% thing, adopting standards such as IRI/IDN(/ASCII) may make sense; if the motivation is out of practical concerns (whether people will find themselves reading and writing non-ascii URLs a lot and we want to make their lives easier in that case), having clients percent-encode path components before sending requests may suffice for now? From my standpoint, chances/expectations of a particular component of a URL having non-ascii characters: - protocol: none - domain: 2% of the time (8.3 million IDNs [1] / total domain names 370.7 million [2]) - but, for me, nearly none in practice. I suppose it depends on the person - path/query/fragment: fairly often, since I use (Japanese) Wikipedia a lot [1] https://idnworldreport.eu/ (2020 Q1) [2] https://www.verisign.com/en_US/domain-names/dnib/index.xhtml (2020 Q3)
On Fri, Dec 25, 2020 at 01:02:32AM -0800, spinner wrote: > Verbally conveying addresses doesn't seem like a situation to optimize for; > doesn't seem to happen so often, at least in my life as a Japanese-speaking > internet user. Even among such occasions among future gemininauts, I > conjecture that, most of the time, both parties will speak Japanese and the > address can be quickly spelled out in Japanese. Verbally conveying URLs and usernames is a situation I find myself in at least monthly... even more often before COVID. When both parties speak the same language, sure, it's more or less fine, but trying to explain an address with a character the other party has no idea how to input is an exercise in frustration. > For end-users, reading, following and writing links probably will be the > most likely ways you interact with URLs. > > 1. Read/follow links with a user-friendly name/title: If the URL is > non-ascii: Encoding of the URL may not matter much, since it will be > hidden. If the client is capable of showing the URL upon focus or > something, showing it in unicode is far more accessible than > percent-encoding Agreed. This can be handled *today* by clients with no change in the protocol. > 2. Read/follow links with bare URL: If the URL is non-ascii: more > accessible to be able to read the URL in its non-ascii form Agreed. Again, this can be handled *today* by clients with no change to the protocol. > 3. Write links to URLs that I control: More inclusive and convenient to be > able to use and write URLs using the script that I'm used to. > 4. Write links to URLs that I don't control: It'll be more > accessible/convenient to be able to write the URL in non-ascii characters. I'd actually say it's just *slightly* more convenient. In most cases you'll be copying and pasting the URL. If the gemini community feels that a breaking change that increases the complexity of implementing servers
On Thu, Dec 24, 2020 at 12:48:50PM +0100, marc <marcx2 at welz.org.za> wrote a message of 91 lines which said: > In this regard having people learn a new language to interact > with the internet isn't that much of an imposition, Especially if it is *my* script. Imagine a Chinese person asking that all URIs be in Chinese characters because people can learn a new script, after all. I bet that many proponents of ASCII URIs would not be so happy.
On Thu, Dec 24, 2020 at 10:49:21PM +0900, bie <bie at 202x.moe> wrote a message of 48 lines which said: > but because I realized how hard it was to verbally convey my > Japanese addresses to my Norwegian friends and vice versa. I don't see the point, anyway. If the address (the URI) uses the Japanese script, it is probably because the content is in Japanese and/or is interesting only for people who are in Japan. Therefore either your Norwegian friend is in one of these two cases, or you wouldn't tell him/her the address anyway. > The de facto universality of ASCII No, the Latin script (and even more the ASCII character set) is not universal (even if it would be simpler for me).
On Wed, Dec 23, 2020 at 09:26:24PM +0100, cage <cage-dev at twistfold.it> wrote a message of 17 lines which said: > > The individual path segments actually. > > > > So, given /Foo/Bar/Baz, decompose the path into individual > > segments Foo, Bar, and Baz, encode these, and reconstruct the > > path. Easy-peasy. I don't think that percent-encoding has to be done per path segment. I don't find anything in RFC 3986 that makes your algorithm mandatory. "/" is a safe character, anyway so it seems to me that you can percent-encode the entire path in one operation.
On Fri, Dec 25, 2020 at 05:57:56PM +0100, Stephane Bortzmeyer wrote: Hi! > On Wed, Dec 23, 2020 at 09:26:24PM +0100, > cage <cage-dev at twistfold.it> wrote > a message of 17 lines which said: > > > > The individual path segments actually. > > > > > > So, given /Foo/Bar/Baz, decompose the path into individual > > > segments Foo, Bar, and Baz, encode these, and reconstruct the > > > path. Easy-peasy. > > I don't think that percent-encoding has to be done per path segment. I > don't find anything in RFC 3986 that makes your algorithm > mandatory. "/" is a safe character, anyway so it seems to me that you > can percent-encode the entire path in one operation. Please correct me if I am wrong: does this mean that, given a path like "/è/à/c", it is safe to send to the server "%2F%C3%A8%2F%C3%A0%2Fc" instead of "/%C3%A8/%C3%A0/c"? I can see that percent-decoding both strings above returns the same result: the first path. Could this be the reason why no splitting is needed? Bye! C.
> On Dec 25, 2020, at 13:07, bie <bie at 202x.moe> wrote: > > All that said, I'll make another attempt at leaving this discussion (and the mailing list) again... Still don't get it. You have a perfectly functional gemini server, written in C, using nothing but the OpenBSD base system. Fabulous. Furthermore, you sail through your English-Japanese-Norwegian workflow by the simple expedient of transliterating all identifiers to US-ASCII, à la Unidecode! Terrific. To top it all, you can dictate the resulting identifiers, in plain English, over a rotary phone line to your trilingual Japanese-Norwegian friends. Much excellent. All in all, everything is covered. Nothing to add. Nothing to take away. All set. If tomorrow, Gemini adopts IRIs, nothing changes for you. Your setup is fully upward compatible. You do not have to lift a finger to keep going. All stays exactly the same for you. Of course, no one on your setup can use IRIs. Only URIs. But they don't want IRIs anyway. No loss. Arguably, your setup may not be fully compliant with the letter of the spec. No big deal. 99% there. No one is going to sue you. Just a hobby. But it's working. Today. For your needs. On the other hand, it doesn't work for me. I do not like transliteration. I want native. I want my Kabuki file to be named 👹.gmi, and not kabuki.gmi, nor xn--7q8h.gmi, nor %F0%9F%91%B9.gmi. Nor any other weird encodings. 👹.gmi it is. I do not want to type 👹.gmi. I want to copy & paste. I do not type identifiers by hand, nor do I dictate them over the phone. Ever. It's error prone. And annoying. But that's me. I do not want to be dragged to the lowest of the lowest common denominator just because you cannot be bothered to support Unicode. But that's just me. What I want is Unicode. Because I like to name my file 👹.gmi. It's 2020. And it's important to me. Moving to IRIs allows me to use Unicode file names. While not breaking anything on your side. Staying with URIs prevents me from doing what I want. 
While not changing anything for you. Why do you want to prevent me from using the names I want? I do not tell you how to name your files. Why do you want to tell me? This could be construed as rude.
> On Dec 25, 2020, at 22:10, Stephane Bortzmeyer <stephane at sources.org> wrote: > > On Fri, Dec 25, 2020 at 10:07:37PM +0100, > Petite Abeille <petite.abeille at gmail.com> wrote > a message of 8 lines which said: > >>> I don't think that percent-encoding has to be done per path segment. >> >> Reserved Characters gen-delims "/" > > So? You are not meant to encode the path separator if you would like to preserve the path semantics. If you do, you turn the entire path into one segment. Which is certainly not the desired effect most of the time.
> On Dec 25, 2020, at 20:55, cage <cage-dev at twistfold.it> wrote: > "/è/à/c" > > it is safe to send to the server > > "%2F%C3%A8%2F%C3%A0%2Fc" > > instead of > > "/%C3%A8/%C3%A0/c" Those are two different paths. The first one has one segment, with encoded separators. The second one has 3 segments, properly encoded. Which matches the semantics of your original path, which sports 3 segments: è, à, and c.
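A note for readers following along: the difference between the two encodings discussed above can be checked with a few lines of Python. This is only a sketch using the standard library's `urllib.parse`; the helper name `encode_path` is made up for illustration and is not taken from any Gemini client.

```python
from urllib.parse import quote, unquote

def encode_path(path: str) -> str:
    """Percent-encode each segment on its own, leaving the "/" separators alone.

    Hypothetical helper for illustration only.
    """
    return "/".join(quote(segment, safe="") for segment in path.split("/"))

path = "/è/à/c"
per_segment = encode_path(path)   # separators survive as literal "/"
whole = quote(path, safe="")      # separators get encoded as %2F

print(per_segment)  # /%C3%A8/%C3%A0/c
print(whole)        # %2F%C3%A8%2F%C3%A0%2Fc

# Both strings decode to the same text...
print(unquote(per_segment) == unquote(whole))  # True

# ...but a server that splits on "/" before decoding sees a different structure.
print(per_segment.split("/"))  # ['', '%C3%A8', '%C3%A0', 'c']
print(whole.split("/"))        # ['%2F%C3%A8%2F%C3%A0%2Fc']
```

Decoding hides the difference, which is why the two look interchangeable at first glance; the segment structure is only visible before decoding.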
> On Dec 24, 2020, at 12:48, marc <marcx2 at welz.org.za> wrote: > > deferential to pretty recent concept/meme - the monolingual nation state ( wat? ) If I wish, out of juvenile impertinence, to name my file 🖕.gmi then I should be able to do so without further ado. I would find it patronizing to be forced to type %F0%9F%96%95.gmi .
> > 4. Write links to URLs that I don't control: It'll be more > > accessible/convenient to be able to write the URL in non-ascii characters. > > I'd actually say it's just *slightly* more convenient. In most cases > you'll be copying and pasting the URL. I realized the original list missed one distinction: reading links using a client as a reader vs reading links using an editor as a content author. So the initial authoring may be done through copy-pasting, but revisiting that piece afterwards can leave you unsure about exactly which URL a link is pointing to, if it's all percent-encoded.
> On Dec 22, 2020, at 16:13, Solderpunk <solderpunk at posteo.net> wrote: > > Okay, I'm finally getting involved in this discussion. Thought exercise: Each time you see URI, replace it with MORSE CODE. Each time you see IRI, replace it with ASCII. Debriefing at noon.
> On Dec 23, 2020, at 03:54, bie <bie at 202x.moe> wrote: > > doesn't have to be able to parse JPEGs to serve images. ( wat? ) Presently text/gemini mandates 3 different encodings in a link: => gemini://punycode/url-encoded utf8 3 different encodings in one line. 3 in 1. Moving to IRI cleans this up to 1 encoding, utf8. 1 in 1.
Just want to let you know that your email client does not properly send an In-Reply-To header and breaks threading. -- Leo
> On Dec 26, 2020, at 00:29, Leo <list at gkbrk.com> wrote: > > Just want to let you know that your email client does not properly send > an In-Reply-To header and breaks threading. This is not my experience, but I would be happy to be proven wrong, and fix it if necessary. Could you be more specific? What does break, exactly?
Petite Abeille <petite.abeille at gmail.com> writes: >> On Dec 23, 2020, at 03:54, bie <bie at 202x.moe> wrote: >> >> doesn't have to be able to parse JPEGs to serve images. > > ( wat? ) > > Presently text/gemini mandates 3 different encodings in a link: > > => gemini://punycode/url-encoded utf8 > > 3 different encodings in one line. > > 3 in 1. > > Moving to IRI cleans this up to 1 encoding, utf8. > > 1 in 1. Not really. I know basically nothing about punycode so I can't comment on that, but IRI allows percent encoding too.
bie <bie at 202x.moe> writes: > > You're kind of correct in the sense that if we just treat the request as > arbitrary bytes and not as an IRI (no validation, no handling at all), > it's simple, but I don't think that's the right way to look at this > issue. Instead, it's about the complexity of proper URI handling vs > proper IRI handling. Not to mention that IRIs can still have > percent-encoded characters! Sorry it took so long to reply, but I took some time to fix up my server and now here I am :) Originally, when I wrote my server I did a really simple routine to extract the path from a url and that's it. (plus minor checking) This wasn't good, of course. In the last two days I took the time to first write a proper URL parser[0], and then extend it to support IRIs[1]. Turns out, once you have a URL parser (not hard to do at all), you almost have a complete IRI parser. As Sean wrote, you basically have to replace the unreserved rule to allow other utf8 characters and you're done. And even if you're uncomfortable doing this, the RFC lists the valid ranges, so adding a couple of checks isn't the end of the world (if you want to be 100% compliant, whatever that means). (And all of this comes from someone who has never, ever, implemented an IRI/URI parser before, who read RFC 3986 for the first time while writing the code and has successfully -- I believe -- implemented a full IRI parser in less than 500 lines of C, with comments and everything, without using anything other than the standard library. Heck, the parser doesn't even allocate memory.) > After thinking about this for a while, the biggest issue for me is that > this is a breaking change. Breaking in the sense that it breaks *every > single compliant server we already have*! If gemini, which has been > surprisingly good at resisting breaking spec changes, accepts this, I > don't see any reason to believe that it won't happen again and again, > for equally silly reasons. 
> > bie I don't buy this argument. It's not like tomorrow we won't be able to browse gemini unless we update clients/servers. Valid URIs are also valid IRIs, so it's not armageddon. The whole thing started (IIRC) because the spec says "UTF8 URI". Furthermore, the spec isn't finalised yet (see for instance the change regarding full url vs relative ones in the requests). If you wrote your server for yourself, you probably won't need to change anything: from what you wrote, I assume you're serving only files whose names are ASCII only, so unless you want to host things with funny names, you're probably good. Anyway, sorry for the long reply, I didn't want to drag this discussion out too much, really. Let's see what will be decided :) Cheers! [0]: https://github.com/omar-polo/gmid/commit/33d32d1fd66a577f22f3f33f238e8dac44ec9995 [1]: https://github.com/omar-polo/gmid/commit/df6ca41da36c3f617cbbf3302ab120721ebfcfd2
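To make the "replace the unreserved rule" step concrete, here is a rough sketch in Python. It is not the gmid code (which is C, see the commits linked above); the function names are made up, and the code point ranges are the `ucschar` ranges as I read them in RFC 3987, section 2.2.

```python
import string

# RFC 3986 "unreserved": ALPHA / DIGIT / "-" / "." / "_" / "~"
URI_UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

# RFC 3987 "ucschar" ranges: three BMP ranges, planes 1 to 13 minus their
# last two code points, and part of plane 14.
UCSCHAR_RANGES = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    + [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(1, 0xE)]
    + [(0xE1000, 0xEFFFD)]
)

def is_unreserved(ch: str) -> bool:
    """URI rule: ASCII letters, digits and a few marks only."""
    return ch in URI_UNRESERVED

def is_iunreserved(ch: str) -> bool:
    """IRI rule: the URI rule plus the ucschar ranges (hypothetical helper)."""
    if ch in URI_UNRESERVED:
        return True
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in UCSCHAR_RANGES)

for ch in ["a", "è", "日", "/", " "]:
    print(repr(ch), is_unreserved(ch), is_iunreserved(ch))
```

Everything else in a URI parser (splitting scheme, authority, path, handling percent escapes) can stay as it is under this approach; only the character class check grows.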
> On Dec 26, 2020, at 01:28, Omar Polo <op at omarpolo.com> wrote: > > Not really. I don't know basically anything about punycode so I can't > comment on that, but IRI allows percent encoding too. Surprise me. Here is my IRI: gemini://🎭/👹.gmi Show me your URI, and then justify why it is a good thing for me.
Petite Abeille <petite.abeille at gmail.com> writes: >> On Dec 26, 2020, at 01:28, Omar Polo <op at omarpolo.com> wrote: >> >> Not really. I don't know basically anything about punycode so I can't >> comment on that, but IRI allows percent encoding too. > > Surprise me. Here is my IRI: > > gemini://🎭/👹.gmi > > Show me your URI, and then, justify why it a good thing for me. Sorry, I wasn't saying that gemini://🎭/%F0%9F%91%B9.gmi [0] is better than gemini://🎭/👹.gmi (it is not). Rather, that even with IRIs you don't want to delete the percent-decoding code in your parser. [0] idn2 refuses to punycode 🎭 :/
> On Dec 26, 2020, at 01:41, Omar Polo <op at omarpolo.com> wrote: > > Sorry Let's recap: IRI: gemini://🎭/👹.gmi URI: gemini://xn--el8h/%F0%9F%91%B9.gmi + reserved characters, which always need to be encoded, irrespective of any other consideration.
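For anyone who wants to reproduce the recap, here is a small Python sketch of the IRI-to-URI mapping. Assumptions, stated loudly: `iri_to_uri` is a made-up helper, and it uses Python's raw `punycode` codec for the host, which performs no IDNA validation at all (a stricter tool may refuse labels that this happily converts, as noted above for idn2).

```python
from urllib.parse import quote

def iri_to_uri(host: str, path: str) -> str:
    """Map an IRI's host and path to their ASCII-only URI forms.

    Hypothetical helper: raw punycode per host label (no IDNA checks),
    UTF-8 percent-encoding per path segment.
    """
    labels = []
    for label in host.split("."):
        if label.isascii():
            labels.append(label)
        else:
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    ascii_host = ".".join(labels)
    ascii_path = "/".join(quote(seg, safe="") for seg in path.split("/"))
    return f"gemini://{ascii_host}{ascii_path}"

print(iri_to_uri("🎭", "/👹.gmi"))
# gemini://xn--el8h/%F0%9F%91%B9.gmi
```

This shows the two separate ASCII encodings a URI needs (punycode for the host, percent-encoding for the path), against the single UTF-8 encoding of the IRI form.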
bie <bie at 202x.moe> writes: > After thinking about this for a while, the biggest issue for me is > that this is a breaking change. Breaking in the sense that it breaks > *every single compliant server we already have*! I think that's a little dramatic. Looking at my server, I need to make a change in exactly one place: when mapping IRI paths to file paths, I can no longer use the url decoding library I was using to decode URI paths, because it mangles Unicode characters. But since I can now be sure that only IRI reserved characters are encoded, I can just do a simple substring substitution. It's also a change that is backward-compatible with old clients. -- Jason McBrayer (jmcbray at carcosa.net) | "Strange is the night where black stars rise, and strange moons circle through the skies, but stranger still is lost Carcosa." | Robert W. Chambers, The King in Yellow
> gemini://?/?.gmi Must we permit question marks in Gemini domains at all?
> On Dec 26, 2020, at 03:54, Steve Phillips <steve at tryingtobeawesome.com> wrote: > > Must we permit question marks in Gemini domains at all? (The reply was an image attachment, nq050816.gif, scrubbed by the list archive.)
On Sat Dec 26, 2020 at 1:32 AM CET, Omar Polo wrote: > In the last two days I took the time to write first a proper URL > parser[0], and then extending it to support IRIs[1]. Turns out, once > you have a URL parser (not hard to do at all), you almost have a > complete IRI parser. As Sean wrote, you basically have to replace the > unreserved rule to allow other utf8 characters and you're done. And > even if you're uncomfortable doing this, the RFC lists the valid ranges, > so adding a couple of checks isn't the end of the world (if you want to > be 100% compliant, whatever that means). > > (And all of this comes from one that has never, ever, implemented an > IRI/URI parser before, that has read for the first time the rfc3986 > while writing the code and has successfully -- I believe -- implemented > a full IRI parser in less than 500 lines of C, with comments and > everything, without using anything other than the standard library. > Heck, the parser doesn't even allocate memory.) This is, more and more, how I'm conceptualising things. Parsing/validating IRIs is not actually remotely difficult at all. Algorithmically it's an extremely minor change to parsing/validating URIs. The apparent pain exists only because the world has apparently been very slow about packaging code up for this into major libraries/languages, probably because HTTP's ASCII-only nature reduces demand. If we adopt IRIs, I would actually encourage Gemini software authors who find their language lacking tools for this not to write custom code for it that lives only in their software, but to actually try to get the functionality accepted upstream into standard libraries, or widely used third-party libraries. This is generally useful functionality that's in no way Gemini-specific, and having easy support for it everywhere makes the world a better place regardless of whether Gemini thrives or declines. 
I don't really think the alleged difficulty of handling IRIs is a good argument against accepting them. I'm now more interested in learning/thinking about normalisation issues, which have been relatively under discussed so far. It's possible this is where the real trouble lies. Breaking a UTF-8 IRI up into (scheme, authority, path) is not a substantial hurdle. Cheers, Solderpunk
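To illustrate the kind of normalisation issue hinted at above, here is a short Python sketch using the standard `unicodedata` module. The file name `café.gmi` is an invented example, and whether Gemini should mandate a normalisation form at all is left open in this thread; the snippet only shows why the question arises.

```python
import unicodedata
from urllib.parse import quote

composed = "caf\u00e9.gmi"     # "é" as a single code point (NFC form)
decomposed = "cafe\u0301.gmi"  # "e" followed by a combining acute accent (NFD form)

# The two strings render identically but are not equal...
print(composed == decomposed)  # False

# ...and they produce different bytes on the wire.
print(quote(composed))    # caf%C3%A9.gmi
print(quote(decomposed))  # cafe%CC%81.gmi

# Normalising both sides to one form (NFC here, as an assumption) makes them match.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

A server doing a byte-for-byte file name lookup would treat these as two different resources, so a client and a server (or a server and its file system) that disagree on the form can fail to find a file that visibly exists.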
On Thu Dec 24, 2020 at 12:48 PM CET, marc wrote: > Note how the global telephone system has made it into the furthest > corners of the planet - arguably further than the internet, and did > so without worrying about internationalisation relating to their > URL equivalents (phone numbers)... This is not really a compelling comparison at all. Even if different languages and cultures use different words and symbols for numbers, the overwhelming majority of them use base 10, meaning there is a straightforward and unambiguous mapping between them all. Almost anybody can read/write and say/hear a phone number in their native language, making it much easier to memorise and transmit them. I don't know if it was ever done (I wouldn't be at all surprised if it was), but it would be no technical problem at all to manufacture either a DTMF or rotary phoneset which had ?, ?, ?, etc. printed on it instead of 1, 2, 3 and have it work correctly anywhere on Earth. Even if this couldn't be done and people had to learn a foreign system of numeric symbols, the fact that there's only 10 of them and that they map directly to native equivalents makes them much easier to learn. And, of course, the Arabic numeral system was already widely used across many languages and cultures before the phone system arrived, and people in all cultures already had practice reading, writing and memorising numeric values for many other reasons (calendars and money are ancient technology). Cheers, Solderpunk
On Fri, Dec 25, 2020 at 11:16:08PM +0100, Petite Abeille wrote: Hi! > > > On Dec 25, 2020, at 20:55, cage <cage-dev at twistfold.it> wrote: > > "/è/à/c" > > > > it is safe to send to the server > > > > "%2F%C3%A8%2F%C3%A0%2Fc" > > > > instead of > > > > "/%C3%A8/%C3%A0/c" > > Those are two different paths. > The first one has one segment, with encoded separators. The second > one has 3 segments, properly encoded. Which matches the semantic of > your original path, which sport 3 segments è, à, and c. This makes sense to me! Thanks! C.
On Thu Dec 24, 2020 at 3:29 PM CET, Stephane Bortzmeyer wrote: > Once the specification is "officially" "final", > this will be different. AFAIK, it is not the case (otherwise, what > would be the point of the [spec] topic?) Anybody could be forgiven for not inferring it from actually looking at what gets posted to [spec], but the main reason there's a venue for discussing spec finalisation at all is that there are still lots of things to be done at the level of "crossing t's and dotting i's". For example, somebody recently reminded me off-list that the spec is still silent on the question of whether or not servers need to use TLS's close_notify mechanism once they're done sending a response, or whether it's okay to simply close the TCP connection. Or see also the fragment-related stuff that people have been posting about. Stuff like this, that is to say small but important technical details and edge cases, is actually what I consider the most important task of the [spec] topic. This stuff ought to be finalised before a formal RFC-style specification can be written up and potentially submitted to IETF. The possible change to using IRIs is *by far* the most major change I have considered making in probably a year. I do not expect to ever consider anything this large again, ever (meaning that the fear of adopting IRIs being a slippery slope to more drastic changes in the future is unfounded). I'm taking my time on it because it's a major change and because internationalisation is IMHO a very important issue, but make no mistake - I cannot *wait* for it to be done so we can focus on the smaller, hopefully much less contentious, details and get the whole thing finalised. I really consider Gemini 100% complete in terms of scope/capabilities. People are doing wonderful things with it as is, it's basically everything I ever dreamed of. I am very ready to transition to spending 10 x more time and energy reading and writing Gemini content than managing the protocol. 
Cheers, Solderpunk
> This is, more and more, how I'm conceptualising things. > Parsing/validating IRIs is not actually remotely difficult at all. > Algorithmically it's an extremely minor change to parsing/validating > URIs. The apparent pain exists only because the world has apparently > been very slow about packaging code up for this into major > libraries/languages, probably because HTTP's ASCII-only nature reduces > demand. If we adopt IRIs, I would actually encourage Gemini software > authors who find their language lacking tools for this not to write > custom code for it that lives only in their software, but to actually > try to get the functionality accepted upstream into standard libraries, > or widely used third-party libraries. This is generally useful > functionality that's in no way Gemini-specific, and having easy support > for it everywhere makes the world a better place regardless of whether > Gemini thrives or declines. > > I don't really think the alleged difficulty of handling IRIs is a good > argument against accepting them. I'm now more interested in > learning/thinking about normalisation issues, which have been relatively > under discussed so far. It's possible this is where the real trouble > lies. Breaking a UTF-8 IRI up into (scheme, authority, path) is not a > substantial hurdle. This is enough of a decision for me, so I'm out. I'm not one to stand in the way of "progress", however misguided, so I've taken down my 4 gemini servers. bie
On Sat Dec 26, 2020 at 4:12 PM CET, bie wrote: > This is enough of a decision for me, so I'm out. I'm not one to stand in > the way of "progress", however misguided, so I've taken down my 4 gemini > servers. I'm, genuinely and sincerely, sorry to hear this. Thanks for having run them for the time you did. The final decision is still to be made and in the event that I end up backtracking on this, I hope you'll reconsider. Cheers, Solderpunk
> On Dec 26, 2020, at 16:12, bie <bie at 202x.moe> wrote: > > so I've taken down my 4 gemini servers. This is your prerogative. Sounds like a tantrum though. Disappointing both ways. C'est la vie.
> On Dec 26, 2020, at 16:56, Petite Abeille <petite.abeille at gmail.com> wrote: > > tantrum rage-quit from 6 years old on: rage-quit, verb, INFORMAL, US, angrily abandon an activity or pursuit that has become frustrating, especially the playing of a video game. Now we know.
On Sat Dec 26, 2020 at 5:10 PM CET, Petite Abeille wrote: > > On Dec 26, 2020, at 16:56, Petite Abeille <petite.abeille at gmail.com> wrote: > > > > tantrum > > rage-quit from 6 years old on: > > rage-quit, verb, INFORMAL, US, angrily abandon an activity or pursuit > that has become frustrating, especially the playing of a video game. People are free to shut down their servers whenever they want, whyever they want. There's no need to taunt them for it. Please just let it go. Cheers, Solderpunk
> On Dec 26, 2020, at 17:12, Solderpunk <solderpunk at posteo.net> wrote: > > People are free to shut down their servers whenever they want, whyever > they want. Agree. > There's no need to taunt them for it. There is a qualitative difference in publicly "threatening" to do so: one can always move on quietly. > Please just let it go. Water under the bridge.
I'm pretty sure this is not true. In all cases (URI/IRI), percent-encoding is allowed for any character, so the server has to percent-decode path segments before using them to match a file. Unless Gemini goes for a stricter specification which only allows percent-encoding for reserved characters. But this may break a lot of client code.
> On Dec 26, 2020, at 18:19, Côme Chilliet <come at chilliet.eu> wrote: > > percent-decode path segments Do try it and then report back your findings.
> On Dec 26, 2020, at 20:06, Petite Abeille <petite.abeille at gmail.com> wrote: > >> On Dec 26, 2020, at 18:19, Côme Chilliet <come at chilliet.eu> wrote: >> >> percent-decode path segments > > Do try it and then report back your findings. For example, do try to round-trip one path segment: "A/B Testing". Do show what happens in each case.
> Feedback welcome, especially if I've overlooked anything, which is > certainly possible. What I'd be most interested in hearing, at this > point, is client authors letting me know whether the standard library > in the language their client is implemented in can straightforwardly: > > 1. Parse and relativise URLs with non-ASCII characters (so, yes, okay, > technically not URLs at all, you know what I mean) in paths and/or > domains? > 2. Transform back and forth between URIs and IRIs? > 3. Do DNS lookups of IDNs without them being punycoded first? You can > test this with räksmörgås.josefsson.org. The main language I use for Gemini software is Go. My clients, Amfora and gemget, are both written in Go, and they use Go's built-in URL library, called "net/url". This library cannot properly handle 1, 2, or 3. This is likely because the Go stdlib is high quality, and appears to be coded to follow RFCs very strictly, and the library was only designed to support URLs, and not IRIs. For example, it will accept invalid characters in the path when parsing the URL, but when converting it back into a string, it will percent-encode the invalid characters. This does not happen with the query string, though. The fact that paths and query strings are treated differently makes converting IRIs to URIs not straightforward. And doing the reverse would require taking the bits of the parsed URL and then decoding them compliantly, and then stitching them together manually. As for #3, the Go stdlib looks up the domain in the URL as-is, and will not punycode anything. I have had to do it myself, which was annoying but not super difficult. Amfora and gemget both have support for IDNs. See the link below for how IDN support was added, if it's of interest.
https://github.com/makeworld-the-better-one/go-gemini/compare/a557676343c51dabbc7d5a112d38bb8095db94d7...2f79af7688e88942d0d51d6ed65617b68a91a733
I believe these difficulties have implications for whether or not IRIs should be added to the spec, but I'd rather let this email and the facts of the matter stand on their own. makeworld
On Fri, Dec 25, 2020 at 10:54:17PM +0100, Petite Abeille wrote: > If tomorrow, Gemini adopts IRIs, nothing changes for you. Your setup is fully upward compatible. You do not have to lift a finger to keep going. > > All stays exactly the same for you. > > Of course, no one on your setup can use IRIs. Only URIs. But they don't want IRIs anyway. No loss. > > Arguably, your setup may not be fully compliant with the letter of the spec. No big deal. 99% there. No one is going to sue you. Just a hobby. Been there, done that. Best viewed with browser %s. Several such "improvements", and a client that was a first-class citizen becomes something like w3m or lynx on the modern web, and another dream is ruined in pursuit of aesthetics.
> On Dec 26, 2020, at 21:38, Dmitry Bogatov <gemini#lists.orbitalfox.eu#v1 at kaction.cc> wrote: > > Been there, done that. Best viewed with browser %s. Fair point. But this concerns a server. Not a client. The server in question will never handle Unicode identifiers. Nor does it need to. Zero practical impact. Plus, really, we are already in the "best viewed with xyz" age. Compare and contrast, say, Lagrange [1] and Amfora [2]. They are both great. In their own different ways. Are you suggesting a normative user experience? [1] https://github.com/skyjake/lagrange [2] https://github.com/makeworld-the-better-one/amfora
On Fri, Dec 25, 2020 at 11:58 AM Stephane Bortzmeyer <stephane at sources.org> wrote: > I don't think that percent-encoding has to be done per path segment. I > don't find anything in RFC 3986 that makes your algorithm > mandatory. "/" is a safe character, anyway so it seems to me that you > can percent-encode the entire path in one operation. Once any necessary punycoding has been done (look for // on the left and either / or end-of-string on the right), the whole URI can have its non-ASCII characters %-encoded all at once.
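The two-step conversion described above (punycode the host, then %-encode everything non-ASCII in one pass) might look roughly like the Python sketch below. The names `iri_to_uri` and `percent_encode_non_ascii` and the example host are invented for illustration. Python's built-in `idna` codec follows the older IDNA 2003 rules, and userinfo is ignored, so this is a sketch under those assumptions rather than a compliant converter.

```python
from urllib.parse import urlsplit, urlunsplit


def percent_encode_non_ascii(text):
    """Percent-encode the UTF-8 bytes of each non-ASCII character; leave ASCII untouched."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)


def iri_to_uri(iri):
    """Hypothetical helper: punycode the host, then %-encode the rest in one pass."""
    parts = urlsplit(iri)
    host = parts.hostname or ""
    # Assumption: the stdlib "idna" codec (IDNA 2003) is good enough for a sketch
    netloc = host.encode("idna").decode("ascii")
    if parts.port is not None:
        netloc += f":{parts.port}"
    rest = [percent_encode_non_ascii(p) for p in (parts.path, parts.query, parts.fragment)]
    return urlunsplit((parts.scheme, netloc, *rest))


uri = iri_to_uri("gemini://räksmörgås.example/café/menu")
print(uri)
```

Note that this pass only touches non-ASCII characters. It does nothing about reserved ASCII characters such as "/" inside a segment, which is a separate question.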
> On Dec 26, 2020, at 22:10, John Cowan <cowan at ccil.org> wrote: > > the whole URI can have its non-ASCII characters %-encoded all at once
Right. But that was not Stephane's problem, which was related to how to encode Reserved Characters gen-delims "/" in a path. Consider the following 3 path segments: "Research", "A/B Testing", "Results". Stephane asserts the following encodings are equivalent:
Research%2FA%2FB%20Testing%2FResults
vs.
Research/A%2FB%20Testing/Results
They are clearly not. The first variant will result in one path segment, with data loss. While the second one will preserve the original semantics, with 3 segments, individually encoded, and intact. They are not equivalent paths. Try it in your favorite library.
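For anyone who wants to try it, here is a small Python sketch of the two encodings using the standard `urllib.parse.quote`. The variable names are made up for this example; only the three segments come from the message above.

```python
from urllib.parse import quote

segments = ["Research", "A/B Testing", "Results"]

# Encode each segment on its own, so a "/" inside a segment becomes %2F
per_segment = "/".join(quote(s, safe="") for s in segments)

# Encode the already-joined path in one operation, treating every "/" as data
whole = quote("/".join(segments), safe="")

print(per_segment)  # Research/A%2FB%20Testing/Results
print(whole)        # Research%2FA%2FB%20Testing%2FResults

# Counting literal "/" separators shows the structural difference
print(len(per_segment.split("/")), len(whole.split("/")))  # 3 1
```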
> On Dec 26, 2020, at 21:25, colecmac at protonmail.com wrote: > > This is likely because the Go stdlib is high quality, and appears to be coded to follow RFCs very strictly, and the library was only designed to support URLs, and not IRIs. Would a strict ANTLR grammar for IRI help? https://pkg.go.dev/bramp.net/antlr4/iri https://github.com/antlr/grammars-v4/blob/master/iri/IRI.g4
> On Dec 26, 2020, at 21:25, colecmac at protonmail.com wrote: > > appears to be coded to follow RFCs very strictly Looking at url/url.go, the implementation seems more pragmatic than dogmatic: https://github.com/golang/go/blob/master/src/net/url/url.go A fascinating read. But a strict RFC grammar it's not. It has gone through a long and tumultuous history: https://github.com/golang/go/commits/49a210eb87da6b7ac960cac990337ef4dc113b0d/src/net/url/url.go Either way, would preprocessing an IRI into a URL help? Similarly to the Java situation.
It was thus said that the Great Petite Abeille once stated: > > On Dec 26, 2020, at 22:10, John Cowan <cowan at ccil.org> wrote: > > > > the whole URI can have its non-ASCII characters %-encoded all at once > > Right. But that was not Stephane's problem, which was related to how to > encode Reserved Characters gen-delims "/" in a path. > > Consider the following 3 path segments: "Research", "A/B Testing", > "Results". > > Stephane asserts the following encodings are equivalent: > > Research%2FA%2FB%20Testing%2FResults > > vs. > Research/A%2FB%20Testing/Results > > They are clearly not. The first variant will result in one path segment, > with data loss. While the second one will preserve the original semantics, > with 3 segments, individually encoded, and intact. > > They are not equivalent paths. Try it in your favorite library.
It was interesting to see the Go URL library you linked to. For your two examples, it will return the following structures:
{ Path = "Research/A/B Testing/Results", RawPath = "Research%2FA%2FB%20Testing%2FResults", }
{ Path = "Research/A/B Testing/Results", RawPath = "Research/A%2FB%20Testing/Results", }
and it's up to the client to check RawPath if it's *really* necessary to make the distinction (meaning---the client *still* has to parse the path). A more normal example like "Research/ABTesting/Results" will result in:
{ Path = "Research/ABTesting/Results", RawPath = "", }
so it's not like RawPath will always have the path. For the record, my own URL parsing library will just return
Research/A/BTesting/Results
for both samples. I found it easier to work with that than what I was doing at the time (pedantically correct, painfully hard to use in practice). You would be hard pressed to actually create a file named "A/B Testing" on any file system I know of (and not have it be "B Testing" in the "A" directory).
If there *is* a file system that allows slashes in a filename (and not just as a separator between directories) then I might revisit my decision, but until then ... -spc
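The trade-off described in this message can be reproduced in a few lines of Python. This is an illustrative sketch using the standard `urllib.parse.unquote`, not the Lua library being discussed; `split_then_decode` is a made-up name.

```python
from urllib.parse import unquote

one_segment = "Research%2FA%2FB%20Testing%2FResults"
three_segments = "Research/A%2FB%20Testing/Results"

# Decoding the whole path as a single string: both inputs collapse to the
# same text, so the encoded-versus-literal "/" distinction is gone.
print(unquote(one_segment))
print(unquote(three_segments))


def split_then_decode(raw_path):
    """Split on literal "/" first, then percent-decode each piece."""
    return [unquote(piece) for piece in raw_path.split("/")]


# Splitting before decoding keeps the two inputs distinguishable.
print(split_then_decode(one_segment))
print(split_then_decode(three_segments))
```

Which behaviour is preferable depends on what the path is matched against: a file system cannot store a "/" inside a name anyway, while a database key can.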
> On Dec 27, 2020, at 00:22, Sean Conner <sean at conman.org> wrote: > > You would be hard pressed to actually create a file named "A/B Testing" on any file system I know of There is more to life than a file system, a database for example. Let's not conflate the limitations of the two.
> On Dec 27, 2020, at 00:22, Sean Conner <sean at conman.org> wrote: > > For the record, my own URL parsing library will just return > > Research/A/BTesting/Results Tragic. I take back my assessment of your LPEG grammar. It's clearly wrong. Oh well.
It was thus said that the Great Petite Abeille once stated: > > > > On Dec 27, 2020, at 00:22, Sean Conner <sean at conman.org> wrote: > > > > For the record, my own URL parsing library will just return > > > > Research/A/BTesting/Results > > Tragic. I take back my assessment of your LPEG grammar. It's clearly > wrong. Oh well.
Okay, given your two examples:
Research%2FA%2FB%20Testing%2FResults
Research/A%2FB%20Testing/Results
what should a "proper" URL parser return? And how should client code handle such a construct? Perhaps even attempt to write a URL (or IRI) parser yourself? At one point, my URL parser would return the following for these:
{ path = { "Research/A/B Testing/Results", } }
{ path = { "Research", "A/B Testing", "Results", } }
but I found working with such paths to be painful. First off, how to distinguish between
Research/A%2FB%20Testing/Results
and
/Research/A%2FB%20Testing/Results
How would I specify that any URL with a path starting with "/foo" be redirected to a path starting with "/bar"?
/foo/this -> /bar/this
/foobar -> /barbar
And how would I deal with this in the code? Yes, you can say I ruined the purity of my URL parser with an ugly pragmatic approach (keep the path a string, but decoded and ignore the semantics of encoded delims), but there's also the saying, "Perfect is the enemy of good." [1] -spc [1] https://en.wikipedia.org/wiki/Perfect_is_the_enemy_of_good
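To make the redirect question concrete, here is a hedged Python sketch of the same "/foo" to "/bar" rule against both representations. The function names and the default `old`/`new` arguments are invented for this example and do not come from any library mentioned in the thread.

```python
def redirect_string(path, old="/foo", new="/bar"):
    """Decoded-string representation: one prefix test covers both cases."""
    if path.startswith(old):
        return new + path[len(old):]
    return path


def redirect_segments(segments, old="foo", new="bar"):
    """Segment-list representation: the prefix test must look inside the first segment."""
    if segments and segments[0].startswith(old):
        return [new + segments[0][len(old):]] + segments[1:]
    return segments


print(redirect_string("/foo/this"))        # /bar/this
print(redirect_string("/foobar"))          # /barbar
print(redirect_segments(["foo", "this"]))  # ['bar', 'this']
print(redirect_segments(["foobar"]))       # ['barbar']
```

Both versions are short, but the segment list also needs a separate flag to tell "Research/..." apart from "/Research/...", which is the other difficulty raised above.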
> On Dec 27, 2020, at 00:52, Sean Conner <sean at conman.org> wrote: > > "Perfect is the enemy of good." Agree. My own parsers are on the pragmatic side of the spectrum (even though path segments are preserved, as I tend to use databases rather than file systems). I was hoping you were a better person than I am, to borrow your own line. I suspect I should stop hoping for a full-fledged LPEG grammar for MIME emerging from Conman's lab :/ Oh well. We are all flawed. Skynet will just crash and segfault. No one cares. Even on a mailing list dedicated to designing a protocol, one ends up being "pedantic". I now feel the same rage-quit as bie. On the plus side, next time someone dares to mention any RFCs, just punch them in the face. Life is too short. Let's stop pretending.
It was thus said that the Great marc once stated: > > I have to deal with the telephony network at work. It *is* the OSI seven > > layer burrito [1] and even *there* there are baked in assumptions relating > > to i18n [2]. Text is limited to ASCII. Yup. 7-bit US-ASCII in all its > > glory. Anything else requires some very nasty hacks. > > Note how the global telephone system has made it into the furthest > corners of the planet - arguably further than the internet, and did > so without worrying about internationalisation relating to their > URL equivalents (phone numbers)... Phone numbers are their own special Hell [1]. The point I was making is that yes, the SS7 protocol, used by telephone companies around the world, isn't i18n clean. And it's not like SS7 was developed in the 1920s ... that's all I'm saying here. I work for a company that translates phone numbers (like 800-555-1212) to human readable names (like "The ACME Company") for delivery to the cell phone receiving a phone call (so instead of getting "800-555-1212" you get "The ACME Company"). It took a tremendous amount of engineering to work around the SS7 limitation of 15 US-ASCII characters (and it's a hack really). But hey, it's US-ASCII only, so it's "simple" ... -spc [1] I have to deal with phone numbers as given to us by the Oligarchic Cell Phone Companies. You would think that we would be given valid phone numbers as defined by them, but you would be wrong. We get complete trash along with good. And then my manager's manager wants us to pass along all invalid NANP [2] numbers along with the valid NANP numbers [3], while excluding all valid international numbers ... [2] North America Numbering Plan, which includes the US, Canada and the Caribbean, but excludes Mexico and countries south of it. [3] Our product is only designed for the US.
This makes it interesting because Canada and the Caribbean aren't the US, but are part of the NANP, which means some "area codes" are actually "country codes" in disguise, but I digress ...
It was thus said that the Great Petite Abeille once stated: > > On Dec 27, 2020, at 00:52, Sean Conner <sean at conman.org> wrote: > > > > "Perfect is the enemy of good." > > Agree. My own parsers are on the pragmatic side of the spectrum (even > though path segments are preserved, as I tend to use databases rather than > file systems). How do you preserve them? As the encoded "%2F"? Do you convert the encoded values to uppercase? Lowercase? Keep them the same? > I was hoping you were a better person than I am, to borrow > your own line. But you said it yourself, you fall on the pragmatic side. > I suspect I should stop hoping for a full-fledged LPEG grammar for MIME > emerging from Conman's lab :/ Well, I do have one [1], although I'm not sure how "full-fledged" it is. I also lowercase the actual MIME type (so "TEXT/PLAIN" will become "text/plain") to make it easier to use the results. I even have one for email [2], which can even parse RFC-822 style email addresses [3], but I'm rethinking how I parse Internet messages as I'm not entirely happy with my current approach. > Oh well. We are all flawed. Skynet will just crash and segfault. > > No one cares. Even on a mailing list dedicated to designing a protocol, > one ends up being "pedantic". > > I now feel the same rage-quit as bie. > > On the plus side, next time someone dares to mention any RFCs, just punch > them in the face. Life is too short. Life is too short to follow the WhatWG "standard" [4], so I guess it's a "pick your poison" type situation. > Let's stop pretending. Yeah, let's roll our own crypto and addressing scheme! What can possibly go wrong? -spc [1] https://github.com/spc476/LPeg-Parsers/blob/master/mimetype.lua [2] https://github.com/spc476/LPeg-Parsers/blob/master/email.lua [3] Muhammed.(I am the greatest) Ali @(the)Vegas.WBA [4] https://url.spec.whatwg.org/#concept-url-parser
> On Dec 27, 2020, at 01:26, Sean Conner <sean at conman.org> wrote: > > How do you preserve them? As the encoded "%2F"? Do you convert the > encoded values to uppercase? Lowercase? Keep them the same? The hex codes themselves? I tend to normalize them to uppercase. Tradition or something. But perhaps we are talking about different things? Or? > But you said it yourself, you fall on the pragmatic side. Yes. Different pragmatism I guess. Strictly speaking it also depends on the context. In general, fail-fast systems have better survivability odds. But at times, one would rather extract as much as one can from imperfect data. It depends on the problem at hand. > Well, I do have one [1], although I'm not sure how "full-fledged" it is. Yes, but this only concerns itself with the content-type header. I mean MIME multipart constructs. > Life is too short to follow the WhatWG "standard" [4], so I guess it's a > "pick your poison" type situation. Fair enough. > Yeah, let's roll our own crypto and addressing scheme! What can possibly > go wrong? Now, that would be crazy :)
It was thus said that the Great Petite Abeille once stated: > > > > On Dec 27, 2020, at 01:26, Sean Conner <sean at conman.org> wrote: > > > > How do you preserve them? As the encoded "%2F"? Do you convert the > > encoded values to uppercase? Lowercase? Keep them the same? > > The hex codes themselves? I tend to normalize them to uppercase. Tradition > or something. But perhaps we are talking about different things? Or? I meant: If I gave your URL parsers the string Research/A%2fB%20Testing/Results what would I, as a user, get back? Would I get a string back? An array of segments? An actual example would be nice. > > Well, I do have one [1], although I'm not sure how "full-fledged" it is. > > Yes, but this only concerns itself with the content-type header. I mean > MIME multipart constructs. Ah. See, I haven't needed that much functionality yet (and I suspect I could use my email parsers for that if I really needed it). -spc [1] Missing footnote.
> On Dec 27, 2020, at 02:19, Sean Conner <sean at conman.org> wrote: > > I meant: If I gave your URL parsers the string > > Research/A%2fB%20Testing/Results > > what would I, as a user, get back? Would I get a string back? An array of > segments? An actual example would be nice.
Ultimately, a list of path segments, yes. Similar to your first (correct) example. With both an absolute and directory indicator. In the case above, 3 segments. Not absolute, not a directory. These segments are then decoded to whatever string they represent, i.e. segment[ 2 ] would contain the string "A/B Testing". As originally provided. The URL can always round trip. If not, something is very wrong. The same problem applies to, say, representing a URL in a URL. So, given the following path segments, "cache", "gemini://host/path", and "content", the resulting path should be:
cache/gemini%3A%2F%2Fhost%2Fpath/content
And not:
cache%2Fgemini%3A%2F%2Fhost%2Fpath%2Fcontent
Which is clearly nonsensical.
> Ah. See, I haven't needed that much functionality yet (and I suspect I > could use my email parsers for that if I really needed it). Yes, email.lua would handle a message/rfc822 part. It's a start :) https://github.com/spc476/LPeg-Parsers/blob/master/email.lua
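A rough Python sketch of the representation described above: a list of decoded segments plus an absolute flag and a directory flag. The names `parse_path` and `build_path` are hypothetical, empty paths and "//" edge cases are handled only loosely, and the code is not taken from any parser mentioned in the thread.

```python
from urllib.parse import quote, unquote


def parse_path(raw_path):
    """Return (absolute, segments, directory); split on "/" before decoding."""
    absolute = raw_path.startswith("/")
    directory = raw_path.endswith("/")
    core = raw_path[1:] if absolute else raw_path
    if directory and core:
        core = core[:-1]
    segments = [unquote(piece) for piece in core.split("/")] if core else []
    return absolute, segments, directory


def build_path(absolute, segments, directory):
    """Re-encode each segment on its own, then join with literal "/"."""
    body = "/".join(quote(s, safe="") for s in segments)
    return ("/" if absolute else "") + body + ("/" if directory and segments else "")


parsed = parse_path("Research/A%2fB%20Testing/Results")
print(parsed)               # (False, ['Research', 'A/B Testing', 'Results'], False)
print(build_path(*parsed))  # Research/A%2FB%20Testing/Results

nested = build_path(False, ["cache", "gemini://host/path", "content"], False)
print(nested)               # cache/gemini%3A%2F%2Fhost%2Fpath/content
```

In this sketch the round trip is exact on the decoded segments, while the raw string comes back with its hex digits uppercased. Python lists are zero-indexed, so the "A/B Testing" segment is `parsed[1][1]` here rather than segment[ 2 ].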
---
Previous Thread: [spec] What to do of fragments when there is a redirection