💾 Archived View for gemi.dev › gemini-mailing-list › 000144.gmi captured on 2023-11-04 at 12:28:29. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
Greetings,

I got a bug report recently for Bombadillo about how I have been handling query strings. I had accidentally left a bug in where I was chaining queries:

some.site/thing.gmi?hello?world

They brought up, and it seemed clear to me, that the query should be replaced and not added on to. So I have done that update (and it will be a part of a big release soon, for which I will notify the list when it happens).

The other thing they brought up was escaping before sending to the server. Bombadillo does not escape querystrings currently (for gopher or for gemini). A look at the spec provides no further explanation re: what is expected: escaped or unescaped? The spec does point to an RFC that no doubt encourages the escaping of querystrings (I have not read through the RFC recently)...

However, in practice, servers offering search (GUS and Houston) do not seem to unescape a query that is sent to them. I have not tested places other than those two, but neither seems to do so. If I send:

some.site/thing.gmi?This%2C+is+a+string

GUS reports the search string as:

"This%2C+is+a+string"

But it should, if escaping and unescaping is expected, report it as:

"This, is a string"

I think it would be good to clearly state what is expected of clients and servers regarding the escaping of querystring values for gemini.
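For illustration, here is a minimal Python sketch of the "replace, don't chain" behaviour described above, with the user's input escaped before it is sent. The helper name `with_query` is hypothetical, and the sketch is not Bombadillo's code (Bombadillo is written in Go); it only relies on the standard `urllib.parse` module.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def with_query(url: str, user_input: str) -> str:
    """Return url with any existing query replaced by the escaped input."""
    parts = urlsplit(url)
    # safe="" makes quote() escape reserved characters such as "/" as well;
    # spaces become %20 rather than "+".
    escaped = quote(user_input, safe="")
    # _replace swaps the query component instead of appending a second "?..."
    return urlunsplit(parts._replace(query=escaped))


print(with_query("gemini://some.site/thing.gmi?hello", "world"))
print(with_query("gemini://some.site/thing.gmi", "This, is a string"))
```

The first call yields `gemini://some.site/thing.gmi?world` rather than `...?hello?world`, and the second escapes the comma and the spaces.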
On Sun, May 24, 2020 at 06:33:56PM +0200, Brian Evans wrote:

> I think it would be good to clearly state what is expected of clients and servers
> regarding the escaping of querystring values for gemini.

Clients should definitely be URL escaping their queries, and servers should be unescaping them at their end.

If this isn't done, then the things that clients send to servers aren't genuine, RFC-compliant URLs. I would actually expect that any server using a URL-parsing function from a decent library would get an error from that function if attempting to parse an unescaped URL, and would in turn give a permanent failure status back to the Gemini client.

Yet another client torture test?

Cheers,
Solderpunk
According to RFC3986:

> The URI syntax provides a method of encoding data, presumably for the
> sake of identifying a resource, as a sequence of characters. The URI
> characters are, in turn, frequently encoded as octets for transport
> or presentation. This specification does not mandate any particular
> character encoding for mapping between URI characters and the octets
> used to store or transmit those characters. When a URI appears in a
> protocol element, the character encoding is defined by that protocol;
> without such a definition, a URI is assumed to be in the same
> character encoding as the surrounding text.

Given the above, it seems that it is on gemini to define the encoding used for URIs. In particular this passage:

> This specification does not mandate any particular
> character encoding for mapping between URI characters and the octets
> used to store or transmit those characters.

To my knowledge gopher does not URL encode querystrings (such that they are in gopher - not generally compliant with other URI RFCs). I have left them unencoded in my recent release for Bombadillo, in line with what I perceive to be community standards, until that community standard changes or the spec explicitly makes clear the expectation.

I imagine Sean will have some good information here to bring to the table, as he has read various RFCs in greater detail than I have. If anyone, not just Sean, thinks I have misinterpreted the above mentioned RFC or that there is not a need for a clear and explicit rule for how queries should be encoded in the spec, please let me know.
GUS uses Jetforce, which unquotes queries before passing them along [1]. I'm not sure exactly what happened in your test cases Brian, but typically I see escaped queries in the server logs (which I assume is due to most clients automatically escaping their users' queries for them).

[1] https://github.com/michael-lazar/jetforce/blob/b5f4235535d8eabad5a15cdf634f6d6149b37c29/jetforce/app/base.py#L64

And then from testing myself just now, when I submit a GUS query using any of the big-name clients, I see my own query A) show up in GUS server logs as escaped, and B) show up in the GUS query results page as unescaped.

In the server logs, I think I found your queries, and one thing I notice is that you're passing in pluses (`+`) for spaces, which I don't actually think get handled by the standard quoting/unquoting machinery (at least in Python). In Python, there's a separate `unquote_plus()` function [2] which says it is "like unquote(), but also replaces plus signs by spaces, as required for unquoting HTML form values."

So... I'm actually not sure if that's something Gemini should respect, given the lack of forms.

[2] https://docs.python.org/3/library/urllib.parse.html#urllib.parse.unquote_plus

On Sun, May 24, 2020 at 04:42:42PM +0000, solderpunk wrote:
> On Sun, May 24, 2020 at 06:33:56PM +0200, Brian Evans wrote:
>
> > I think it would be good to clearly state what is expected of clients and servers
> > regarding the escaping of querystring values for gemini.
>
> Clients should definitely be URL escaping their queries, and servers
> should be unescaping them at their end.
>
> If this isn't done, then the thing that clients send to servers aren't
> genuine, RFC-compliant URLs. I would actually expect that any server
> using a URL-parsing function from a decent library would get an error
> from that function if attempting to parse an unescaped URL, and would
> in turn give a permanent failure status back to the Gemini client.
>
> Yet another client torture test?
>
> Cheers,
> Solderpunk
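The difference between the two Python functions mentioned above can be seen with the query string from the start of the thread. This is only a sketch using the standard library, not a reproduction of what Jetforce or GUS actually do internally.

```python
from urllib.parse import unquote, unquote_plus

raw = "This%2C+is+a+string"

# unquote() only decodes %XX sequences; '+' is left alone.
plain = unquote(raw)
# unquote_plus() additionally turns '+' into a space (HTML form convention).
plus = unquote_plus(raw)

print(plain)  # This,+is+a+string
print(plus)   # This, is a string
```

So a server that calls `unquote()` would report the pluses back to the user, which is consistent with a client that sends `+` for spaces seeing its query only partly decoded.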
Hmm.

Perhaps we have a little bit of wiggle room, but my reading of RFC 3986 section 2.2 leads me to believe that, if nothing else, the characters:

":", "/", "?", "#", "[", "]", and "@"

are reserved characters in the generic URI scheme (which I presume means they are also reserved in all protocol-specific schemes), so if those appear in a client's response to a query prompt, they *need* to be encoded somehow - and for obvious reasons!

You're right, however, that exactly how they are encoded is up to us. That's a surprise to me!

So the spec will have to address this, but IMHO it's barely even up for discussion how to proceed here - we do it how HTTP does it so we can leverage the pre-existing libraries in every programming language under the sun, in keeping with the idea that Gemini machinery should be easily assembled from existing parts as much as possible.

Cheers,
Solderpunk
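One of the "obvious reasons" can be sketched in Python: an unescaped "#" in the user's input is parsed as the start of a fragment, so part of the query is lost. The host `example.org` and the input string are made up for the example.

```python
from urllib.parse import quote, unquote, urlsplit

user_input = "C# or F#?"

# Naive client: input appended as-is, so the first '#' starts a fragment.
naive = "gemini://example.org/search?" + user_input
naive_query = urlsplit(naive).query

# Escaping client: reserved characters are percent-encoded first.
escaped = "gemini://example.org/search?" + quote(user_input, safe="")
escaped_query = urlsplit(escaped).query

print(naive_query)             # C
print(escaped_query)           # C%23%20or%20F%23%3F
print(unquote(escaped_query))  # C# or F#?
```

Only the escaped form survives a round trip through a standard URL parser with the user's text intact.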
Actually I don't know if "big-name" is accurate - I think it's probably just a biased set of clients in languages I frequently build code in. Sorry! That may have come across as dismissive of other clients, which I didn't intend.

But, FWIW, of those clients I already had around, here is a quick check of which ones escape and don't escape with a current build:

Bombadillo : does not escape
Elpher     : escapes
AV-98      : escapes
Castor     : escapes
On Sun, May 24, 2020 at 01:33:34PM -0400, Natalie Pendragon wrote:

> Sorry! That may have come across as dismissive of other
> clients, which I didn't intend.

I'm sorry too, Brian, if my initial quick reply came across as dismissive of your question. I had thought this was much more tightly sewn up by the URI RFC.

Cheers,
Solderpunk
Definitely no worries.

I am a little bummed that I did not get this cleared up before a major release I just did. I just went through multi-hour (it needs to get improved) cross-compilation and website updates and do not relish doing so again this weekend. So this may sit for a short bit, but will get updated in the not-too-distant future for Bombadillo. I have added the issue on the tildegit [1] and will try to get to it this week.

I definitely agree that escaping should occur, but since my client was originally built as a gopher client I was not escaping there... and just never updated anything in that regard when gemini got added.

The Go net/url package's QueryEscape [2] uses '+' for space rather than '%20'. So it seems that is why I was running into thinking people were not supporting escaping. What is the advice here? It seems that Python does not, by default, unescape a '+' to a space, and a number of servers are using Python. Further, it seems that '+' can be, optionally, supported as representing a space... this seems like it can/will lead to problems. My guess is that Molly Brown will support it fine, but that Jetforce and many others may not.

[1] https://tildegit.org/sloum/bombadillo/issues/161#issuecomment-5137
[2] https://golang.org/pkg/net/url/#QueryEscape
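Python's standard library has the same split on the encoding side, which may make the problem easier to see: `quote_plus()` follows the form convention (space becomes '+', much like Go's QueryEscape), while `quote()` emits '%20'. The sketch below is an illustration only and assumes nothing about how any particular Gemini server decodes.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = "1+1 = 2"

form_style = quote_plus(text)        # space -> '+', literal '+' -> %2B
percent_style = quote(text, safe="")  # space -> %20, literal '+' -> %2B

print(form_style)     # 1%2B1+%3D+2
print(percent_style)  # 1%2B1%20%3D%202

# The %20 form decodes identically with either decoder...
print(unquote(percent_style), "|", unquote_plus(percent_style))
# ...while the '+' form depends on which decoder the server picked.
print(unquote(form_style), "|", unquote_plus(form_style))
```

A client that always sends '%20' (and '%2B' for a literal plus) gets the same result from both kinds of server, which is one argument for the percent-only form.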
It was thus said that the Great Brian Evans once stated:
> Greetings,
> I got a bug report recently for Bombadillo about how I have been handling
> query strings.

[ snip ]

> I think it would be good to clearly state what is expected of clients and
> servers regarding the escaping of querystring values for gemini.

There are three standards being conflated here. They are:

[CGI] RFC-3875
[URI] RFC-3986
[WEBFORM] https://www.w3.org/TR/html401/interact/forms.html

I'm going to try to do a summary here (if anyone is interested in the gory details, check the docs listed above). To encode a URL (per [URI]), the following characters can be used AS IS:

    ALPHA DIGIT - . _ ~

and the following characters MUST always be encoded [1]:

    % < > [ ] { } | \ ^ SPACE CONTROL NON-ASCII

How the characters in neither list are treated depends upon where in the URL they appear (more on that below).

Encoding a character means converting it to its hex value and preceding it with a '%':

    ##% -> %23%23%25

Each section of a URL (scheme, authority [2], path, query, fragment) allows certain characters that would otherwise be encoded to NOT be encoded. I'll concentrate on the query portion since that's the part under question. The query portion allows the following characters to appear non-encoded:

    ALPHA DIGIT - . _ ~ / ? : @

The '=' and '&' are used as sub-delimiters (to separate name and value, and to separate name/value pairs). If a '=' or '&' appears in a name or the value, it has to be encoded.

The '+' sign is listed as a sub-delimiter in [URI], but [URI] otherwise says nothing about it. [CGI] and [WEBFORM] define it differently. [CGI] allows it, but *only* if '=' and '&' aren't used (section 4.4):

    ...?one+two+three          '+' ALLOWED
    ...?one+two=3&three=3      '+' DISALLOWED

And in this case, the '+' is to be treated as a space. In any other case, the space needs to be encoded:

    ...?query=what%20is%20this%20madness&lang=en   DEFINED
    ...?query=what+is+this+madness&lang=en         UNDEFINED

[WEBFORM] defines the '+' to be a space, but only when the data is being sent as part of a POST, and the content type is "application/x-www-form-urlencoded". This doesn't apply at all to Gemini.

Now, it could be that there are webservers (or CGI scripts) that convert '+' to spaces regardless. I'm just saying ...

Hopefully, this clears it all up (said as he wipes the mud off his face).

-spc (Don't hesitate to ask any questions ... )

[1] You'd be hard pressed to see these listed in [URI] since they aren't listed! RFC-1738 lists those characters explicitly, so that's four references. Sorry.

[2] [URI] calls the host portion "authority".
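The summary above can be turned into a small hand-rolled encoder. This is a sketch, not a reference implementation: the function name `encode_query` is hypothetical, and converting the text to UTF-8 before encoding is an assumption, since the thread does not settle which character encoding Gemini should use. Everything outside the "AS IS" set and the extra query characters is percent-encoded, including '=', '&' and '+'.

```python
import string

# Characters usable as-is anywhere in a URL, per the summary above.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
# Extra characters the query portion may carry unencoded.
QUERY_EXTRA = set("/?:@")


def encode_query(text: str) -> str:
    """Percent-encode text for use as a single query value."""
    out = []
    for byte in text.encode("utf-8"):  # assumption: UTF-8 octets
        ch = chr(byte)
        if ch in UNRESERVED or ch in QUERY_EXTRA:
            out.append(ch)
        else:
            # Two uppercase hex digits preceded by '%'.
            out.append(f"%{byte:02X}")
    return "".join(out)


print(encode_query("##%"))                   # %23%23%25
print(encode_query("what is this madness"))  # what%20is%20this%20madness
```

In practice a library routine such as Python's `urllib.parse.quote` does the same job; the loop is spelled out here only to mirror the rules as listed.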
I think it might just make the most sense to say in the spec that encoding is required, and should be done with percent signs, for spaces too. Like in Sean's message:

?query=what%20is%20this%20madness&lang=en

makeworld
On Mon, May 25, 2020 at 03:54:32PM +0000, colecmac at protonmail.com wrote:

> I think it might just make the most sense to say in the spec that
> encoding is required, and should be done with percent signs, for
> spaces too. Like in Sean's message:
>
> ?query=what%20is%20this%20madness&lang=en

Yeah, I will do this, once I find the time to dig through the relevant RFCs and make sure I can explain what should be done succinctly and unambiguously using all the right terminology.

If this isn't done in like a week or so, somebody poke me!

Cheers,
Solderpunk
---