💾 Archived View for gemi.dev › gemini-mailing-list › 000144.gmi captured on 2023-11-04 at 12:28:29. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

Query Strings

Brian Evans <b__m__e (a) mailfence.com>

Greetings,
I got a bug report recently for Bombadillo about how I have been handling query strings.
I had accidentally left a bug in where I was chaining queries:

some.site/thing.gmi?hello?world

They brought up, and it seemed clear to me, that the query should be replaced and not
added on to. So I have done that update (and it will be a part of a big release soon, for
which I will notify the list when it happens).
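
A minimal sketch of the "replace, don't chain" behaviour, written in Python rather than Bombadillo's Go. The helper name `with_query` and the gemini:// prefix on the example URL are assumptions for illustration; this is not Bombadillo's actual code.

```python
from urllib.parse import quote, urlsplit

def with_query(url: str, user_input: str) -> str:
    """Return url with any existing query replaced by user_input."""
    parts = urlsplit(url)
    # Percent-encode everything outside the unreserved set, spaces included.
    encoded = quote(user_input, safe="")
    # _replace swaps the old query out, so "?hello" never becomes "?hello?world".
    return parts._replace(query=encoded).geturl()

print(with_query("gemini://some.site/thing.gmi?hello", "world"))
```

Splitting the URL first and swapping only the query component avoids string concatenation, which is where a chained "?hello?world" tends to come from.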

The other thing they brought up was escaping before sending to the server. Bombadillo
does not escape querystrings currently (for gopher or for gemini).

A look at the spec provides no further explanation re: what is expected: escaped or
unescaped? The spec does point to an RFC that no doubt encourages the escaping
of querystrings (I have not read through the RFC recently)... 

However, in practice, servers offering search (GUS and Houston) do not seem to
unescape a query that is sent to them. I have not tested places other than those two
but neither seems to do so. If I send:

some.site/thing.gmi?This%2C+is+a+string

GUS reports the search string as: "This%2C+is+a+string"
But it should, if escaping and unescaping is expected, report it as: "This, is a string"

I think it would be good to clearly state what is expected of clients and servers
regarding the escaping of querystring values for gemini.



solderpunk <solderpunk (a) SDF.ORG>

On Sun, May 24, 2020 at 06:33:56PM +0200, Brian Evans wrote:
 
> I think it would be good to clearly state what is expected of clients and servers
> regarding the escaping of querystring values for gemini.

Clients should definitely be URL escaping their queries, and servers
should be unescaping them at their end.

If this isn't done, then the things that clients send to servers aren't
genuine, RFC-compliant URLs.  I would actually expect that any server
using a URL-parsing function from a decent library would get an error
from that function if attempting to parse an unescaped URL, and would
in turn give a permanent failure status back to the Gemini client.
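
The division of labour described here can be sketched in Python as follows. The function names `client_request_url` and `server_query` are hypothetical, and the gemini:// URL is made up for the example; how strict a given library's URL parser is about unescaped input will vary, so no error handling is shown.

```python
from urllib.parse import quote, unquote, urlsplit

def client_request_url(base: str, user_input: str) -> str:
    # Client side: escape the raw input before appending it as the query.
    return base + "?" + quote(user_input, safe="")

def server_query(request_url: str) -> str:
    # Server side: split the URL first, then unescape only the query component.
    return unquote(urlsplit(request_url).query)

url = client_request_url("gemini://some.site/thing.gmi", "This, is a string")
print(url)
print(server_query(url))
```

If both ends follow this pattern, the server recovers exactly the text the user typed.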

Yet another client torture test?

Cheers,
Solderpunk


Brian Evans <b__m__e (a) mailfence.com>

According to RFC3986:
>   The URI syntax provides a method of encoding data, presumably for the
>  sake of identifying a resource, as a sequence of characters.  The URI
>   characters are, in turn, frequently encoded as octets for transport
>   or presentation.  This specification does not mandate any particular
>   character encoding for mapping between URI characters and the octets
>   used to store or transmit those characters.  When a URI appears in a
>   protocol element, the character encoding is defined by that protocol;
>   without such a definition, a URI is assumed to be in the same
>   character encoding as the surrounding text.

Given the above, it seems that it is on gemini to define the encoding used
for URIs.

In particular, this passage:
>   This specification does not mandate any particular
>   character encoding for mapping between URI characters and the octets
>   used to store or transmit those characters.

To my knowledge gopher does not URL encode querystrings (such as they
are in gopher, which is not generally compliant with other URI RFCs).

I have left them unencoded in my recent release for Bombadillo in line
with what I perceive to be community standards until that community
standard changes or the spec explicitly makes clear the expectation.

I imagine Sean will have some good information here to bring to the table,
as he has read various RFCs in greater detail than I have. If anyone, not just
Sean, thinks I have misinterpreted the above mentioned RFC or that there
is not a need for a clear and explicit rule for how queries should be encoded
in the spec, please let me know.


Natalie Pendragon <natpen (a) natpen.net>

GUS uses Jetforce, which unquotes queries before passing them along
[1]. I'm not sure exactly what happened in your test cases Brian, but
typically I see escaped queries in the server logs (which I assume is
due to most clients automatically escaping their users' queries for
them).

[1] https://github.com/michael-lazar/jetforce/blob/b5f4235535d8eabad5a15cdf634f6d6149b37c29/jetforce/app/base.py#L64

And then, from testing myself just now, when I submit a GUS query using
any of the big-name clients, I see my own query A) show up in GUS
server logs as escaped, and B) show up in the GUS query results page
as unescaped.

In the server logs, I think I found your queries, and one thing I
notice is that you're passing in pluses (`+`) for spaces, which I
don't actually think get handled by the standard quoting/unquoting
machinery (at least in Python). In Python, there's a separate
`unquote_plus()` method [2] which says it is "like unquote(), but also
replaces plus signs by spaces, as required for unquoting HTML form
values." So... I'm actually not sure if that's something Gemini should
respect, given the lack of forms.

[2] https://docs.python.org/3/library/urllib.parse.html#urllib.parse.unquote_plus
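
For illustration, here is how the two standard-library functions treat the test string from earlier in the thread. This only shows Python's `urllib.parse` behaviour; it is not a claim about what any particular Gemini server does with the result.

```python
from urllib.parse import unquote, unquote_plus

raw = "This%2C+is+a+string"

# unquote() decodes %XX sequences but leaves '+' untouched.
plain = unquote(raw)
# unquote_plus() additionally turns '+' into a space (HTML form convention).
form_style = unquote_plus(raw)

print(plain)       # This,+is+a+string
print(form_style)  # This, is a string
```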



solderpunk <solderpunk (a) SDF.ORG>

Hmm.  Perhaps we have a little bit of wiggle room, but my reading of RFC
3986 section 2.2 leads me to believe that, if nothing else, the
characters:

":", "/", "?", "#", "[", "]", and "@"

are reserved characters in the generic URI scheme (which I presume means
they are also reserved in all protocol-specific schemes), so if those
appear in a client's response to a query prompt, they *need* to be
encoded somehow - and for obvious reasons!

You're right, however, that exactly how they are encoded is up to us.
That's a surprise to me!  So the spec will have to address this, but
IMHO it's barely even up for discussion how to proceed here - we do it
how HTTP does it so we can leverage the pre-existing libraries in every
programming language under the sun, in keeping with the idea that Gemini
machinery should be easily assembled from existing parts as much as
possible.
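
A small Python sketch of the "obvious reasons": if a reserved character such as "#" is left unencoded in the user's input, a standard URL parser assigns part of the input to a different URL component. The host `some.site`, the path `/search` and the sample input are made up for the example.

```python
from urllib.parse import quote, urlsplit

user_input = "what is C#?"

# Unencoded: "#" starts a fragment, so the query is cut short.
naive = urlsplit("gemini://some.site/search?" + user_input)
print(repr(naive.query), repr(naive.fragment))

# Encoded with safe="" so that every reserved character is escaped.
encoded = quote(user_input, safe="")
safe_url = urlsplit("gemini://some.site/search?" + encoded)
print(repr(safe_url.query), repr(safe_url.fragment))
```

In the first case the parser sees the query as "what is C" and treats the rest as a fragment; in the second the whole input survives inside the query component.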

Cheers,
Solderpunk


Natalie Pendragon <natpen (a) natpen.net>

Actually I don't know if "big-name" is accurate - I think it's
probably just a biased set of clients in languages I frequently build
code in. Sorry! That may have come across as dismissive of other
clients, which I didn't intend. But, FWIW, of those clients I already
had around, here is a quick check of which ones escape and don't
escape with a current build:

Bombadillo : does not escape
Elpher     : escapes
AV-98      : escapes
Castor     : escapes


solderpunk <solderpunk (a) SDF.ORG>

On Sun, May 24, 2020 at 01:33:34PM -0400, Natalie Pendragon wrote:

> Sorry! That may have come across as dismissive of other
> clients, which I didn't intend.

I'm sorry too, Brian, if my initial quick reply came across as
dismissive of your question.  I had thought this was much more tightly
sewn up by the URI RFC.

Cheers,
Solderpunk


Brian Evans <b__m__e (a) mailfence.com>

Definitely no worries. I am a little bummed that I did not get this cleared up before a major
release I just did. I just went through a multi-hour (it needs to get improved) cross-compilation
and website update, and do not relish doing so again this weekend. So this may sit for a short
bit, but will get updated in the not distant future for Bombadillo. I have added the issue on
the tildegit[1] and will try to get to it this week.

I definitely agree that escaping should occur, but since my client was originally built as a
gopher client I was not escaping there... and just never updated anything in that regard
when gemini got added. The Go net/url module's QueryEscape[2] uses '+' for space rather
than '%20'. So it seems that is why I came to think people were not supporting
escaping. What is the advice here? It seems that Python does not, by default, unescape a '+'
as a space and a number of servers are using Python. Further, it seems that '+' can be,
optionally, supported as representing a space... this seems like it can/will lead to problems.

My guess is that Molly Brown will support it fine, but that Jetforce and many others may
not.

[1] https://tildegit.org/sloum/bombadillo/issues/161#issuecomment-5137
[2] https://golang.org/pkg/net/url/#QueryEscape
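
The mismatch can be reproduced entirely within Python's standard library, where `quote_plus` plays a role similar to the '+'-for-space encoding described above and `quote` produces '%20'. This is a sketch of the general problem, not a test of Go's QueryEscape or of any specific server; the sample text is made up.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = "1 + 1"

percent_style = quote(text, safe="")  # spaces become %20
plus_style = quote_plus(text)         # spaces become '+'

# A %20-encoded query decodes correctly with either decoder...
print(unquote(percent_style), "|", unquote_plus(percent_style))
# ...but a '+'-encoded query only decodes correctly with unquote_plus().
print(unquote(plus_style), "|", unquote_plus(plus_style))
```

Because '%20' round-trips through both decoders while '+' depends on the server's choice, encoding spaces as '%20' is the safer option for a client.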


Sean Conner <sean (a) conman.org>

It was thus said that the Great Brian Evans once stated:
> Greetings,
> I got a bug report recently for Bombadillo about how I have been handling
> query strings. 

  [ snip ]

> I think it would be good to clearly state what is expected of clients and
> servers regarding the escaping of querystring values for gemini.

  There are three standards being conflated here.  They are:

	[CGI]	  RFC-3875
	[URI]	  RFC-3986
	[WEBFORM] https://www.w3.org/TR/html401/interact/forms.html

  I'm going to try to do a summary here (if anyone is interested in the gory
details, check the docs listed above).  To encode a URL (per [URI]), the
following characters can be used AS IS:

	ALPHA DIGIT - . _ ~

and the following characters MUST always be encoded [1]:

	% < > [ ] { } | \ ^ SPACE CONTROL NON-ASCII

  Whether the characters not included in these two sets need encoding depends
upon where in the URL they appear (more on that below).

  Encoding a character means converting it to its hex value and preceding
it with a '%':

	##%	-> %23%23%25
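
  As a rough Python sketch of that rule: keep the "AS IS" characters listed above and turn everything else into %XX. The function name `percent_encode` is made up, and encoding non-ASCII text as UTF-8 bytes first is an assumption, since the character encoding is exactly what the thread says Gemini still has to define.

```python
import string

# The characters that can always be used as-is: ALPHA DIGIT - . _ ~
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

def percent_encode(text: str) -> str:
    out = []
    for ch in text:
        if ch in UNRESERVED:
            out.append(ch)
        else:
            # Encode each byte as %XX (UTF-8 is an assumption here).
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)

print(percent_encode("##%"))
```

  This is deliberately conservative: it also encodes characters such as '/' that some URL components allow unencoded.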

  Each section of a URL (scheme, authority [2], path, query, fragment)
allows certain characters that would otherwise be encoded to NOT be encoded. 
I'll concentrate on the query portion since that's the part under question. 
The query portion allows the following characters to appear non-encoded:

	ALPHA DIGIT - . _ ~ / ? : @

  The '=' and '&' are used as sub-delimiters (to separate name and value,
and to separate name/value pairs).  If a '=' or '&' appears in a name or the
value, it has to be encoded.

  The '+' sign is listed as a sub-delimiter in [URI], which otherwise says
nothing about it.  [CGI] and [WEBFORM] define it differently.  [CGI] allows
it, but *only* if '=' and '&' aren't used (section 4.4):

	...?one+two+three		'+' ALLOWED
	...?one+two=3&three=3		'+' DISALLOWED

  And in this case, the '+' is to be treated as a space.  In any other case,
the space needs to be encoded:

	...?query=what%20is%20this%20madness&lang=en	DEFINED
	...?query=what+is+this+madness&lang=en		UNDEFINED
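
  The [CGI] rule as summarised above could be sketched in Python like this. The function name `search_words` is hypothetical, and the check for both '=' and '&' follows the summary in this message rather than a fresh reading of RFC-3875; returning None for the name/value case is just a convention chosen for the example.

```python
from typing import List, Optional
from urllib.parse import unquote

def search_words(query: str) -> Optional[List[str]]:
    """Split a '+'-separated query into words, or return None if it
    looks like name=value pairs (where '+' has no defined meaning)."""
    if "=" in query or "&" in query:
        return None
    # '+' separates words; each word is percent-decoded on its own,
    # so an encoded %2B still yields a literal plus sign.
    return [unquote(word) for word in query.split("+")]

print(search_words("one+two+three"))
print(search_words("one+two=3&three=3"))
```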

  [WEBFORM] defines the '+' to be a space, but only when the data is being
sent as part of a POST, and the content type is
"application/x-www-form-urlencoded".  This doesn't apply at all to Gemini.

  Now, it could be that there are webservers (or CGI scripts) that convert
'+' to spaces regardless.  I'm just saying ...

  Hopefully, this clears it all up (said as he wipes the mud off his face).

  -spc (Don't hesitate to ask any questions ... )

[1]	You'd be hard pressed to see these listed in [URI] since they aren't
	listed!  RFC-1738 lists those characters explicitly, so that's four
	references.  Sorry.

[2]	[URI] calls the host portion "authority".


colecmac@protonmail.com <colecmac (a) protonmail.com>

I think it might just make the most sense to say in the spec that
encoding is required, and should be done with percent signs, for
spaces too. Like in Sean's message:

?query=what%20is%20this%20madness&lang=en


makeworld



solderpunk <solderpunk (a) SDF.ORG>

On Mon, May 25, 2020 at 03:54:32PM +0000, colecmac at protonmail.com wrote:
> I think it might just make the most sense to say in the spec that
> encoding is required, and should be done with percent signs, for
> spaces too. Like in Sean's message:
> 
> ?query=what%20is%20this%20madness&lang=en

Yeah, I will do this, once I find the time to dig through the relevant
RFCs and make sure I can explain what should be done succinctly and
unambiguously using all the right terminology.

If this isn't done in like a week or so, somebody poke me!

Cheers,
Solderpunk

