💾 Archived View for gemi.dev › gemini-mailing-list › 000003.gmi captured on 2023-11-04 at 12:17:22. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
Two quick points with regard to the fact that Gemini currently does not convey file sizes to users at any point:
It was thus said that the Great solderpunk once stated:
> Two quick points with regard to the fact that Gemini currently does not
> convey file sizes to users at any point:
>
> * Sean has pointed out in one of his RFCs that this means there is no
>   way for a client to know whether or not a download completed
>   successfully or was interrupted due to an accidentally dropped or
>   even a maliciously severed connection
>
> * I've received an email from somebody watching the Gemini design unfold
>   with interest, who is concerned about Gemini clients with limited
>   system resources unwittingly downloading large files (such as PDFs of
>   scanned documents) which they aren't even capable of opening.  While I
>   quite like the idea of Gemini being friendly to low-end systems, I do
>   wonder whether or not the TLS requirement makes this a little moot.
>
> Anyway, the question is do we want to change anything to address these
> issues and if so how do we want to do it?
>
> I'll quickly note in passing that both of these problems also exist in
> exactly the same form for Gopher, but I've never once heard Gopher users
> complain about them.

Gopher does address this rather obliquely---text files (and gopher
indexes) are supposed to end with a '.' on a line by itself.  This lets the
client know it received the data correctly, and it says as much in RFC-1436,
section 3.8:

    Note that for type 5 or type 9 the client must be prepared to read
    until the connection closes.  There will be no period at the end of
    the file; ...

Not knowing the file size beforehand isn't necessarily a pain point, but it
does make displaying a progress bar (for example) difficult to implement.

> One possibility, as proposed by Sean, is to add file size to the
> response header, with it optionally appearing after the MIME type.  I'm
> not hugely fond of this myself, simply because it complicates parsing of
> the response header.

I'm not seeing much of an issue.  Assuming tabs separate the components on
the status line, then

    (\d+)\t([^\t]+)(\t([^\t]+))*

would parse the line (I suspect, I'm not a fan of regex but I think the
above would work to parse the status line).  I don't see much of an issue in
parsing any of the following:

    20<HTAB>text/plain; charset=utf-8<HTAB>2123<CRLF>
    20<HTAB>text/plain<CRLF>

> Remember that the MIME type can have multiple
> components specifying encodings etc.  If you just split the META part of
> the header on whitespace, the number of components is variable, so
> recognising whether or not an optional filesize is present requires
> actually inspecting the parts and looking for a number.  In fairness to
> Sean, at the time of writing of his RFC the spec said META was
> separated from STATUS by a tab (whereas now it is just whitespace), so
> tacking something after META with another tab was unambiguous, assuming
> nobody put tabs in their MIME types...

Which could be specified, "don't put tabs in the MIME type section."

> Another possibility ties into another request I got from somebody very
> early on - it would be nice if there was some way to query a Gemini
> server for the time a resource was last modified, so that Gemini
> equivalents of tools like moku pona could avoid needlessly fetching
> unchanged resources over and over again.  At that point I started
> wondering about giving Gemini some equivalent of HTTP HEAD, although I
> abandoned it pretty quickly when I realised that substantial TLS
> overhead probably made making a whole second request to check if a
> resource had changed not such a worthwhile idea.

One way would be to query a well-known endpoint (these exist in the HTTP
world---robots.txt is one such file) that contains timestamps for various
resources.  Slap a MIME type of text/gemini-timestamp and call it done:

    gemini://example.com/       2019-08-15T13:53:00-05:00
    gemini://example.com/feed   2019-07-29T00:00:00-05:00
    gemini://example.com/other  2019-08-01T00:00:00-05:00

That's one way.
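To make the tab-separated proposal concrete, here is a minimal sketch of a client-side parser for the two example status lines above. It narrows Sean's regex so that the optional third field must be a decimal size; the function name `parse_status_line` and that narrowing are assumptions made for illustration, not part of any Gemini spec.

```python
import re

# Assumed layout, taken from the examples in the message above:
#   STATUS <TAB> MIME type [<TAB> size] <CRLF>
STATUS_LINE = re.compile(r"(\d+)\t([^\t]+)(?:\t(\d+))?")


def parse_status_line(line):
    """Return (status, meta, size); size is None when no size was sent."""
    match = STATUS_LINE.fullmatch(line.rstrip("\r\n"))
    if match is None:
        raise ValueError("malformed status line: %r" % line)
    status, meta, size = match.groups()
    return int(status), meta, (int(size) if size is not None else None)
```

Because the fields are split on tabs only, spaces inside the MIME type (as in `text/plain; charset=utf-8`) do not confuse the parser, which is the point Sean makes about tab separation.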
> But, we could possibly
> bring this idea back, as the response to such a request could naturally
> include the file size as well.  The real question is how to *make* such
> a request, ideally in a way which doesn't open the door to a half dozen
> other new "methods".

As I mentioned in a private email to solderpunk earlier, one could always
take advantage of the sub-delimiters in the path portion.  I had at one
point mentioned using those to specify the prompt (otherwise the server
would return a status of 10):

    gemini://example.com/search;Search%20for

This could be formalized:

    gemini://example.com/search;prompt=Search%20for
    gemini://example.com/blogfeed;timestamp=2019-08-15T00:00:00Z
    gemini://example.com/wildexample;prompt=Search%20for;timestamp=2019-08-15:00:00:00Z?query=foo&usename=bar

So, you have "prompt" and "timestamp".  Others could be proposed.  If the
"timestamp" thing above is accepted, then you might want to have a new
status code meaning "no change" or "okay, but there's no content".

> Regarding ways to enable something like a HEAD request without changing
> the request format to include a method field - I'm not quite sure
> whether using a fixed URL fragment, like #meta, on requests would be a
> kosher way to do this.  Does metadata count as "some portion or subset
> of the primary resource, some view on representations of the primary
> resource, or some other resource defined or described by those
> representations" (from RFC3986)?

Well, there are RFC-5147 and RFC-7111 that give semantics to the URI
fragment section, but I still think using the sub-delimiter of ';' in the
path portion is the way to go.

  -spc
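As a rough illustration of the ';' idea, the sketch below shows how a server might split such parameters off the path. The helper name `split_path_params` is hypothetical, and it assumes parameters are attached only to the last path segment and always take the `name=value` form of the "formalized" examples.

```python
from urllib.parse import unquote, urlsplit


def split_path_params(url):
    """Split ';name=value' parameters off the path of a URL.

    Returns (path, params, query). Assumes parameters only follow the
    final path segment, as in the examples from the message above.
    """
    parts = urlsplit(url)
    path, *raw_params = parts.path.split(";")
    params = {}
    for item in raw_params:
        name, _, value = item.partition("=")
        params[unquote(name)] = unquote(value)
    return path, params, parts.query
```

Note that `urlsplit` leaves ';' in the path untouched, so the query string after '?' is still available separately from the path parameters.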
> Gopher does address this rather obliquely---text files (and gopher
> indexes) are supposed to end with a '.' on a line by itself.  This lets the
> client know it received the data correctly, and it says as much in RFC-1436,
> section 3.8:

Whoops, true!  In my defence, I think this is very rarely used nowadays.
VF-1 includes no code whatsoever to detect and strip this from files it
downloads and I've never seen one appear on screen.

> I'm not seeing much of an issue.  Assuming tabs separate the components on
> the status line, then
>
>     (\d+)\t([^\t]+)(\t([^\t]+))*
>
> would parse the line (I suspect, I'm not a fan of regex but I think the
> above would work to parse the status line).  I don't see much of an issue in
> parsing any of the following:
>
>     20<HTAB>text/plain; charset=utf-8<HTAB>2123<CRLF>
>     20<HTAB>text/plain<CRLF>
>
> Which could be specified, "don't put tabs in the MIME type section."

Yes, with sufficient prescription of whitespace practices in response
headers it could be made sufficiently parsable, but it would be nice if
things weren't so brittle.

This also, of course, sets a precedent of "whenever we decide a little
bit of extra metadata would be handy in the header, just append it after
a tab", which over time could bloat our header until it's basically just
a HTTP header in disguise with tabs instead of newlines.

(not a fan of regex either, by the way, and was quite happy to discover
Lua's lightweight alternative system when I first picked it up)

> One way would be to query a well-known endpoint (these exist in the HTTP
> world---robots.txt is one such file) that contains timestamps for various
> resources.  Slap a MIME type of text/gemini-timestamp and call it done:
>
>     gemini://example.com/       2019-08-15T13:53:00-05:00
>     gemini://example.com/feed   2019-07-29T00:00:00-05:00
>     gemini://example.com/other  2019-08-01T00:00:00-05:00
>
> That's one way.

I actually quite like this idea.  No need to make it timestamp-specific
either.  We could have a well-known endpoint for general file metadata,
which listed modification time, file size, checksum, MIME type, etc.  It
could accept queries for a specific path, and *that* could be the way to
do an equivalent of a HEAD request.

This would let clients for specific scenarios do the extra work
themselves to work around their problems, e.g. clients with very low
memory or storage space could request the metadata for all files before
attempting to access them and warn the user if the file size exceeds a
threshold; clients on unreliable connections could request the metadata
before downloading and then warn the user if file size and/or checksum did
not agree.  Most "normal" clients could do neither and just operate as
they already do.

I think it's kind of neat to keep solutions to edge problems outside of
the protocol itself and push them into things like well-known endpoints
like the above where they can easily be ignored when they are not
needed/wanted.  The downside is that server developers have to do the
work to add support for these things - but it's expected, I think, that
servers are harder to write than clients.  Ease of client implementation
is very important - it leads to a large number of independent clients,
which means unofficial extensions of the standard can only really take
off if a large number of people with presumably diverse opinions can be
convinced they are worthwhile.  And, of course, some server authors can
just choose not to support some of these endpoints, and when queried can
just return status 51 and then the client understands they are on their
own.

All of this can be done without any change to the core Gemini spec (each
well-known endpoint, of course, would need its own spec).

> As I mentioned in a private email to solderpunk earlier, one could always
> take advantage of the sub-delimiters in the path portion.  I had at one
> point mentioned using those to specify the prompt (otherwise the server
> would return a status of 10):
>
>     gemini://example.com/search;Search%20for
>
> This could be formalized:
>
>     gemini://example.com/search;prompt=Search%20for
>     gemini://example.com/blogfeed;timestamp=2019-08-15T00:00:00Z
>     gemini://example.com/wildexample;prompt=Search%20for;timestamp=2019-08-15:00:00:00Z?query=foo&usename=bar
>
> So, you have "prompt" and "timestamp".  Others could be proposed.  If the
> "timestamp" thing above is accepted, then you might want to have a new
> status code meaning "no change" or "okay, but there's no content".

I think I prefer the well-known endpoint over this, but that's right now
more of a gut reaction and not a well thought-out and defensible
position.

> Well, there are RFC-5147 and RFC-7111 that give semantics to the URI
> fragment section, but I still think using the sub-delimiter of ';' in the
> path portion is the way to go.

Ah, more for the reading list!

-Solderpunk
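For readers wondering what consuming the proposed timestamp listing might look like, here is a small sketch of an aggregator-style check. It assumes the response body is exactly the "URL, whitespace, ISO 8601 timestamp" layout from Sean's example; the function names are hypothetical and no such endpoint is defined anywhere in the thread.

```python
from datetime import datetime


def parse_timestamp_listing(body):
    """Map each URL in a 'URL<whitespace>timestamp' listing to a datetime."""
    listing = {}
    for line in body.splitlines():
        if not line.strip():
            continue  # skip blank lines
        url, stamp = line.split(None, 1)
        listing[url] = datetime.fromisoformat(stamp.strip())
    return listing


def needs_refetch(listing, url, last_seen):
    """True when the URL is unlisted or was modified after last_seen."""
    modified = listing.get(url)
    return modified is None or modified > last_seen
```

An unlisted URL is treated as "fetch it anyway", which matches the spirit of the proposal: clients fall back to normal behaviour when the metadata is missing.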
solderpunk writes:

> This also, of course, sets a precedent of "whenever we decide a little
> bit of extra metadata would be handy in the header, just append it after
> a tab", which over time could bloat our header until it's basically just
> a HTTP header in disguise with tabs instead of newlines.

We are, of course, re-inventing the Content-Length header here, which is
an extremely useful header so that clients can do things like have a
progress bar for downloads.  If we support every useful feature, of
course, we end up with basically HTTP 0.9.

> Sean writes:
>> One way would be to query a well-known endpoint (these exist in the HTTP
>> world---robots.txt is one such file) that contains timestamps for various
>> resources.  Slap a MIME type of text/gemini-timestamp and call it done:
>>
>>     gemini://example.com/       2019-08-15T13:53:00-05:00
>>     gemini://example.com/feed   2019-07-29T00:00:00-05:00
>>     gemini://example.com/other  2019-08-01T00:00:00-05:00
>>
>> That's one way.

IMO you probably don't want to serve the timestamps or file sizes for
every single file/resource in one response, for a number of reasons both
privacy-related and performance-related.

> I actually quite like this idea.  No need to make it timestamp-specific
> either.  We could have a well-known endpoint for general file metadata,
> which listed modification time, file size, checksum, MIME type, etc.  It
> could accept queries for a specific path, and *that* could be the way to
> do an equivalent of a HEAD request.

Yeah, it makes sense to bundle the metadata into one response, but only
serve for one path per request.

> This would let clients for specific scenarios do the extra work
> themselves to work around their problems, e.g. clients with very low
> memory or storage space could request the metadata for all files before
> attempting to access them and warn the user if the file size exceeds a
> threshold; clients on unreliable connections could request the metadata
> before downloading and then warn the user if file size and/checksum did
> not agree.  Most "normal" clients could do neither and just operate as
> they already do.

The one downside I see to that is that once we start seeing more complex
clients written, especially graphical clients, we will start seeing
progress bars, download managers, etc., and there will always be a need
to make a second request to get the metadata, which is relatively
expensive.

> I think it's kind of neat to keep solutions to edge problems outside of
> the protocol itself and push them into things like well-known endpoints
> like the above where they can easily be ignored when they are not
> needed/wanted.  The downside is that server developers have to do the
> work to add support for these things - but it's expected, I think, that
> servers are harder to write than clients.

My experience *so far* is that server implementation is easier.  But that
is likely to change, I guess.

>> As I mentioned in a private email to solderpunk earlier, one could always
>> take advantage of the sub-delimiters in the path portion.  I had at one
>> point mentioned using those to specify the prompt (otherwise the server
>> would return a status of 10):
>>
>>     gemini://example.com/search;Search%20for
>>
>> This could be formalized:
>>
>>     gemini://example.com/search;prompt=Search%20for
>>     gemini://example.com/blogfeed;timestamp=2019-08-15T00:00:00Z
>>     gemini://example.com/wildexample;prompt=Search%20for;timestamp=2019-08-15:00:00:00Z?query=foo&usename=bar
>>
>> So, you have "prompt" and "timestamp".  Others could be proposed.  If the
>> "timestamp" thing above is accepted, then you might want to have a new
>> status code meaning "no change" or "okay, but there's no content".

This is definitely re-inventing HTTP headers with a different format.

--
+----------------------------------------------------------------------+
| Jason F. McBrayer                    jmcbray at carcosa.net          |
| The scalloped tatters of the King in Yellow must hide Yhtill forever.|
> We are, of course, re-inventing the Content-Length header here, which is
> an extremely useful header so that clients can do things like have a
> progress bar for downloads.  If we support every useful feature, of
> course, we end up with basically HTTP 0.9.

Right, and that's what I'd like to avoid taking even one step toward,
because unless we resist it hard from day one, I worry we'll inevitably
fall down a slippery slope to recreating HTTP.

In some sense, Gemini as-is chooses Content-Type as the "one true
header".  To some extent that needs justification, because there are
certainly other HTTP headers which don't do any evil and are genuinely
helpful.  I guess Content-Type was elevated because in my experience
with Gopher, it was the missing piece of information which caused the
most pain.  The lack of in-band content type signalling in Gopher
requires things like a custom URL scheme to encode that information,
which is a bit gross.

> IMO you probably don't want to serve the timestamps or file sizes for
> every single file/resource in one response, for a number of reasons both
> privacy-related and performance-related.

Yeah, putting every hosted resource in one response makes no sense -
that response could easily be larger than the one resource we were
interested in checking the timestamp or file size of.  The sensible
thing is to include the particular path you care about as a query.  If
that is absent, a sensible default might be to list the N most recently
updated resources, or the N most popularly requested?  But, really, I'm
not sure who would be making that kind of request or why.

> The one downside I see to that is that once we start seeing more complex
> clients written, especially graphical clients, we will start seeing
> progress bars, download managers, etc., and there will always be a need
> to make a second request to get the metadata, which is relatively
> expensive.

While they are not prohibited, using Gemini to transfer large binary
files (of the kind for which progress bars and download managers are
useful) is not a very smart idea.  There's no way to use compression to
speed things up, and there is no way to resume an interrupted download
with anything like a Range: header.  It's just not the right tool for
the job.  Thankfully, it's very easy for a Gemini document to link to
things hosted on other protocols, like https or even bittorrent, or one
of these newfangled P2P things the kids are using like IPFS, which are
more appropriate for that task.  I would say it makes sense to recommend
not using Gemini for files larger than, I dunno, 10MB?

The second request is definitely expensive, but the idea is not that
every client does this before every "real" request.  This "pseudo HEAD"
is mostly a support mechanism for clients with unreliable and/or slow
connections or with very limited disc space and/or memory.  I expect 95%
of clients to work exactly the way Gopher clients do, by just ignoring
all of this and generally being none the worse for it.

Well, maybe 95% is a bit strong.  Moku pona style aggregators might be
able to make good use of this, too.

Very slick clients could just request a resource directly, keep track of
how much response they've pulled down so far (or how many seconds
they've been reading for) and once some threshold is passed (1MB?  5
seconds?) make another request in a separate thread for the metadata and
then begin displaying a progress bar or ETA prompt or something to
reassure the user that things really are happening.

> My experience *so far* is that server implementation is easier.  But that
> is likely to change, I guess.

I guess for a sufficiently simple server which only deals in 20, 40 and
50 that's true.

> This is definitely re-inventing HTTP headers with a different format.

It is - while the well-known endpoint for metadata is basically
reinventing a HTTP method with a different format.

Maybe we don't need to worry about this?  There's no shame in not being
able to match every single useful capability that HTTP has.  It should
have been clear from the outset we wouldn't be able to do that.

-Solderpunk
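The "very slick client" behaviour described above boils down to a threshold test plus a ratio once the size is known. The sketch below uses the 1 MB and 5 second figures floated in the message as defaults; both function names are made up for illustration, and how the metadata would actually be fetched is left out because no such endpoint is specified.

```python
def should_request_metadata(bytes_read, seconds_elapsed,
                            byte_threshold=1_000_000, time_threshold=5.0):
    """Decide whether a slow download justifies a second, metadata request."""
    return bytes_read >= byte_threshold or seconds_elapsed >= time_threshold


def progress_fraction(bytes_read, total_size):
    """Return progress in [0, 1], or None when the total size is unknown."""
    if not total_size:
        return None
    return min(bytes_read / total_size, 1.0)
```

Small, fast responses never cross either threshold, so most requests would cost nothing extra under this scheme.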
>>>> * Sean has pointed out in one of his RFCs that this means there is no
>>>>   way for a client to know whether or not a download completed
>>>>   successfully or was interrupted due to an accidentally dropped or
>>>>   even a maliciously severed connection

Does TLS not have a native way to handle this?  I did a quick search and
found that either the client/server may send a close_notify message before
terminating a TLS connection.

https://tools.ietf.org/html/rfc5246#section-7.2.1

> Very slick clients could just request a resource directly, keep track of
> how much response they've pulled down so far (or how many seconds
> they've been reading for) and once some threshold is passed (1MB?  5
> seconds?) make another request in a separate thread for the metadata and
> then begin displaying a progress bar or ETA prompt or something to
> reassure the user that things really are happening.

The whole idea of serving metadata in a separate TCP request might work
fine for static files, but it would break down when attempting to serve any
type of dynamically generated content.  Say I want to host an endpoint that
does something like return a random quote from fortune(1).  Gopher sites
love to serve these little CGI tools because they're fun and require very
little effort :)  After I generate the text, I can easily calculate the size
and send it along with the response in-line using something like a response
header.  But I wouldn't be able to tell you the size ahead of time from a
separate metadata TCP request.

> Maybe we don't need to worry about this?  There's no shame in not being
> able to match every single useful capability that HTTP has.  It should
> have been clear from the outset we wouldn't be able to do that.

I agree with this sentiment.  Knowing the file size would be nice to have,
but I haven't seen any use cases that are compelling enough to justify
increasing the complexity of the protocol.  Even in HTTP land,
Content-Length is an optional response header and there are several
scenarios where it doesn't make sense to include it at all.

- mozz
> Does TLS not have a native way to handle this? I did a quick search and found
> that either the client/server may send a close_notify message before
> terminating a TLS connection.
>
> https://tools.ietf.org/html/rfc5246#section-7.2.1

Ah, this is a great detail to have noticed!  I'll need to look into how
hard it typically is for clients to check that they've received this.
If it's straightforward, it's probably a good idea to amend the spec to
say that servers MUST send this close_notify message.  That would then
act as an analogue of Gopher's lone dot.

> I agree with this sentiment. Knowing the file size would be nice to have, but I
> haven't seen any use cases that are compelling enough to justify increasing the
> complexity to the protocol for. Even in HTTP land, Content-Length is an optional
> response header and there are several scenarios where it doesn't make sense to
> include it at all.

Yeah, I think we can leave this for now.  It was a hypothetical concern
that somebody had.  Not necessarily a bad one, but until it's observed
actually creating significant trouble for actual users on actual clients
I think we can just table this issue.  If it does come up as a practical
concern, we can resume discussion of some of the ideas here.

-Solderpunk
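On the question of how hard the close_notify check is for a client, here is a hedged sketch for Python's standard `ssl` module. It assumes the socket was created with `SSLContext.wrap_socket(..., suppress_ragged_eofs=False)`, in which case a connection that ends without close_notify should surface as `ssl.SSLEOFError` rather than as a normal empty read. The function name `read_body` is hypothetical, and other TLS libraries may report truncation differently.

```python
import ssl


def read_body(tls_sock, chunk_size=4096):
    """Read until the connection ends; return (data, closed_cleanly).

    Assumes tls_sock was wrapped with suppress_ragged_eofs=False, so an
    EOF without close_notify raises ssl.SSLEOFError instead of returning
    an empty read.
    """
    chunks = []
    try:
        while True:
            chunk = tls_sock.recv(chunk_size)
            if not chunk:
                # Empty read: the peer shut the TLS session down properly.
                return b"".join(chunks), True
            chunks.append(chunk)
    except ssl.SSLEOFError:
        # The transport closed without close_notify: possibly truncated.
        return b"".join(chunks), False
```

A client could then warn the user whenever the second element of the result is `False`, much as a Gopher client could warn about a missing lone dot.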
It was thus said that the Great solderpunk once stated:
>
> Yeah, I think we can leave this for now.  It was a hypothetical concern
> that somebody had.  Not necessarily a bad one, but until it's observed
> actually creating significant trouble for actual users on actual clients
> I think we can just table this issue.  If it does come up as a practical
> concern, we can resume discussion of some of the ideas here.

I had a thought last night in that there does, in fact, exist a way to
include the information without changing the syntax of the protocol.  The
MIME type that is returned can have parameters, and these parameters can act
as small headers, so:

    20 text/plain; charset=iso-8859-1; size=1345
    20 text/gemini; size=2003; modified=2019-08-18T17:15-05:00

  -spc
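A client that wanted to read such parameters could do so with very little code. The sketch below is a deliberately naive splitter, assuming no parameter value contains a ';' inside quotes; `parse_meta` is a hypothetical name, and `size` and `modified` are only the parameter names suggested in the message, not registered ones.

```python
def parse_meta(meta):
    """Split 'type/subtype; name=value; ...' into (mime_type, params)."""
    mime_type, *raw_params = [part.strip() for part in meta.split(";")]
    params = {}
    for item in raw_params:
        if not item:
            continue  # tolerate a trailing ';'
        name, _, value = item.partition("=")
        # Parameter names are case-insensitive; values may be quoted.
        params[name.strip().lower()] = value.strip().strip('"')
    return mime_type.lower(), params
```

A client that does not care about `size` or `modified` simply never looks them up, so unknown parameters are ignored without any extra handling.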
>   I had a thought last night in that there does, in fact, exist a way to
> include the information without changing the syntax of the protocol.  The
> MIME type that is returned can have parameters, and these parameters can act
> as small headers, so:
>
>     20 text/plain; charset=iso-8859-1; size=1345
>     20 text/gemini; size=2003; modified=2019-08-18T17:15-05:00
>

Strictly speaking, those parameters are supposed to be limited to the
ones defined in the RFC for the corresponding type, right?

Not that we're necessarily playing by all the rules here.  text/gemini
is an unregistered, experimental type and so I think strictly speaking
we should be text/x.gemini or text/prs.gemini or something...

Don't tell the IETF!

-Solderpunk
It was thus said that the Great solderpunk once stated:
> >   I had a thought last night in that there does, in fact, exist a way to
> > include the information without changing the syntax of the protocol.  The
> > MIME type that is returned can have parameters, and these parameters can act
> > as small headers, so:
> >
> >     20 text/plain; charset=iso-8859-1; size=1345
> >     20 text/gemini; size=2003; modified=2019-08-18T17:15-05:00
> >
>
> Strictly speaking, those parameters are supposed to be limited to the
> ones defined in the RFC for the corresponding type, right?

Nope.  From RFC-2045, section 5:

    MIME implementations must ignore any parameters whose names they do
    not recognize.

Also, from the same section:

    The ordering of parameters is not significant.

> Not that we're necessarily playing by all the rules here.  text/gemini
> is an unregistered, experimental type and so I think strictly speaking
> we should be text/x.gemini or text/prs.gemini or something...

Nope, RFC-6648 deprecates the whole "X-" thing.

  -spc
On Sun, Aug 18, 2019 at 11:36:41AM +0000, solderpunk wrote:

> Yeah, I think we can leave this for now.  It was a hypothetical concern
> that somebody had.  Not necessarily a bad one, but until it's observed
> actually creating significant trouble for actual users on actual clients
> I think we can just table this issue.  If it does come up as a practical
> concern, we can resume discussion of some of the ideas here.

Well, I guess this is no longer quite so hypothetical, as Konpeito has
given us large binary files over Gemini.  I noticed that this was an
unpleasant user experience to download via AV-98 (with the client
appearing to "hang" for a long time during the download).  And, in fact,
I did receive an email from one person telling me they liked Gemini a
lot, who mentioned that they had not been able to download the Konpeito
album.  I strongly suspect that, in fact, they just didn't realise their
client was slowly downloading it.

One possible solution to the user experience problem that doesn't
require the client actually knowing the file size is to display, instead
of a progress bar, some kind of "stuff is happening" indicator, e.g. an
animated spinner (you know, of the /, -, \, | kind) accompanied by a
count of KiB / MiB downloaded so far?  That would at least make it
obvious to users that something was happening.  And a convention of
specifying filesizes for large binary files in the link text would allow
rough mental calculation of progress?

Cheers,
Solderpunk
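The spinner-plus-byte-count indicator needs no knowledge of the total size, as this minimal sketch shows. The names `format_size` and `progress_lines` are illustrative only, and the chunk sizes would come from whatever read loop the client already has.

```python
import itertools

SPINNER = "/-\\|"  # the four frames mentioned in the message above


def format_size(num_bytes):
    """Render a byte count in KiB below 1 MiB, otherwise in MiB."""
    if num_bytes < 1024 * 1024:
        return "%.1f KiB" % (num_bytes / 1024)
    return "%.1f MiB" % (num_bytes / (1024 * 1024))


def progress_lines(chunk_sizes):
    """Yield one status string per received chunk, cycling the spinner."""
    total = 0
    for frame, size in zip(itertools.cycle(SPINNER), chunk_sizes):
        total += size
        yield "%s %s downloaded" % (frame, format_size(total))
```

A terminal client could print each string preceded by a carriage return so the line redraws in place while the download continues.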
---