💾 Archived View for gemi.dev › gemini-mailing-list › 000010.gmi captured on 2023-11-04 at 12:17:56. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
I see that mozz.us has added CGI support for Gemini, which makes two servers [1] (as far as I can see), so I think it's time to go into a bit of detail about how I adapted RFC-3875 for Gemini. I should note that I don't expect Gemini servers to support this---I only did it because I could.

First, a bit about the specification. This is just a light overview of the CGI specification and readers are encouraged to read RFC-3875 for the full details. It's a way to generate content for a server using an external process and was originally meant for the web. The spec itself is fairly agnostic to HTTP though, which means it's pretty easy to adapt to Gemini.

The output from a script contains a header portion, followed by a blank line, followed by the body. The headers defined by RFC-3875 are:

    Content-Type:   mandatory, contains MIME type
    Location:       optional, contains absolute URL to new location
    Status:         optional, contains status code and text phrase

Other headers can be included but it's up to the CGI handler to deal with those (sections 6.3.4 and 6.3.5). Unlike email or HTTP headers, any headers output from a CGI script MUST be one-per-line, which simplifies processing. If the Status: header isn't present, it defaults to '200 OK' (but more on this with respect to Gemini below).

The input to a CGI program comes from environment variables.
The following variables are defined:

    AUTH_TYPE           authentication type, see below for discussion
    CONTENT_LENGTH      length of incoming body
    CONTENT_TYPE        media-type of incoming body
    GATEWAY_INTERFACE   MUST be "CGI/1.1"
    PATH_INFO           see below for discussion
    PATH_TRANSLATED     see below for discussion
    QUERY_STRING        MUST exist; if no query given, defaults to ""
    REMOTE_ADDR         MUST exist
    REMOTE_HOST         SHOULD exist; MAY be REMOTE_ADDR
    REMOTE_IDENT        MAY exist, see RFC-1413 for info
    REMOTE_USER         see below for discussion
    REQUEST_METHOD      MUST exist
    SCRIPT_NAME         MUST exist, path to script
    SERVER_NAME         MUST exist
    SERVER_PORT         MUST exist
    SERVER_PROTOCOL     MUST exist
    SERVER_SOFTWARE     MUST exist

There may also exist protocol-specific variables (for the web, these would be "HTTP_..." and for Gemini they should be "GEMINI_...") but these are implementation defined; there are none defined in the RFC. Other environment variables may also exist, depending upon the operating system or configuration, and again are beyond the scope of the RFC.

Per section 4.4, if the query string does NOT contain an unencoded '=' character, then the query should be broken up into words, splitting on an unencoded '+' character, and passed to the CGI program as command line parameters; otherwise, if there is an unencoded '=' character, treat the query string as a list of name/value pairs, and DO NOT pass the query string as command line parameters. [2]

Now for the Gemini-specific parts. Some of this only applies to my server implementation and is marked as such; other bits should probably apply to other implementations to ensure a consistent approach to CGI programs under Gemini.

My server receives a request and parses it. The path portion is then checked against a redirection list (temporary, permanent, gone [4]), then against a handler list [3]. Only after passing through that processing is the file system checked.
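The section 4.4 rule described above can be sketched roughly as follows. This is only an illustration, not the code from my server; the function name cgi_command_line is made up for the example, and it assumes the raw (still percent-encoded) query string is what gets passed in, so that a literal '=' or '+' is by definition an unencoded one.

```python
from urllib.parse import unquote


def cgi_command_line(query: str) -> list[str]:
    """Return the command line words for a raw query string.

    An empty list means nothing is passed on the command line, either
    because there is no query or because it holds an unencoded '='.
    """
    if not query or "=" in query:
        return []
    # Split on the unencoded '+' first, then decode each word, so that
    # an encoded "%2B" or "%3D" survives inside a word.
    return [unquote(word) for word in query.split("+")]
```

For example, a request for /test/cgi?foo+bar%3Dbaz would run the script with the two arguments "foo" and "bar=baz", while /test/cgi?a=1&b=2 would run it with no arguments at all.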
As the processing proceeds down the filesystem, if a file is marked as executable (the server is Unix), then the server assumes it's a CGI program. The following environment variables are set:

    GATEWAY_INTERFACE   "CGI/1.1"
    QUERY_STRING        the raw query string per the request, or ""
    REMOTE_ADDR         IP address of client
    REMOTE_HOST         IP address of client
    REQUEST_METHOD      "" (see discussion)
    SCRIPT_NAME         URL path to script
    SERVER_NAME         host of server
    SERVER_PORT         port of server
    SERVER_PROTOCOL     "GEMINI"
    SERVER_SOFTWARE     "GLV-1.12556/1"
    REMOTE_IDENT        not set
    CONTENT_TYPE        not set
    CONTENT_LENGTH      not set

I don't support RFC-1413, so I don't bother setting REMOTE_IDENT. There's no body, so I don't bother setting CONTENT_TYPE or CONTENT_LENGTH. And after some thought, I decided to go with "" for REQUEST_METHOD. Gemini doesn't have the concept of request methods the way HTTP does. I mean, I suppose "GET" *could* apply, but I didn't want to give the impression that Gemini would offer other methods, so to signal intent, I went with an empty string, given that REQUEST_METHOD is mandatory.

I only set PATH_INFO and PATH_TRANSLATED in certain circumstances---when the path continues on past the CGI program in the request. For example:

    /test/cgi           PATH_INFO/PATH_TRANSLATED NOT set
    /test/cgi/foo/bar   PATH_INFO/PATH_TRANSLATED set

Given the second example, they would be set as:

    PATH_INFO           "/foo/bar"
    PATH_TRANSLATED     "/home/spc/projects/gemini/root/foo/bar"

(see RFC-3875 for more details).
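The walk down the filesystem and the PATH_INFO/PATH_TRANSLATED example above could be sketched like this. It is a simplified illustration under a few assumptions: the names split_cgi_path and is_script are invented for the example, the document root is a plain directory, a trailing slash on the request is ignored, and any request containing ".." is simply refused.

```python
import os
from typing import Callable, Optional


def _is_executable_file(path: str) -> bool:
    # Default check: a regular file with the execute bit set (Unix).
    return os.path.isfile(path) and os.access(path, os.X_OK)


def split_cgi_path(
    doc_root: str,
    url_path: str,
    is_script: Callable[[str], bool] = _is_executable_file,
) -> Optional[dict[str, str]]:
    """Find the first executable along url_path and derive the CGI
    path variables, or return None if no script is found."""
    segments = [s for s in url_path.split("/") if s]
    if ".." in segments:
        return None
    current = doc_root
    for i, segment in enumerate(segments):
        current = os.path.join(current, segment)
        if is_script(current):
            env = {"SCRIPT_NAME": "/" + "/".join(segments[: i + 1])}
            rest = segments[i + 1:]
            if rest:  # only set when the path continues past the script
                path_info = "/" + "/".join(rest)
                env["PATH_INFO"] = path_info
                env["PATH_TRANSLATED"] = doc_root.rstrip("/") + path_info
            return env
    return None
```

The is_script parameter is there only so the logic can be exercised without touching a real filesystem.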
Since I also run my own website using Apache, the server can be configured to give CGI scripts Apache-style environment variables, and if enabled, the following will be set:

    DOCUMENT_ROOT           top level directory of Gemini content
    CONTEXT_DOCUMENT_ROOT   same as DOCUMENT_ROOT
    CONTEXT_PREFIX          ""
    SCRIPT_FILENAME         actual path to script on filesystem

And because of that, I added support to run web-based CGI scripts and if configured (http mode), the following environment variables are also included (or changed):

    REQUEST_METHOD      "GET"
    SERVER_PROTOCOL     "HTTP/1.0"
    HTTP_ACCEPT         "*/*"
    HTTP_CONNECTION     "close"
    HTTP_HOST           SERVER_NAME
    HTTP_REFERER        ""
    HTTP_USER_AGENT     ""

I hardcoded the last two in case web-based CGI scripts rely upon those variables being set. I'll get to how I handle the output below [7].

I set AUTH_TYPE and REMOTE_USER only if a client certificate is provided AND the configuration enables TLS environment variables to be set (this is specific to my server):

    AUTH_TYPE           "Certificate"
    REMOTE_USER         The common name subfield from the client certificate subject field.

If TLS environment variables are enabled, then I also add the following environment variables:

    TLS_CIPHER
    TLS_VERSION
    TLS_CLIENT_HASH
    TLS_CLIENT_ISSUER
    TLS_CLIENT_ISSUER_*     (each subfield of the client issuer)
    TLS_CLIENT_SUBJECT
    TLS_CLIENT_SUBJECT_*    (each subfield of the client subject)
    TLS_CLIENT_NOT_BEFORE
    TLS_CLIENT_NOT_AFTER
    TLS_CLIENT_REMAIN       TLS_CLIENT_NOT_AFTER - TLS_CLIENT_NOT_BEFORE

Also, if Apache style environment variables are configured, instead of the above TLS_* environment variables, the following Apache environment variables are set:

    SSL_CIPHER
    SSL_PROTOCOL
    SSL_CLIENT_I_DN
    SSL_CLIENT_I_DN_*       (each subfield of the client issuer)
    SSL_CLIENT_S_DN
    SSL_CLIENT_S_DN_*       (each subfield of the client subject)
    SSL_CLIENT_V_START
    SSL_CLIENT_V_END
    SSL_CLIENT_V_REMAIN     SSL_CLIENT_V_END - SSL_CLIENT_V_START

Once the variables are set, the program is executed, the output collected and the response is generated.
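As a rough sketch of that last step (run the program with the prepared environment, collect the output, split headers from body), something like the following would do. It is not how my server does it; run_cgi and parse_cgi_output are names made up for this example, the timeout value is arbitrary, and header names are normalised to "Title-Case" purely for convenience.

```python
import subprocess


def parse_cgi_output(raw: bytes) -> tuple[dict[str, str], bytes]:
    """Split CGI output into one-per-line headers and the body that
    follows the first blank line."""
    headers: dict[str, str] = {}
    pos = 0
    while pos < len(raw):
        end = raw.find(b"\n", pos)
        if end == -1:
            break
        line = raw[pos:end].rstrip(b"\r")
        pos = end + 1
        if not line:  # blank line: everything after it is the body
            return headers, raw[pos:]
        name, sep, value = line.decode("latin-1").partition(":")
        if not sep:
            raise ValueError(f"malformed header line: {line!r}")
        headers[name.strip().title()] = value.strip()
    raise ValueError("no blank line after headers")


def run_cgi(script_path: str, args: list[str], env: dict[str, str],
            timeout: float = 10.0) -> tuple[dict[str, str], bytes]:
    """Execute the script with exactly the given environment and
    return its parsed output. There is no request body, so stdin is
    left empty."""
    result = subprocess.run(
        [script_path, *args],
        env=env,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        timeout=timeout,
        check=False,
    )
    return parse_cgi_output(result.stdout)
```

Because headers must be one-per-line, the parser never has to deal with folded continuation lines, which is what keeps it this short.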
The CGI module will automatically detect three digit status codes and translate appropriately (in order, first match used):

    HTTP status     Gemini status
    -----------     -------------
    200-299         20
    301             31
    300-399         30
    403             60
    404             51
    405             59
    400-499         50
    500-599         40
    000-999         50

If a Gemini response of 20 is seen, then the Content-Type is used; if a Gemini response of 30-39 is seen, then Location: is used; otherwise the text portion of the status reply is used. Any body for a status other than 20 is accepted but ignored. (I specifically ignored 410---I don't have any scripts that return this status, and I can't see this status being returned by a CGI script, but it can be easily added.)

A lot of this was done by also running multiple tests under Apache with a CGI script to see which environment variables are and aren't set. And I'm not saying that all Gemini CGI modules need to support, say, the Apache environment variables or the HTTP specific variables. Again, a lot of this was done just because I could and to also push the limits of what could be run with my CGI module.

Anyway, that's it for this. I think it's enough.

-spc

[1] The other being gemini://gemini.conman.org/

[2] This requirement pushed me over the edge into doing a rewrite of my URL parsing code [6].

[3] On second thought, I think I should reverse that ordering.

[4] Handlers are references to non-files. Examples on my server are the Hi-Lo guessing game, The King James Bible, the Quote of the Day and technically, the gRFC and torture test [5].

[5] These are technically files, but there is some additional processing going on. For the gRFC, they are parsed for date, current status and title. For the torture test, they contain response code and MIME types.

[6] https://github.com/spc476/LPeg-Parsers/blob/e6e321995c512b9076dba452569521cb4cb90cdf/url.lua

[7] Now that I think about it, I should probably set:

    CONTENT_TYPE        text/plain
    CONTENT_LENGTH      0

just to be safe.
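The status translation table in the message above (in order, first match used) can be written down as a small lookup. This is a sketch rather than the server's actual code; the names STATUS_RULES, gemini_status and translate are invented here, and the headers are assumed to arrive as a dictionary with "Status", "Content-Type" and "Location" keys already normalised.

```python
# (low, high, gemini) rows in the same order as the table; the first
# row whose range contains the HTTP status wins.
STATUS_RULES = [
    (200, 299, 20),
    (301, 301, 31),
    (300, 399, 30),
    (403, 403, 60),
    (404, 404, 51),
    (405, 405, 59),
    (400, 499, 50),
    (500, 599, 40),
    (0, 999, 50),
]


def gemini_status(http_status: int) -> int:
    for low, high, gemini in STATUS_RULES:
        if low <= http_status <= high:
            return gemini
    raise ValueError(f"not a three digit status: {http_status}")


def translate(headers: dict[str, str]) -> tuple[int, str]:
    """Return (gemini status, meta text) for a script's headers."""
    # A missing Status: header defaults to '200 OK'.
    code_text, _, phrase = headers.get("Status", "200 OK").partition(" ")
    status = gemini_status(int(code_text))
    if status == 20:
        meta = headers.get("Content-Type", "")
    elif 30 <= status <= 39:
        meta = headers.get("Location", "")
    else:
        meta = phrase
    return status, meta
```

Note that the specific rows (301, 403, 404, 405) have to come before the ranges that contain them, and that 410 falls through to the generic 400-499 row.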
Thanks for taking the time to write this up!

I guess I'm a little surprised / confused to see how closely your implementation sticks to the HTTP version of CGI, i.e. by going to the extent of adding a meaningless dummy environment variable for REQUEST_METHOD because that variable MUST exist in HTTP CGI implementations. I understand that this approach (combined with translating HTTP status codes to Gemini codes) permits existing CGI scripts written for HTTP to be deployed over Gemini without modification, which I suppose is nice, but...I imagine the majority of existing CGI scripts generate HTML output, which, while perfectly cromulent to serve over Gemini, is not really supposed to be mainstream usage.

Back when I started work on Shizaru I expected to one day add CGI support and my plan then was to actually *not* set REMOTE_ADDR at all (along with lots of other things, like stripping any Cookie headers out of the response). I see now this would be an RFC violation, but if I ever get around to this, I'll still do it. Shizaru is supposed to be about Doing the Right Thing and creating a subspace of the web where people can relax and feel safe. It's explicitly an opinionated piece of software and I guess you could even call it an activism project, so, well, RFC be damned.

To the extent that I presume the majority of CGI scripts for Gemini will be written from scratch for Gemini and so backward compatibility isn't a problem, I wonder if it's worthwhile including something in the spec defining an explicit subset of RFC-3875 as being used for Gemini. This would let us get rid of stuff which is obviously not needed (like REQUEST_METHOD), and gives us the option of maybe getting rid of stuff that maybe we can agree we'd rather not have (like REMOTE_ADDR?).

-Solderpunk
solderpunk <solderpunk at SDF.ORG> writes:

> To the extent that I presume the majority of CGI scripts for Gemini will
> be written from scratch for Gemini and so backward compatibility isn't
> a problem, I wonder if it's worthwhile including something in the spec
> defining an explicit subset of RFC-3875 as being used for Gemini. This
> would let us get rid of stuff which is obviously not needed (like
> REQUEST_METHOD), and gives us the option of maybe getting rid of stuff
> that maybe we can agree we'd rather not have (like REMOTE_ADDR?).

I agree with this, but think it ought to be in a separate standard document from the main protocol documentation.

-- 
Jason McBrayer         | "Strange is the night where black stars rise,
jmcbray at carcosa.net | and strange moons circle through the skies,
                       | but stranger still is lost Carcosa."
                       | -- Robert W. Chambers, The King in Yellow
It was thus said that the Great solderpunk once stated:

> Thanks for taking the time to write this up!
>
> I guess I'm a little surprised / confused to see how closely your
> implementation sticks to the HTTP version of CGI,

I would think that's now par for the course 8-P

> i.e. by going to the
> extent of adding a meaningless dummy environment variable for
> REQUEST_METHOD because that variable MUST exist in HTTP CGI
> implementations.

I did it to see just how difficult it would be, and in fact, it wasn't that difficult at all (well, supporting the command line options of RFC-3875 took a bit of work and I had to change how I parse URLs, but that was the only real snag).

> I understand that this approach (combined with
> translating HTTP status codes to Gemini codes) permits existing CGI
> scripts written for HTTP to be deployed over Gemini without
> modification, which I suppose is nice, but...I imagine the majority of
> existing CGI scripts generate HTML output, which, while perfectly
> cromulent to serve over Gemini, is not really supposed to be mainstream
> usage.

No, but I did have ulterior motives---my web-based blog is CGI based, and I can run it via this module (as long as I enable the http-environment variables).

> Back when I started work on Shizaru I expected to one day add CGI
> support and my plan then was to actually *not* set REMOTE_ADDR at all
> (along with lots of other things, like stripping any Cookie headers out
> of the response). I see now this would be an RFC violation, but if I
> ever get around to this, I'll still do it. Shizaru is supposed to be
> about Doing the Right Thing and creating a subspace of the web where
> people can relax and feel safe. It's explicitly an opinionated piece of
> software and I guess you could even call it an activism project, so,
> well, RFC be damned.

I might be inclined to set REMOTE_ADDR to "127.0.0.1" and call it a day.
That way, any CGI scripts that rely upon that have something, and it doesn't leak any information. And nothing in the RFC states you *have* to set the HTTP related environment variables (or do what I do and set them to "" to prevent CGI scripts from bombing out). I could even do that for my CGI module for Gemini---technically, the CGI script doesn't get raw access to the socket, so the data is "technically" coming from the localhost ...

> To the extent that I presume the majority of CGI scripts for Gemini will
> be written from scratch for Gemini and so backward compatibility isn't
> a problem, I wonder if it's worthwhile including something in the spec
> defining an explicit subset of RFC-3875 as being used for Gemini. This
> would let us get rid of stuff which is obviously not needed (like
> REQUEST_METHOD), and gives us the option of maybe getting rid of stuff
> that maybe we can agree we'd rather not have (like REMOTE_ADDR?).

Here are the environment variables defined by RFC-3875, and some appropriate ways to handle them:

    AUTH_TYPE           only if a user certificate, then set to "Certificate"
    CONTENT_LENGTH      don't set (it's optional)
    CONTENT_TYPE        don't set (it's optional)
    GATEWAY_INTERFACE   "CGI/1.1"
    PATH_INFO           required under certain conditions (see RFC)
    PATH_TRANSLATED     required under certain conditions (see RFC)
    QUERY_STRING        "" if no query sent, otherwise, query string (this is MANDATORY)
    REMOTE_ADDR         "127.0.0.1"
    REMOTE_HOST         "127.0.0.1"
    REMOTE_IDENT        don't set (it's optional)
    REMOTE_USER         only if a user certificate, then the CN field of the subject
    REQUEST_METHOD      ""
    SCRIPT_NAME         set (see RFC)
    SERVER_NAME         set
    SERVER_PORT         set
    SERVER_PROTOCOL     "GEMINI"
    SERVER_SOFTWARE     set

If you *want* to run a web-based CGI script, then the following should probably be defined as:

    DOCUMENT_ROOT       maybe set? (Apache sets this)
    REQUEST_METHOD      "GET"
    SERVER_PROTOCOL     "HTTP/1.0"

and translate status codes appropriately.
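To make the suggested handling above concrete, here is a small sketch that builds that environment. The function name gemini_cgi_env and its parameters are invented for illustration; PATH_INFO and PATH_TRANSLATED are left out because they depend on the request path, and DOCUMENT_ROOT is left out because it is only a "maybe".

```python
from typing import Optional


def gemini_cgi_env(
    *,
    script_name: str,
    server_name: str,
    server_port: int,
    server_software: str,
    query: str = "",
    client_cn: Optional[str] = None,
    http_mode: bool = False,
) -> dict[str, str]:
    """Build the RFC-3875 variables as suggested in the table above."""
    env = {
        "GATEWAY_INTERFACE": "CGI/1.1",
        "QUERY_STRING": query,          # mandatory, "" if no query sent
        "REMOTE_ADDR": "127.0.0.1",     # fixed value, leaks nothing
        "REMOTE_HOST": "127.0.0.1",
        "REQUEST_METHOD": "",
        "SCRIPT_NAME": script_name,
        "SERVER_NAME": server_name,
        "SERVER_PORT": str(server_port),
        "SERVER_PROTOCOL": "GEMINI",
        "SERVER_SOFTWARE": server_software,
    }
    if client_cn is not None:
        # Only when the client presented a certificate.
        env["AUTH_TYPE"] = "Certificate"
        env["REMOTE_USER"] = client_cn
    if http_mode:
        # For running a web-based CGI script unchanged.
        env["REQUEST_METHOD"] = "GET"
        env["SERVER_PROTOCOL"] = "HTTP/1.0"
    return env
```

CONTENT_LENGTH, CONTENT_TYPE and REMOTE_IDENT are simply never set, which the RFC allows since they are optional.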
Also, one could define other environment variables starting with "GEMINI_". I can't think of much that could be set (wait, maybe the TLS client certificate stuff, now that I think about it), but it's allowed under the RFC. There are also some critical environment variables used by the OS, like PATH, LANG, etc.; I did set up a way to define these (and others) on a global and per-script basis.

You don't have to deviate from the RFC *that much.*

-spc
---