💾 Archived View for gemi.dev › gemini-mailing-list › 000010.gmi captured on 2024-08-31 at 15:26:21. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

CGI support for Gemini

📧 Messages: 4
🗣️ Authors: 3
📅 First Message: 2019-09-05 18:55
📅 Last Message: 2019-09-08 06:31

1. Sean Conner (sean (a) conman.org)

📅 Sent: 2019-09-05 18:55
📧 Message 1 of 4


  I see that mozz.us has added CGI support for Gemini, which makes two
servers [1] (as far as I can see), so I think it's time to go into a bit of
detail about how I adapted RFC-3875 for Gemini.  I should note that I don't
expect Gemini servers to support this---I only did it because I could.

  First, a bit about the specification. This is just a light overview of the
CGI specification and readers are encouraged to read RFC-3875 for the full
details.  It's a way to generate content for a server using an external
process and was originally meant for the web.  The spec itself is fairly
agnostic to HTTP though, which means it's pretty easy to adapt to Gemini.

  The output from a script contains a header portion, followed by a blank
line, followed by the body.  The headers defined by RFC-3875 are:

	Content-Type:	mandatory, contains MIME type
	Location:	optional, contains absolute URL to new location
	Status:		optional, contains status code and text phrase

  Other headers can be included but it's up to the CGI handler to deal with
those (sections 6.3.4 and 6.3.5).  Unlike email or HTTP headers, any headers
output from a CGI script MUST be one-per-line, which simplifies processing. 
If the Status: header isn't present, it defaults to '200 OK' (but more on
this with respect to Gemini below).
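
  As a minimal sketch of that output format (in Python, and not the code
either server actually uses), the header block can be separated from the
body as shown here.  The function name parse_cgi_output is hypothetical, and
lowercasing the header names is just a convenience chosen for this example.

```python
def parse_cgi_output(raw):
    """Split raw CGI output (bytes) into (headers, body).

    Headers are one per line and end at the first blank line.
    Header names are lowercased for easy lookup (a choice made here).
    """
    headers = {}
    rest = raw
    while True:
        line, _, rest = rest.partition(b"\n")
        line = line.rstrip(b"\r")       # tolerate CRLF line endings
        if line == b"":
            break                       # blank line (or end of data)
        name, colon, value = line.partition(b":")
        if not colon:
            raise ValueError("malformed header line: %r" % line)
        key = name.strip().decode("ascii").lower()
        headers[key] = value.strip().decode("utf-8")
    # A missing Status: header defaults to '200 OK'.
    headers.setdefault("status", "200 OK")
    return headers, rest
```

  Everything after the blank line is returned untouched as the body, so the
caller decides what to do with it.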

  The input to a CGI program comes from environment variables.  The
following variables are defined:

	AUTH_TYPE		authentication type, see below for discussion
	CONTENT_LENGTH		length of incoming body
	CONTENT_TYPE		media-type of incoming body
	GATEWAY_INTERFACE	MUST be "CGI/1.1"
	PATH_INFO		see below for discussion
	PATH_TRANSLATED		see below for discussion
	QUERY_STRING		MUST exist; if no query given, defaults to ""
	REMOTE_ADDR		MUST exist
	REMOTE_HOST		SHOULD exist; MAY be REMOTE_ADDR
	REMOTE_IDENT		MAY exist, see RFC-1413 for info
	REMOTE_USER		see below for discussion
	REQUEST_METHOD		MUST exist
	SCRIPT_NAME		MUST exist, path to script
	SERVER_NAME		MUST exist
	SERVER_PORT		MUST exist
	SERVER_PROTOCOL		MUST exist
	SERVER_SOFTWARE		MUST exist

  There may also exist protocol specific headers (for the web, these would
be "HTTP_..." and for Gemini they should be "GEMINI_...") but these are
implementation defined; there are none defined in the RFC.  Other
environment variables may also exist, depending upon the operating system or
configuration and again are beyond the scope of the RFC.

  Per section 4.4, if the query string does NOT contain an unencoded '='
character, then the query should be broken up into words, splitting on an
unencoded '+' character and passed to the CGI program as command line
parameters; otherwise, if there is an unencoded '=' character, treat the
query string as a list of name/value pairs, and DO NOT pass the query string
as command line parameters. [2]
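
  The rule above can be sketched in a few lines of Python.  This is only an
illustration under the assumption that the server still has the raw
(undecoded) query string at hand; the name query_to_argv is hypothetical.

```python
from urllib.parse import unquote


def query_to_argv(query):
    """Return the command line words for a raw query string.

    An empty list means nothing is passed on the command line.
    """
    # The query is still raw, so a literal '=' is an unencoded one;
    # an encoded '=' would appear as %3D and does not count.
    if query == "" or "=" in query:
        return []
    # Split on the unencoded '+' first, then decode each word.
    return [unquote(word) for word in query.split("+")]
```

  Splitting before decoding matters: a word containing an encoded '+' (%2B)
stays one word.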

  Now for the Gemini-specific parts.  Some of this only applies to my server
implementation and is marked as such; other bits should probably apply to
other implementations to ensure a consistent approach to CGI programs under
Gemini.

  My server receives a request and parses it.  The path portion is then
checked against a redirection list (temporary, permanent, gone [3]), then
against a handler list [4].  Only after passing through that processing is
the file system checked.  As the processing proceeds down the filesystem, if
a file is marked as executable (the server is Unix), then the server assumes
it's a CGI program.  The following environment variables are set:

	GATEWAY_INTERFACE	"CGI/1.1"
	QUERY_STRING		the raw query string per the request, or ""
	REMOTE_ADDR		IP address of client
	REMOTE_HOST		IP address of client
	REQUEST_METHOD		"" (see discussion)
	SCRIPT_NAME		URL path to script
	SERVER_NAME		host of server
	SERVER_PORT		port of server
	SERVER_PROTOCOL		"GEMINI"
	SERVER_SOFTWARE		"GLV-1.12556/1"
	REMOTE_IDENT		not set
	CONTENT_TYPE		not set
	CONTENT_LENGTH		not set

  I don't support RFC-1413, so I don't bother setting REMOTE_IDENT.  There's
no body, so I don't bother setting CONTENT_TYPE or CONTENT_LENGTH.  And
after some thought, I decided to go with "" for REQUEST_METHOD.  Gemini
doesn't have the concept of requests with respect to HTTP.  I mean, I suppose
"GET" *could* apply, but I didn't want to give the impression that Gemini
would offer other methods, so to signal intent, I went with an empty string,
given that REQUEST_METHOD is mandatory.  I only set PATH_INFO and
PATH_TRANSLATED in certain circumstances---when the path continues on past
the CGI program in the request.  For example:

	/test/cgi		PATH_INFO/PATH_TRANSLATED NOT set
	/test/cgi/foo/bar	PATH_INFO/PATH_TRANSLATED set

  Given the second example, they would be set as:

	PATH_INFO	"/foo/bar"
	PATH_TRANSLATED	"/home/spc/projects/gemini/root/foo/bar"

(see RFC-3875 for more details).
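
  A rough Python sketch of that split follows.  It assumes the server has
already worked out which leading part of the request path names the script;
the function name split_path_info is hypothetical, and percent-decoding of
the extra path is left out to keep the example short.

```python
def split_path_info(request_path, script_name, document_root):
    """Return the SCRIPT_NAME/PATH_* variables for one request."""
    if request_path != script_name and \
            not request_path.startswith(script_name + "/"):
        raise ValueError("request path is not under the script")
    env = {"SCRIPT_NAME": script_name}
    extra = request_path[len(script_name):]
    if extra:
        # The path continues past the script: expose the remainder,
        # and map it onto the filesystem under the document root.
        env["PATH_INFO"] = extra
        env["PATH_TRANSLATED"] = document_root.rstrip("/") + extra
    return env
```

  With the values from the example above, a request for /test/cgi/foo/bar
yields a PATH_INFO of "/foo/bar", while a request for /test/cgi sets neither
variable.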

  Since I also run my own website using Apache, you can configure the CGI
script to get Apache style environment variables, and if enabled, the
following will be set:

	DOCUMENT_ROOT		top level directory of Gemini content
	CONTEXT_DOCUMENT_ROOT	same as DOCUMENT_ROOT
	CONTEXT_PREFIX		""
	SCRIPT_FILENAME		actual path to script on filesystem

  And because of that, I added support to run web-based CGI scripts and if
configured (http mode), the following environment variables are also
included (or changed):

	REQUEST_METHOD		"GET"
	SERVER_PROTOCOL		"HTTP/1.0"
	HTTP_ACCEPT		"*/*"
	HTTP_CONNECTION		"close"
	HTTP_HOST		SERVER_NAME
	HTTP_REFERER		""
	HTTP_USER_AGENT		""

  I hardcoded the last two in case web-based CGI scripts rely upon those
variables being set.  I'll get to how I handle the output below [7].

  I set AUTH_TYPE and REMOTE_USER only if a client certificate is provided
AND the configuration enables TLS environment variables to be set (this is
specific to my server):

	AUTH_TYPE	"Certificate"
	REMOTE_USER	The common name subfield from the client certificate
			subject field.

  If TLS environment variables are enabled, then I also add the following
environment variables:

	TLS_CIPHER
	TLS_VERSION
	TLS_CLIENT_HASH
	TLS_CLIENT_ISSUER
	TLS_CLIENT_ISSUER_*	(each subfield of the client issuer)
	TLS_CLIENT_SUBJECT
	TLS_CLIENT_SUBJECT_*	(each subfield of the client subject)
	TLS_CLIENT_NOT_BEFORE
	TLS_CLIENT_NOT_AFTER
	TLS_CLIENT_REMAIN	TLS_CLIENT_NOT_AFTER - TLS_CLIENT_NOT_BEFORE

  Also, if Apache style environment variables are configured, instead of the
above TLS_* environment variables, the following Apache environment
variables are set:

	SSL_CIPHER
	SSL_PROTOCOL
	SSL_CLIENT_I_DN
	SSL_CLIENT_I_DN_*	(each subfield of the client issuer)
	SSL_CLIENT_S_DN
	SSL_CLIENT_S_DN_*	(each subfield of the client subject)
	SSL_CLIENT_V_START
	SSL_CLIENT_V_END
	SSL_CLIENT_V_REMAIN	SSL_CLIENT_V_END - SSL_CLIENT_V_START

  Once the variables are set, the program is executed, the output collected
and the response is generated.  The CGI module will automatically detect
three digit status codes and translate appropriately (in order, first match
used):

	HTTP status	Gemini status
	---------	--------
	200-299		20
	301		31
	300-399		30
	403		60
	404		51
	405		59
	400-499		50
	500-599		40
	000-999		50

  If a Gemini response of 20 is seen, then the Content-Type is used; if a
Gemini response of 30-39 is seen, then Location: is used, otherwise the text
portion of the status reply is used; any body for a status other than 20 is
accepted but ignored (I specifically ignored 410---I don't have any scripts
that return this status, and I can't see this status being returned by a CGI
script, but it can be easily added).
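
  The translation table and the rule for picking the rest of the response
line can be sketched in Python as below.  This is an illustration only, not
the server's actual code; the names HTTP_TO_GEMINI, gemini_status and
gemini_header are hypothetical.

```python
# (low, high, gemini) rows, checked in order; first match wins.
HTTP_TO_GEMINI = [
    (200, 299, 20),
    (301, 301, 31),
    (300, 399, 30),
    (403, 403, 60),
    (404, 404, 51),
    (405, 405, 59),
    (400, 499, 50),
    (500, 599, 40),
    (0, 999, 50),
]


def gemini_status(http_status):
    """Translate a three digit HTTP status into a Gemini status."""
    for low, high, gemini in HTTP_TO_GEMINI:
        if low <= http_status <= high:
            return gemini
    raise ValueError("not a three digit status: %r" % http_status)


def gemini_header(status, content_type="", location=""):
    """Build the Gemini response line from CGI header values."""
    code_text, _, phrase = status.partition(" ")
    code = gemini_status(int(code_text))
    if code == 20:
        meta = content_type        # success: the MIME type
    elif 30 <= code <= 39:
        meta = location            # redirect: the new location
    else:
        meta = phrase              # anything else: the status text
    return "%d %s\r\n" % (code, meta)
```

  Because the specific rows (301, 403, 404, 405) come before the ranges that
contain them, the order of the list is what makes "first match" work.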

  A lot of this was done by also running multiple tests under Apache with a
CGI script to see what is and isn't set per environment variables.  And I'm
not saying that all Gemini CGI modules need to support, say, the Apache
environment variables or the HTTP specific variables.  Again, a lot of this
was done just because I could and to also push the limits of what could be
run with my CGI module.

  Anyway, that's it for this.  I think it's enough.

  -spc

[1]	The other being gemini://gemini.conman.org/

[2]	This requirement pushed me over the edge into doing a rewrite of my
	URL parsing code [6].

[3]	On second thought, I think I should reverse that ordering.

[4]	Handlers are references to non-files.  Examples on my server are the
	Hi-Lo guessing game, The King James Bible, the Quote of the Day and
	technically, the gRFC and torture test [5].

[5]	These are technically files, but there is some additional processing
	going on.  For the gRFC, they are parsed for date, current status
	and title.  For the torture test, they contain response code and
	MIME types.

[6]	https://github.com/spc476/LPeg-Parsers/blob/e6e321995c512b9076dba452569521cb4cb90cdf/url.lua

[7]	Now that I think about it, I should probably set:

		CONTENT_TYPE	text/plain
		CONTENT_LENGTH	0

	just to be safe.


2. solderpunk (solderpunk (a) SDF.ORG)

📅 Sent: 2019-09-07 15:06
📧 Message 2 of 4

Thanks for taking the time to write this up!

I guess I'm a little surprised / confused to see how closely your
implementation sticks to the HTTP version of CGI, i.e. by going to the
extent of adding a meaningless dummy environment variable for
REQUEST_METHOD because that variable MUST exist in HTTP CGI
implementations.  I understand that this approach (combined with
translating HTTP status codes to Gemini codes) permits existing CGI
scripts written for HTTP to be deployed over Gemini without
modification, which I suppose is nice, but...I imagine the majority of
existing CGI scripts generate HTML output, which, while perfectly
cromulent to serve over Gemini, is not really supposed to be mainstream
usage.

Back when I started work on Shizaru I expected to one day add CGI
support and my plan then was to actually *not* set REMOTE_ADDR at all
(along with lots of other things, like stripping any Cookie headers out
of the response).  I see now this would be an RFC violation, but if I
ever get around to this, I'll still do it.  Shizaru is supposed to be
about Doing the Right Thing and creating a subspace of the web where
people can relax and feel safe.  It's explicitly an opinionated piece of
software and I guess you could even call it an activism project, so,
well, RFC be damned.

To the extent that I presume the majority of CGI scripts for Gemini will
be written from scratch for Gemini and so backward compatibility isn't
a problem, I wonder if it's worthwhile including something in the spec
defining an explicit subset of RFC-3875 as being used for Gemini.  This
would let us get rid of stuff which is obviously not needed (like
REQUEST_METHOD), and gives us the option of maybe getting rid of stuff
that maybe we can agree we'd rather not have (like REMOTE_ADDR?).

-Solderpunk


3. Jason McBrayer (jmcbray (a) carcosa.net)

📅 Sent: 2019-09-07 20:41
📧 Message 3 of 4

solderpunk <solderpunk at SDF.ORG> writes:

> To the extent that I presume the majority of CGI scripts for Gemini will
> be written from scratch for Gemini and so backward compatibility isn't
> a problem, I wonder if it's worthwhile including something in the spec
> defining an explicit subset of RFC-3875 as being used for Gemini.  This
> would let us get rid of stuff which is obviously not needed (like
> REQUEST_METHOD), and gives us the option of maybe getting rid of stuff
> that maybe we can agree we'd rather not have (like REMOTE_ADDR?).

I agree with this, but think it ought to be in a separate standard
document from the main protocol documentation.

-- 
Jason McBrayer      | "Strange is the night where black stars rise,
jmcbray at carcosa.net | and strange moons circle through the skies,
                    | but stranger still is lost Carcosa."
                    | -- Robert W. Chambers, The King in Yellow


4. Sean Conner (sean (a) conman.org)

📅 Sent: 2019-09-08 06:31
📧 Message 4 of 4

It was thus said that the Great solderpunk once stated:
> Thanks for taking the time to write this up!
> 
> I guess I'm a little surprised / confused to see how closely your
> implementation sticks to the HTTP version of CGI, 

  I would think that's now par for the course 8-P

> i.e. by going to the
> extent of adding a meaningless dummy environment variable for
> REQUEST_METHOD because that variable MUST exist in HTTP CGI
> implementations.  

  I did it to see just how difficult it would be, and in fact, it wasn't
that difficult at all (well, supporting the command line options of RFC-3875
took a bit of work and I had to change how I parse URLs, but that was the
only real snag).

> I understand that this approach (combined with
> translating HTTP status codes to Gemini codes) permits existing CGI
> scripts written for HTTP to be deployed over Gemini without
> modification, which I suppose is nice, but...I imagine the majority of
> existing CGI scripts generate HTML output, which, while perfectly
> cromulent to serve over Gemini, is not really supposed to be mainstream
> usage.

  No, but I did have ulterior motives---my web-based blog is CGI based, and
I can run it via this module (as long as I enable the http-environment
variables).

> Back when I started work on Shizaru I expected to one day add CGI
> support and my plan then was to actually *not* set REMOTE_ADDR at all
> (along with lots of other things, like stripping any Cookie headers out
> of the response).  I see now this would be an RFC violation, but if I
> ever get around to this, I'll still do it.  Shizaru is supposed to be
> about Doing the Right Thing and creating a subspace of the web where
> people can relax and feel safe.  It's explicitly an opinionated piece of
> software and I guess you could even call it an activism project, so,
> well, RFC be damned.

  I might be inclined to set REMOTE_ADDR to "127.0.0.1" and call it a day. 
That way, any CGI scripts that rely upon that have something, and it doesn't
leak any information.  And nothing in the RFC states you *have* to set the
HTTP related environment variables (or do what I do and set them to "" to
prevent CGI scripts from bombing out).

  I could even do that for my CGI module for Gemini---technically, the CGI
script doesn't get raw access to the socket, so the data is "technically"
coming from the localhost ... 

> To the extent that I presume the majority of CGI scripts for Gemini will
> be written from scratch for Gemini and so backward compatibility isn't
> a problem, I wonder if it's worthwhile including something in the spec
> defining an explicit subset of RFC-3875 as being used for Gemini.  This
> would let us get rid of stuff which is obviously not needed (like
> REQUEST_METHOD), and gives us the option of maybe getting rid of stuff
> that maybe we can agree we'd rather not have (like REMOTE_ADDR?).

  Here are the environment variables defined by RFC-3875, and some
appropriate ways to handle them:

	AUTH_TYPE		only if a user certificate, then set to "Certificate"
	CONTENT_LENGTH		don't set (it's optional)
	CONTENT_TYPE		don't set (it's optional)
	GATEWAY_INTERFACE	"CGI/1.1"
	PATH_INFO		required under certain conditions (see RFC)
	PATH_TRANSLATED		required under certain conditions (see RFC)
	QUERY_STRING		"" if no query sent, otherwise, query string (this is MANDATORY)
	REMOTE_ADDR		"127.0.0.1"
	REMOTE_HOST		"127.0.0.1"
	REMOTE_IDENT		don't set (it's optional)
	REMOTE_USER		only if a user certificate, then the CN field of the subject
	REQUEST_METHOD		""
	SCRIPT_NAME		set (see RFC)
	SERVER_NAME		set
	SERVER_PORT		set
	SERVER_PROTOCOL		"GEMINI"
	SERVER_SOFTWARE		set

  If you *want* to run a web-based CGI script, then the following should
probably be defined as:

	DOCUMENT_ROOT		maybe set? (Apache sets this)
	REQUEST_METHOD		"GET"
	SERVER_PROTOCOL		"HTTP/1.0"

and translate status codes appropriately.
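
  Put together, the two lists above might be built like this in Python.  It
is a sketch under stated assumptions: the function name build_cgi_env, its
parameters, and the SERVER_SOFTWARE value are all hypothetical, and the
PATH_INFO/PATH_TRANSLATED pair is left out.

```python
def build_cgi_env(script_name, server_name, server_port, query="",
                  client_cn=None, http_mode=False):
    """Return the environment for a CGI script run under Gemini."""
    env = {
        "GATEWAY_INTERFACE": "CGI/1.1",
        "QUERY_STRING": query,          # mandatory, "" if no query
        "REMOTE_ADDR": "127.0.0.1",     # fixed, leaks nothing
        "REMOTE_HOST": "127.0.0.1",
        "REQUEST_METHOD": "",
        "SCRIPT_NAME": script_name,
        "SERVER_NAME": server_name,
        "SERVER_PORT": str(server_port),
        "SERVER_PROTOCOL": "GEMINI",
        "SERVER_SOFTWARE": "example-server/0.1",  # hypothetical value
    }
    if client_cn is not None:
        # Only set when the client presented a certificate.
        env["AUTH_TYPE"] = "Certificate"
        env["REMOTE_USER"] = client_cn
    if http_mode:
        # For running scripts originally written for the web.
        env["REQUEST_METHOD"] = "GET"
        env["SERVER_PROTOCOL"] = "HTTP/1.0"
    return env
```

  CONTENT_LENGTH, CONTENT_TYPE and REMOTE_IDENT are simply never added,
which matches the "don't set" entries in the list.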
	
  Also, one could define other environment variables starting with "GEMINI_"
but I can't think of much that could be set (wait, maybe the TLS client
certificate stuff, now that I think about it), but it's allowed under the
RFC.  There are also some critical environment variables used by the OS,
like PATH, LANG, etc.; I did set up a way to define these (and others) on a
global and per-script basis.

  You don't have to deviate from the RFC *that much.*

  -spc

