
Gemini server logging formats and practices

Sean Conner sean at conman.org

Tue May 12 22:39:08 BST 2020

- - - - - - - - - - - - - - - - - - -

It was thus said that the Great solderpunk once stated:
> On Mon, May 11, 2020 at 05:21:01AM -0400, Sean Conner wrote:
> > But there's really not much to log, other than remote address, request,
> > status, and potentially the issuer/subject of any given certificate (and
> > even that might be optional).
> 
> I agree, there's not much to log, far less than HTTP.  But there's some
> real utility in having, err, utilities which can parse a log and
> generate basic statistics an admin might like to know: most popular
> resources, frequent requests resulting in 51 Not Found responses,
> average requests per day.  And people are more likely to write things
> like this if there's one format to worry about and not one per server.
> 
> This isn't a hard side-project by any means.  Something very simple and
> easy to read into existing data processing tools, like a comma separated
> or tab separated value file with standardised names and/or order for the
> columns and an agreed-upon representation of time would do the trick.
> 
> I am understanding of and sympathetic towards both admins who want to
> log IPs for debugging or abuse-detection purposes and towards those who
> don't want to so they can (rightfully) boast about their servers' respect
> for privacy.  So the standard format should include a column for remote
> IP and also have a clearly defined behaviour for anonymised logs which
> log analysers can recognise and handle gracefully (as simple as
> specifying a standard character, like "-", to be placed in that column).
> We could also define a half-way format, where a compact hash of the IP is
> logged, so that unique visitor statistics can be calculated for those
> who want them, or e.g. malfunctioning bots can be spotted, but nothing
> personally identifying is kept.

  Okay, here's the format I'm using, and some notes about it:

remote=---.---.---.--- status=20 request="gemini://gemini.conman.org/private/" bytes=213 subject="/CN=AV-98 cert test" issuer="/CN=AV-98 cert test"
remote=---.---.---.--- status=20 request="gemini://gemini.conman.org/" bytes=3026 subject="" issuer=""

(NOTE: I'm blanking out the IP address)

  As stated before, I'm using syslog() for logging, so I'm not logging the
time since that's added by syslog().  The request, subject and issuer fields
are quoted, so any double quote that appears inside has to be escaped with
'\' [1].  The fields are named, so they are (assuming you know English)
self-describing (to a degree).  I include bytes, but that's because I have
that information (even for dynamic output) due to how I wrote the server.
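
  To make that concrete, here's a minimal sketch of how such a line could be
built in Lua (illustrative only, not the actual GLV-1.12556 code);
string.format's %q quotes a string and backslash-escapes any embedded double
quotes, which is one way to get the escaping described above:

	-- build one field=value record; a real server would hand the
	-- resulting string to syslog(), which supplies the timestamp
	local function log_entry(remote,status,request,nbytes,subject,issuer)
	  return string.format(
	    "remote=%s status=%d request=%q bytes=%d subject=%q issuer=%q",
	    remote,status,request,nbytes,subject or "",issuer or "")
	end

	print(log_entry("---.---.---.---",20,"gemini://gemini.conman.org/",3026))

That call reproduces the second sample line above.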

  Aside from the time field (which could be added, but for me, would be
redundant since syslog() does that for me), I would say the pros of this
format are:

	* it's self describing
	* technically, the fields could be in any order
	* the fields could be made optional
	* new fields can be added (parsers should ignore fields they don't know)
	* there are tools that already exist that can handle field=value data
	  (for example, Splunk, which we use at work)

The cons:

	* it takes more space due to the field names
	* it's slightly more complex to parse [2]
	* the double quote will need to be escaped in some fields [2]
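
  On the "slightly more complex to parse" point, here's a rough sketch (my
own illustration, not code from any existing tool) of pulling one such line
apart into a table; it keeps every name=value pair it finds, so fields it
doesn't know about are harmless:

	-- parse one field=value line; quoted values may contain \" escapes,
	-- unquoted values run to the next whitespace
	local function parse_entry(line)
	  local fields = {}
	  local pos    = 1
	  while pos <= #line do
	    local _,e,name = line:find("^%s*([%w_]+)=",pos)
	    if not e then break end
	    pos = e + 1
	    if line:sub(pos,pos) == '"' then
	      local buf = {}
	      local i   = pos + 1
	      while i <= #line do
	        local c = line:sub(i,i)
	        if c == "\\" then
	          buf[#buf+1] = line:sub(i+1,i+1)
	          i = i + 2
	        elseif c == '"' then
	          i = i + 1
	          break
	        else
	          buf[#buf+1] = c
	          i = i + 1
	        end
	      end
	      fields[name] = table.concat(buf)
	      pos = i
	    else
	      local _,e2,value = line:find("^(%S+)",pos)
	      if not e2 then break end
	      fields[name] = value
	      pos = e2 + 1
	    end
	  end
	  return fields
	end

Feeding it the first sample line gives a table with fields.remote,
fields.status, fields.request and so on; a field added later just shows up
as another key.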

  Thinking about it, a tool to process logs could require a module (or
plugin) to handle reading the data, so for instance, I write a module for
this tool to handle GLV-1.12556, and you could write one for Molly Brown.
The data these modules should return (and its format) would be:

	timestamp	ISO-8601 format, e.g. 2020-05-12T17:33-0500
	remote address	opaque token (could be IPv4, IPv6, hash value)
	request		string
	bytes		number

and optionally:

	client certificate issuer	string
	client certificate subject	string
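
  Purely as a sketch of the idea (nothing here exists yet; the names are
made up), such a module would just normalize whatever its server logs into
records with the fields above:

	-- hypothetical per-server module: parse_line is the server-specific
	-- part (for GLV-1.12556 it could be the field=value parser sketched
	-- earlier); records() normalizes each line into a common record
	local M = {}

	function M.records(parse_line,lines)
	  return coroutine.wrap(function()
	    for line in lines do
	      local f = parse_line(line)
	      coroutine.yield {
	        timestamp = f.timestamp,       -- ISO-8601 string
	        remote    = f.remote,          -- opaque token (IP, hash, or "-")
	        request   = f.request,         -- string
	        bytes     = tonumber(f.bytes), -- number
	        issuer    = f.issuer,          -- optional
	        subject   = f.subject,         -- optional
	      }
	    end
	  end)
	end

	return M

A log analyser could then loop over M.records(parse_entry,io.lines("gemini.log"))
without caring which server wrote the file.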

  -spc (refuses to give up syslog)

[1]	Which, for me, is handled automatically by Lua, the language my
	server is written in.

[2]	But not horrible either.  If you use a CSV format, you have to deal
	with escaping commas that might appear in a field *anyway* [3].  At
	least with a TSV (tab-separated values) there's less chance of
	actual tabs being in the data, but you have to be careful when
	parsing to handle two consecutive tabs (or two consecutive commas
	when using CSV).
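
	A tiny illustration of that last point (my own, not from any existing
	tool): splitting on each individual tab keeps an empty field that a
	naive "split on whitespace" would silently drop.

	local line   = "2020-05-12T17:33-0500\t-\t\t1024"  -- empty third field
	local fields = {}
	for field in (line .. "\t"):gmatch("([^\t]*)\t") do
	  fields[#fields+1] = field
	end
	-- fields = { "2020-05-12T17:33-0500", "-", "", "1024" }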

[3]	And if you think CSV is a well-defined format, you haven't been out
	much.