[spec] Adapting the HTTP Common Logging Format for use by Gemini servers

1. John Cowan (cowan (a) ccil.org)

The Common Logging Format as defined by Apache and other HTTP servers
contains one line per client request divided into seven fields separated by
a single space.  They are:

1) IP address, either IPv4 or IPv6

2) Hostname of the client, or "-" if not known.

3) Name of the user, or "-" if not known.

4) Date in square brackets, in the form [10/Oct/2000:13:55:36 -0700].

5) Request line in double quotes.

6) Status code of response.

7) Number of bytes in the response body, or 0 if none.

I think there are two reasonable approaches to adapting this format to
Gemini, the "as compatible as possible" or "ACAP" approach, and the
"literal" approach.  In either approach, fields 1 and 7 are just as in
HTTP, and fields 2 and 3 are just "-".

On the ACAP approach, field 4 uses the date format above, field 5 contains
GET followed by the path segment of the URL followed by HTTP/1.1 (all space
separated), and field 6 contains the Gemini code converted to an equivalent
HTTP code (e.g. 20 becomes 200).  I'll work out the full equivalence later
if people like this.
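
A minimal Python sketch of the ACAP approach, offered as an illustration rather than a proposed standard. The name `acap_line` and the status table are assumptions: only the 20 to 200 pair comes from the message above, and the remaining entries (and the fallback of 500) are placeholder guesses until the full equivalence is worked out.

```python
from datetime import datetime, timezone, timedelta
from urllib.parse import urlsplit

# Only 20 -> 200 is stated in the thread; every other pair is a guess.
GEMINI_TO_HTTP = {20: 200, 30: 302, 31: 301, 51: 404}


def acap_line(ip, when, url, gemini_status, nbytes):
    """Format one request as a Common Logging Format line (ACAP style)."""
    path = urlsplit(url).path or "/"
    # %b is locale-dependent; the default C locale yields "Oct", "Dec", etc.
    stamp = when.strftime("%d/%b/%Y:%H:%M:%S %z")
    http_status = GEMINI_TO_HTTP.get(gemini_status, 500)  # fallback is an assumption
    # Fields 2 and 3 (hostname, user) are always "-".
    return f'{ip} - - [{stamp}] "GET {path} HTTP/1.1" {http_status} {nbytes}'


when = datetime(2000, 10, 10, 13, 55, 36, tzinfo=timezone(timedelta(hours=-7)))
print(acap_line("192.0.2.1", when, "gemini://example.org/docs/", 20, 1540))
```

Note that the scheme, host and query of the URL are dropped here, which is why this format cannot be converted back to the literal one.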

On the literal approach, field 4 is ISO 8601 (RFC 3339) format, field 5 is
the URL request line (no quotes needed), and field 6 is the Gemini status
code unconverted.
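
A comparable sketch of the literal approach, again only an illustration. The name `literal_line` is hypothetical, and the code assumes the request URL contains no raw spaces (a valid URL does not), so the line still splits cleanly into seven space-separated fields.

```python
from datetime import datetime, timezone


def literal_line(ip, when, url, gemini_status, nbytes):
    """Format one request in the 'literal' style: ISO 8601 date, raw URL, raw status."""
    stamp = when.isoformat(timespec="seconds")  # e.g. 2020-12-27T19:59:02+00:00
    # Fields 2 and 3 stay "-"; the URL is not quoted and the status is not converted.
    return f"{ip} - - {stamp} {url} {gemini_status} {nbytes}"


when = datetime(2020, 12, 27, 19, 59, 2, tzinfo=timezone.utc)
print(literal_line("192.0.2.1", when, "gemini://example.org/docs/", 20, 1540))
```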

The advantage of the ACAP approach is that it allows existing HTTP log
analyzers to be used.  The literal approach keeps all available information
but will need its own analysis tools.  Of course, a server can support both
log formats as well as any other formats desired, so the question is which
format is Best Practice if only one is provided.  It's possible to convert
literal format to ACAP format after the fact, but not vice versa.



John Cowan          http://vrici.lojban.org/~cowan        cowan at ccil.org
    "Mr. Lane, if you ever wish anything that I can do, all you will have
        to do will be to send me a telegram asking and it will be done."
    "Mr. Hearst, if you ever get a telegram from me asking you to do
        anything, you can put the telegram down as a forgery."

2. Solderpunk (solderpunk (a) posteo.net)

Historically, efforts to establish a common logging format for Gemini
have not been well received.  I think there's already too much diversity
out there and server authors are reluctant to change.  The question of
whether or not IP addresses should be routinely logged also usually
proves quite divisive.

I still think there's value in such a standard (as it allows for
reusable log processing tools), but I definitely think it's out of scope
for the protocol spec proper and belongs in a companion spec.  I think
those are better suited to [tech] than [spec]?

For the record, I don't like that the Apache format uses spaces as a
field separator when spaces also occur inside the date.  Molly Brown's
log format uses tabs as separators, so it works very nicely with the
standard `cut` utility.  I use `cut`, `grep`, `sort`, `uniq` and `wc -l`
in short pipelines to run queries on my logs, and really enjoy being
able to do so.
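
For readers who prefer a script to a shell pipeline, here is a rough Python equivalent of a `cut -f N | sort | uniq -c` query over a tab-separated log. The column layout in the sample (timestamp, remote address, status, URL) is a made-up assumption for illustration and is not Molly Brown's actual format; `count_by_field` is a hypothetical helper name.

```python
from collections import Counter


def count_by_field(lines, field_index):
    """Count how often each value appears in one tab-separated column."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if field_index < len(fields):  # skip short or malformed lines
            counts[fields[field_index]] += 1
    return counts


# Hypothetical sample: timestamp, remote address, status, URL.
sample = [
    "2020-12-27T20:00:00Z\t192.0.2.1\t20\tgemini://example.org/\n",
    "2020-12-27T20:00:05Z\t192.0.2.2\t51\tgemini://example.org/missing\n",
    "2020-12-27T20:00:09Z\t192.0.2.1\t20\tgemini://example.org/docs/\n",
]
for status, n in count_by_field(sample, 2).most_common():
    print(n, status)
```

Because the date contains no tabs, splitting on the tab character never breaks a field apart, which is the advantage over space-separated fields described above.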

Cheers,
Solderpunk


3. Arav K. (nothien (a) uber.space)

On Sun, Dec 27, 2020 at 02:59:02PM -0500, John Cowan wrote:
> On the literal approach, field 4 is ISO 8601 (RFC 3339) format, field 5 is
> the URL request line (no quotes needed), and field 6 is the Gemini status
> code unconverted.

We want to be careful about malicious clients sending a request like
'\n<garbage or fake log here>'.  Although that may fail, it would still
show up in the logs and mess them up.  Perhaps the logger should check
if the request line is a proper URL, and if it is not it would encode it
in some way (perhaps just URL-encoding it, because that function may
already be available to the code).
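
A small Python sketch of that idea, under stated assumptions: `loggable_request` is a hypothetical helper, and the "proper URL" check is a crude heuristic (printable ASCII only, no spaces or double quotes, contains `://`) rather than real URL validation.

```python
from urllib.parse import quote


def loggable_request(raw: str) -> str:
    """Return the request unchanged if it looks clean, else percent-encode all of it."""
    looks_clean = "://" in raw and all(
        0x21 <= ord(ch) <= 0x7E and ch != '"' for ch in raw
    )
    if looks_clean:
        return raw
    # safe="" also encodes "/" so the escaped form cannot be mistaken for a real URL.
    return quote(raw, safe="")


print(loggable_request("gemini://example.org/docs/"))
print(loggable_request("\n192.0.2.1 fake entry"))
```

Whatever the input, the result contains no newline, so a single request can never produce more than one log line.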

~aravk | ~nothien

4. Côme Chilliet (come (a) chilliet.eu)

The main differences with what my server is doing are:

* I do not log the IP but its sha1 hash, because of privacy concerns

important (especially for errors)


5. colecmac (a) protonmail.com (colecmac (a) protonmail.com)

> * I do not log the IP but its sha1 hash, because of privacy concerns

Doesn't this provide no security though? It's trivial to hash all IPv4
addresses and compare them. Additionally, this doesn't provide any
security to clients, because they can't guarantee this is in effect.

makeworld


6. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 27, 2020, at 20:59, John Cowan <cowan at ccil.org> wrote:
> 
> The Common Logging Format as defined by Apache and other HTTP servers
> contains one line per client request divided into seven fields separated
> by a single space.  They are:
> 

IMO, that should be left to the implementations. Their choice. Whatever is convenient.

It doesn't hurt to point to the Common Logging Format as an FYI, and even to
suggest a mapping, the same way the CGI spec has been shoehorned back into Gemini.

But that really seems to be an implementation detail.


7. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 27, 2020, at 22:48, colecmac at protonmail.com wrote:
> 
> Doesn't this provide no security though? It's trivial to hash all IPv4
> addresses and compare them. Additionally, this doesn't provide any
> security to clients, because they can't guarantee this is in effect.

Genau. Privacy by obscurity is no privacy at all.

Furthermore, TLS leaves a big, fat digital signature trail. 

In any case, best to leave such details to the implementations.


8. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 27, 2020, at 22:53, Petite Abeille <petite.abeille at gmail.com> wrote:
> 
> Furthermore, TLS leaves a big, fat digital signature trail. 

Previously on Gemini:

https://tlsfingerprint.io
https://tlsfingerprint.io/static/frolov2019.pdf

Related to:
https://ssd.eff.org/en/module/what-fingerprinting

While Gemini has far fewer moving parts than HTTP, it still has some.

I'm not a privacy expert though, so not sure how practical this all is. 
But a trail is a trail :)


9. Sean Conner (sean (a) conman.org)

It was thus said that the Great Solderpunk once stated:
> 
> For the record, I don't like that the Apache format uses spaces as a
> field separator when spaces also occur inside the date.  Molly Brown's
> log format uses tabs as separators, so it works very nicely with the
> standard `cut` utility.  I use `cut`, `grep`, `sort`, `uniq` and `wc -l`
> in short pipelines to run queries on my logs, and really enjoy being
> able to do so.

  My own logging format is:

	remote=XXX.XXX.XXX.XXX status=20 request="gemini://gemini.conman.org/boston/2001/11/13.1" bytes=1540 subject="" issuer=""

(I've redacted the IP address)

  The final two fields record information about the client certificate to
help debug issues with my server.  Here's an example:

	remote=XXX.XXX.XXX.XXX status=20 request="gemini://gemini.conman.org/private/" bytes=333 subject="/CN=default" issuer="/CN=default"

  I did not change the subject or issuer.  It's been interesting to see
what's being sent in client certificates.
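
  A key=value format like the one above is straightforward to read back. The following Python sketch is an illustration only: `parse_kv_line` is a hypothetical name, the sample address is a documentation placeholder, and the code assumes values are either bare words or double-quoted strings that shell-style splitting can handle (how the server escapes quotes or backslashes inside values is not stated in the message).

```python
import shlex


def parse_kv_line(line):
    """Split a key=value log line into a dict, honouring double-quoted values."""
    record = {}
    for token in shlex.split(line):
        # Split on the first "=" only, so values such as /CN=default stay intact.
        key, _, value = token.partition("=")
        record[key] = value
    return record


line = ('remote=192.0.2.7 status=20 request="gemini://gemini.conman.org/private/" '
        'bytes=333 subject="/CN=default" issuer="/CN=default"')
rec = parse_kv_line(line)
print(rec["status"], rec["bytes"], rec["subject"])
```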

  -spc


10. Côme Chilliet (come (a) chilliet.eu)

On Sunday, 27 December 2020, 22:48:13 CET, colecmac at protonmail.com wrote:
> > * I do not log the IP but its sha1 hash, because of privacy concerns
> 
> Doesn't this provide no security though? It's trivial to hash all IPv4
> addresses and compare them. Additionally, this doesn't provide any
> security to clients, because they can't guarantee this is in effect.

It's not for clients, it's for me. I'm not sure what I am legally allowed
to do with IPs, so I feel more confident not storing them.
I sha1 IPs the same way whether they are v4 or v6. It may indeed be easy to
mount a dictionary attack on v4 log entries, but I'm not sure what I can do about that.

Côme


11. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 28, 2020, at 12:45, Côme Chilliet <come at chilliet.eu> wrote:
> 
> I sha1 IPs the same whether they are v4 or v6. It may indeed be easy to
> do a dictionary attack for v4 log entries, but I'm not sure what I can do about that.

See https://en.wikipedia.org/wiki/Rainbow_table#Defense_against_rainbow_tables


12. Philip Linde (linde.philip (a) gmail.com)

On Sun, 27 Dec 2020 21:39:41 +0100
Côme Chilliet <come at chilliet.eu> wrote:

> * I do not log the IP but its sha1 hash, because of privacy concerns

Please note that the table of the sha-1 of the entire IPv4 address space
is ~80 GiB and that such a measure can easily be reversed if not
individually salted before hashing (after which comparing hashes in
log entries is useless), even if I have to resort to searching the
whole IPv4 address space. You should *not* depend on this measure where
you have a real need for privacy.
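
The arithmetic and the trade-off can be sketched in Python. This is an illustration under assumptions: it presumes the server hashes the dotted-decimal text of the address (the thread does not say how the address is encoded), the function names are hypothetical, and the search is limited to a /24 documentation network so that it finishes instantly.

```python
import hashlib
import ipaddress
import os

SHA1_BYTES = hashlib.sha1().digest_size  # 20 bytes per digest


def table_size_gib():
    """Size of a table holding one SHA-1 digest per IPv4 address, in GiB."""
    return (2**32 * SHA1_BYTES) / 2**30


def reverse_sha1(target_hex, network):
    """Brute-force search: hash every address in `network` until one matches."""
    for addr in ipaddress.ip_network(network):
        if hashlib.sha1(str(addr).encode()).hexdigest() == target_hex:
            return str(addr)
    return None


def salted_sha1(ip, salt):
    """Hash with a per-entry salt; equal IPs no longer give equal digests."""
    return hashlib.sha1(salt + ip.encode()).hexdigest()


print(table_size_gib())  # 80.0
logged = hashlib.sha1(b"192.0.2.77").hexdigest()
print(reverse_sha1(logged, "192.0.2.0/24"))  # 192.0.2.77
# Two fresh salts give two different digests for the same address.
print(salted_sha1("192.0.2.77", os.urandom(16)) == salted_sha1("192.0.2.77", os.urandom(16)))
```

The last line shows the cost mentioned above: once each entry has its own salt, two visits from the same address can no longer be matched by comparing hashes.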

-- 
Philip

13. Stephane Bortzmeyer (stephane (a) sources.org)

On Sun, Dec 27, 2020 at 09:04:01PM +0100,
 Solderpunk <solderpunk at posteo.net> wrote 
 a message of 20 lines which said:

> The question of whether or not IP addresses should be routinely
> logged also usually proves quite divisive.

By the way, *if* you log IP addresses (this is a big IF), in a world
of NAT and CGNAT, you should also log the port, as requested by RFC
6302 <gemini://gemini.bortzmeyer.org/rfc-mirror/rfc6302.txt>
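
As a rough illustration, a server that does log addresses could record the remote endpoint as a single address-plus-port field. The helper name `format_endpoint` and the bracketed IPv6 notation are assumptions of this sketch, not something the RFC or this thread prescribes.

```python
def format_endpoint(ip, port):
    """Render address and source port as one log field, bracketing IPv6."""
    host = f"[{ip}]" if ":" in ip else ip  # brackets keep IPv6 colons unambiguous
    return f"{host}:{port}"


# With a connected socket, the pair would typically come from conn.getpeername()[:2].
print(format_endpoint("192.0.2.1", 51234))
print(format_endpoint("2001:db8::1", 51234))
```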
