Logging format for Gemini servers

Anna “CyberTailor” <cyber (a) sysrq.in>

Hello everyone, today I'd like to talk about access logs.

Almost every HTTP server uses NCSA Common Log Format (or its superset -
Combined Log Format). This is very cool, because developers of misc
utilities (like fail2ban or monitoring tools) don't need to bother
writing log parsers for each server.

## Example log entry

 .---------------------- IP address of the client which made the request
 |         .------------ rfc1413 identity (always "-" in practice)
 |         | .---------- authorized user ID (as in .htpasswd file)
 |         | |     .---- datetime string [%d/%b/%Y:%H:%M:%S %z]
 |         | |     |
 |         | |     |
 *         * *     *
 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
                                                                            *   *    *
                                                                            |   |    |
                                                                            |   |    |
                         HTTP method, resource and protocol version --------.   |    |
                         HTTP status code returned to the client ---------------.    |
                         number of bytes of data transferred (without headers) ------.

References:
=> https://en.wikipedia.org/wiki/Common_Log_Format
=> https://publib.boulder.ibm.com/tividd/td/ITWSA/ITWSA_info45/en_US/HTML/guide/c-logs.html#common
=> https://www.loganalyzer.net/log-analyzer/apache-common-log.html
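To show why a shared format is convenient for tool authors, here is a
minimal parser sketch for the entry above. The names CLF_PATTERN and
parse_clf are made up for this illustration, and the regular expression
only covers plain Common Log Format, not the Combined superset.

```python
import re
from datetime import datetime

# host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_clf(line):
    """Return the fields of one Common Log Format entry, or None."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["date"] = datetime.strptime(entry["date"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A "-" in the last field means that no body was sent.
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    return entry


entry = parse_clf(
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326'
)
print(entry["host"], entry["authuser"], entry["status"], entry["bytes"])
```
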

## Adaptability

If you look at the Gophernicus code, it uses Combined Log Format, which
is nice but confusing (seeing the "HTTP/1.0" string and HTTP status
codes in a Gopher server's log feels weird). Still, compatibility is
worth it.

I think Common Log Format can be applied to Gemini too. The only
problem is that the format does not include <META>. It also won't look
good in syslog, because every entry would carry two datetimes.

Let's review the syntax:

> host ident authuser date request status bytes

Everything is obvious except authuser. I suggest using the last 7
characters of the client certificate's SHA-1 hash (git has shown that 7
characters are enough).
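A rough sketch of how a server could build such an entry. Everything
here is an assumption made for illustration: the helper names cert_tag
and format_entry, putting the bare request URL into the request field,
and feeding in the DER-encoded client certificate (in Python that is
what getpeercert(binary_form=True) returns on a TLS socket).

```python
import hashlib
from datetime import datetime, timezone


def cert_tag(cert_der):
    """Last 7 hex characters of the certificate's SHA-1 digest, or "-"."""
    if cert_der is None:  # the client did not present a certificate
        return "-"
    return hashlib.sha1(cert_der).hexdigest()[-7:]


def format_entry(host, cert_der, url, status, size, when=None, ident="-"):
    """Build one line: host ident authuser date request status bytes."""
    when = when or datetime.now(timezone.utc)
    date = when.strftime("%d/%b/%Y:%H:%M:%S %z")
    user = cert_tag(cert_der)
    return f'{host} {ident} {user} [{date}] "{url}" {status} {size}'


line = format_entry(
    "192.0.2.1", None, "gemini://example.space/", 20, 1234,
    when=datetime(2021, 7, 19, 3, 22, 49, tzinfo=timezone.utc),
)
print(line)
```

The <META> value is deliberately left out, since the format has no field
for it.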

## RFC 1413: Ident protocol

If you run a web server, you probably know how useful the User-Agent
header is for identifying robots visiting your website.

Thankfully, Gemini doesn't require client identification as there're no
compatibility issues between different Gemini clients. But that makes
learning anything about robots very hard for capsule operators :(

I appreciate Stéphane Bortzmeyer for including additional info in
robots.txt requests:

> gemini://example.space/robots.txt?robot=true&uri=gemini://gemini.bortzmeyer.org/software/lupa/

I'd like to suggest one more solution to this problem (so we have 15
competing standards later).

Let's suppose Yuri runs a Gemini server and Sergei runs a Gemini search
engine *AND* an identd server, for example, fakeidentd:
=> http://www.guru-group.fi/~too/sw/ A static, secure identd. One source file only!

Sergei's crawler makes a request to Yuri's server. Yuri's server sends
an ident query to Sergei's identd server, reads the response and writes
it to the access log. Yuri reads 'celestial-crawler' in the logs and
gets excited about his capsule getting indexed.

Upsides:

* feels more comfy and personal
=> https://tvtropes.org/pmwiki/pmwiki.php/Main/KilroyWasHere

Downsides:

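* identd probably won't work behind ISP's NAT
* requires writing asynchronous or threaded server code to avoid
  blocking main thread (although separating logger and listener
  processes is a good idea as it's more secure)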
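One way to keep the lookup from blocking the main thread is to run it
as a coroutine with a timeout. This is only a sketch under assumptions:
the function name ident_lookup, the ident_port parameter (there so the
code can be tried against a test server) and the one-second default are
all made up here.

```python
import asyncio


async def ident_lookup(host, foreign_port, local_port,
                       ident_port=113, timeout=1.0):
    """Ask the identd on `host` who owns a connection; None on failure."""

    async def query():
        reader, writer = await asyncio.open_connection(host, ident_port)
        try:
            writer.write(f"{foreign_port},{local_port}\r\n".encode())
            await writer.drain()
            reply = (await reader.readline()).decode(errors="replace")
        finally:
            writer.close()
        fields = [field.strip() for field in reply.split(":", 3)]
        # A successful reply is "<ports> : USERID : <opsys> : <user-id>".
        if len(fields) == 4 and fields[1] == "USERID":
            return fields[3]
        return None

    try:
        # Give up after `timeout` seconds so a silent host cannot stall us.
        return await asyncio.wait_for(query(), timeout)
    except (OSError, asyncio.TimeoutError):
        return None
```

Inside a connection handler this could be awaited as
ident_lookup(peer_ip, peer_port, 1965) while other requests are still
being served.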


What are your thoughts?
Feel free to ask questions.


Anna “CyberTailor” <cyber (a) sysrq.in>

Some missing stuff

### References for ident protocol:

Specification
=> gemini://gemini.bortzmeyer.org/rfc-mirror/rfc1413.txt

Wikipedia article (see "Software" section for identd servers)
=> https://en.wikipedia.org/wiki/Ident_Protocol

## Sample query

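```python
import socket

# foreign_host, foreign_port: address and source port of the connected client
# local_port: the port this Gemini server listens on (1965 by default)
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client:
    client.settimeout(1)
    client.connect((foreign_host, 113))  # identd listens on TCP port 113
    # Query format: <port on the identd host>,<port on the querying host>
    client.sendall(f"{foreign_port},{local_port}\r\n".encode())
    # Reply format: "<ports> : USERID : <opsys> : <user-id>"
    result = client.recv(4096).decode().split(":")[-1].strip()
```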
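Taking the last colon-separated field also turns error replies such as
"NO-USER" into user names. A slightly stricter parser could look like
the sketch below. The name parse_ident_reply is made up, the two sample
replies follow the examples in RFC 1413, and stripping whitespace around
the user ID is a simplification.

```python
def parse_ident_reply(reply):
    """Return the user ID from an ident reply, or None for error replies."""
    # "<ports> : USERID : <opsys> : <user-id>" or "<ports> : ERROR : <error>"
    fields = reply.split(":", 3)
    if len(fields) == 4 and fields[1].strip() == "USERID":
        return fields[3].strip()
    return None


print(parse_ident_reply("6193, 23 : USERID : UNIX : stjohns\r\n"))  # stjohns
print(parse_ident_reply("6195, 23 : ERROR : NO-USER\r\n"))          # None
```
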


Stephane Bortzmeyer <stephane (a) sources.org>

On Mon, Jul 19, 2021 at 03:22:49AM +0500,
 Anna “CyberTailor” <cyber at sysrq.in> wrote
 a message of 91 lines which said:

> Almost every HTTP server uses NCSA Common Log Format (or its
> superset - Combined Log Format). This is very cool, because
> developers of misc utilities (like fail2ban or monitoring tools)
> don't need to bother writing log parsers for each server.

Yes, this is cool but it doesn't mean this format is perfect. The
biggest problem is that it logs the source IP address but not the
source port. Because of the importance of IP address sharing today in
the IPv4 world (RFC 6269
<gemini://gemini.bortzmeyer.org/rfc-mirror/rfc6269.txt>), logging just
the source IP address is a bad idea (RFC 6302
<gemini://gemini.bortzmeyer.org/rfc-mirror/rfc6302.txt> recommends
logging the source port as well).

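Recording the source port next to the address could be as small as the
sketch below. The helper name format_peer is hypothetical, and writing
"address:port" into the host field is an assumption that existing
Common Log Format parsers may not accept.

```python
def format_peer(peername):
    """Render the tuple from socket.getpeername() as one log field."""
    host, port = peername[0], peername[1]
    if ":" in host:
        # IPv6 literal: add brackets so the port stays unambiguous.
        return f"[{host}]:{port}"
    return f"{host}:{port}"


print(format_peer(("192.0.2.1", 54321)))          # 192.0.2.1:54321
print(format_peer(("2001:db8::1", 54321, 0, 0)))  # [2001:db8::1]:54321
```
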

Also, of course, there is the privacy issue. IMHO, Gemini servers
should offer an option to log only the first N bits of the source IP
address.
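Such an option could be sketched with Python's ipaddress module. The
function name truncate_ip and the default prefix lengths (24 bits for
IPv4, 48 for IPv6) are arbitrary choices for this example, not a
recommendation.

```python
import ipaddress


def truncate_ip(address, v4_bits=24, v6_bits=48):
    """Keep only the first N bits of an address and zero out the rest."""
    ip = ipaddress.ip_address(address)
    bits = v4_bits if ip.version == 4 else v6_bits
    # strict=False lets ip_network() mask off the host bits for us.
    network = ipaddress.ip_network(f"{ip}/{bits}", strict=False)
    return str(network.network_address)


print(truncate_ip("203.0.113.57"))           # 203.0.113.0
print(truncate_ip("2001:db8:1234:5678::1"))  # 2001:db8:1234::
```
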

>  |         | |     .---- datetime string [%d/%b/%Y:%H:%M:%S %z]

RFC 3339 <gemini://gemini.bortzmeyer.org/rfc-mirror/rfc3339.txt>
format would have been a better idea.
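For comparison, the timestamp from the sample entry can be converted to
RFC 3339 with the standard library alone. The function name
clf_to_rfc3339 is made up for this sketch.

```python
from datetime import datetime


def clf_to_rfc3339(clf_date):
    """Convert a Common Log Format timestamp to an RFC 3339 string."""
    parsed = datetime.strptime(clf_date, "%d/%b/%Y:%H:%M:%S %z")
    return parsed.isoformat()


print(clf_to_rfc3339("10/Oct/2000:13:55:36 -0700"))  # 2000-10-10T13:55:36-07:00
```
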

> Thankfully, Gemini doesn't require client identification as there're no
> compatibility issues between different Gemini clients. But that makes
> learning anything about robots very hard for capsule operators :(

Indeed, this is a serious operational problem. There have been some
attempts to list all "good" robots somewhere but it was not a success.

> I appreciate Stéphane Bortzmeyer for including additional info in
> robots.txt requests:
> 
> > gemini://example.space/robots.txt?robot=true&uri=gemini://gemini.bortzmeyer.org/software/lupa/

Note that it breaks some Gemini servers
<https://framagit.org/bortzmeyer/lupa/-/issues/9>.

> Downsides:
> * identd probably won't work behind ISP's NAT
> * requires writing asynchronous or threaded server code to avoid
>   blocking main thread (although separating logger and listener
>   processes is a good idea as it's more secure)

Indeed. It seems to me there are serious limitations.

