2016, Apr 10 - Dimitri Merejkowsky License: CC By 4.0
This article is a comment on an article I read some time ago.
It's called *Get rid of syslog (or a journald log filter in ~100 lines of Python)*, and you can read it here[1].
1: https://tim.siosm.fr/blog/2014/02/24/journald-log-scanner-python
It contains some advice about how to parse systemd logs, following the very good principles laid out in yet another blog post[2], called *The Six Dumbest Ideas in Computer Security*.
2: http://www.ranum.com/security/computer_security/editorials/dumb
Here's a short executive summary:
Since I run an `nginx` web server and I'm also concerned about security, I tried to apply the same principles to the access logs I get whenever anyone accesses my domain.
By default, nginx logs look like this:
    # Harmless log:
    94.224.234.9 - - [10/Apr/2016:14:49:32 +0200] "GET /tweets/ HTTP/1.1" 200 247 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0"

    # Probably someone trying to see if I'm running wordpress (nice try ...)
    130.185.155.10 - - [09/Apr/2016:13:14:02 +0200] "GET /wp-login.php HTTP/1.1" 403 570 "-" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
To say the least, it does not look trivial at first glance to separate harmless logs from the rest.
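To make the difficulty concrete, here is a minimal sketch of what picking fields out of the default format involves. The regular expression and the `parse_combined` helper are my own illustration, not something from the nginx documentation, and they assume the log lines follow the layout of the two examples above.

```python
import re

# Assumed layout, based on the two sample lines:
# ip - user [time] "request" status size "referer" "user agent"
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_combined(line):
    """Return a dict of fields, or None if the line does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    return fields


sample = (
    '130.185.155.10 - - [09/Apr/2016:13:14:02 +0200] '
    '"GET /wp-login.php HTTP/1.1" 403 570 "-" '
    '"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"'
)
parsed = parse_combined(sample)
print(parsed["status"], parsed["request"])
```

It works, but the pattern is fragile and hard to read, which is a good reason to look for something simpler.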
Fortunately (as is often the case), the answer is in the documentation.
You can see that the `nginx` folks have taken this use case into account:
    # This example excludes requests with HTTP status codes 2xx (Success) and 3xx (Redirection)
    map $status $loggable {
        ~^[23]  0;
        default 1;
    }

    access_log /path/to/access.log combined if=$loggable;
The bad news is that this feature is only available for `nginx >= 1.7.0`.
But wait! There's another way.
You can make the `nginx` logs much more readable by using something like:
    log_format machine_readable '$time_local | '
                                '$status | '
                                '$remote_addr | '
                                '$request | '
                                '$http_user_agent | '
                                '$http_referer';

    access_log /var/log/nginx/machine_readable.log machine_readable;
This makes sure every field in the log message is separated by a pipe (which seldom occurs in URLs, requests or user agents).
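As a quick illustration, here is what splitting such a line looks like. Both sample lines are made up for this sketch: the first reuses the values of the harmless request shown earlier, the second has a hypothetical user agent containing a pipe, to show what "seldom" means in practice. The `split_fields` helper is also just a name I picked.

```python
def split_fields(line):
    """Split a machine_readable line into its pipe-separated fields."""
    return [field.strip() for field in line.split("|")]


# Made-up line built from the values of the harmless request
clean = (
    "10/Apr/2016:14:49:32 +0200 | 200 | 94.224.234.9 | "
    "GET /tweets/ HTTP/1.1 | "
    "Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0 | -"
)

# Hypothetical line whose user agent contains a pipe
odd = (
    "09/Apr/2016:13:14:02 +0200 | 403 | 130.185.155.10 | "
    "GET /wp-login.php HTTP/1.1 | evil|agent | -"
)

print(len(split_fields(clean)))  # 6
print(len(split_fields(odd)))    # 7
```

A line that does not yield exactly six fields cannot be trusted, so it makes sense to treat it as suspicious rather than try to repair it.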
Then I built my own parser in Python, using the fact that the second field is the status:
    for line in lines:
        # ....
        try:
            fields = [x.strip() for x in line.split("|")]
            date, status, ip, request, user_agent, referer = fields
            status_code = int(status)
        except (ValueError, IndexError) as e:
            # Not good, print it!
            print("WARNING: parsing log failed", e)
            print(line)
            continue
        if not harmless(status_code, ip, request, user_agent):
            print(line)

    # ....

    def harmless(status_code, ip, request, user_agent):
        # ... some checks using ip and user agent ...
        if status_code >= 200 and status_code < 400:
            return True
        # used by nginx
        if status_code == 499:
            return True
        # We've filtered all the goodness:
        return False
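For readers who want something they can run right away, here is a self-contained sketch of the same idea. It is not the actual script: the `TRUSTED_IPS` allow-list stands in for the ip and user agent checks that are elided above, the `suspicious_lines` generator is a name chosen for this example, and the sample lines are made up.

```python
# Hypothetical allow-list standing in for the elided ip / user agent checks
TRUSTED_IPS = {"127.0.0.1"}


def harmless(status_code, ip, request, user_agent):
    if ip in TRUSTED_IPS:
        return True
    if 200 <= status_code < 400:
        return True
    # 499 is what nginx logs when the client closes the connection early
    if status_code == 499:
        return True
    # We've filtered all the goodness:
    return False


def suspicious_lines(lines):
    """Yield every line that is not known to be harmless."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        try:
            fields = [x.strip() for x in line.split("|")]
            date, status, ip, request, user_agent, referer = fields
            status_code = int(status)
        except ValueError:
            # Wrong number of fields or non-numeric status: flag it
            yield line
            continue
        if not harmless(status_code, ip, request, user_agent):
            yield line


# Made-up sample lines in the machine_readable format
sample = [
    "10/Apr/2016:14:49:32 +0200 | 200 | 94.224.234.9 | GET /tweets/ HTTP/1.1 | Firefox | -",
    "09/Apr/2016:13:14:02 +0200 | 403 | 130.185.155.10 | GET /wp-login.php HTTP/1.1 | MSIE | -",
    "09/Apr/2016:13:15:00 +0200 | 404 | 127.0.0.1 | GET /missing HTTP/1.1 | curl | -",
    "not a log line",
]
flagged = list(suspicious_lines(sample))
for line in flagged:
    print(line)
```

Since `suspicious_lines` accepts any iterable of strings, an open file object can be passed to it directly.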
----