💾 Archived View for gemini.bvnf.space › blog › 004_counting_with_unix.gmi captured on 2023-07-10 at 13:17:28. Gemini links have been rewritten to link to archived content
Let's construct a pipeline to parse some line-based text data!
Network connections and TLS for this gemini server are handled by relayd(8) (see my first blog post), which by default logs some basic information for each request. It spits out something like the following into /var/log/daemon:
```
Nov 15 23:52:48 bvnf relayd[89123]: relay gemini, session 170 (1 active), 0, XXX.XXX.XXX.XXX -> 127.0.0.1:11965, done
Nov 16 00:22:19 bvnf relayd[21497]: relay gemini, session 186 (1 active), 0, YYY.YYY.YYY.YYY -> 127.0.0.1:11965, done
Nov 16 09:49:01 bvnf relayd[89123]: relay gemini, session 170 (1 active), 0, XXX.XXX.XXX.XXX -> 127.0.0.1:11965, done
```
(XXX.XXX.XXX.XXX is a cleverly obfuscated IP address).
first blog post: setting up vger(8) on OpenBSD
Now that the server has been running for a while, let's have a look at how many unique visitors it's getting.
Firstly, extract the bits we want from the log:
```
$ awk '/relay gemini/ { printf("%s %s %s %s\n", $1, $2, $3, $13) }' < /var/log/daemon | tee tmp
Nov 15 23:52:48 XXX.XXX.XXX.XXX
Nov 16 00:22:19 YYY.YYY.YYY.YYY
Nov 16 09:49:01 XXX.XXX.XXX.XXX
```
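To see why $13 is the IP address, we can let awk split a sample line on whitespace and count the fields. The line below is shaped like the relayd output above; 192.0.2.1 is a documentation placeholder standing in for the real (obfuscated) address:

```shell
# A hypothetical log line in the same shape as /var/log/daemon above.
line='Nov 15 23:52:48 bvnf relayd[89123]: relay gemini, session 170 (1 active), 0, 192.0.2.1 -> 127.0.0.1:11965, done'
# Print the field count and the 13th field.
printf '%s\n' "$line" | awk '{ print NF, $13 }'
# → 16 192.0.2.1
```

So the line splits into 16 fields, with the client address sitting at field 13.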
Now filter out lines with duplicate IP addresses:
```
$ sort -uk 4 < tmp | tee tmp.2
Nov 15 23:52:48 XXX.XXX.XXX.XXX
Nov 16 00:22:19 YYY.YYY.YYY.YYY
```
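A quick way to see what -u does when combined with a key: feed sort a few fabricated lines (placeholder addresses again) and watch each distinct field-4 key survive exactly once. Which duplicate is kept is implementation-defined in POSIX, though in practice it's typically the first:

```shell
# Three lines, two distinct IPs in field 4; sort -uk 4 keeps one line per IP.
printf '%s\n' \
  'Nov 15 23:52:48 192.0.2.1' \
  'Nov 16 00:22:19 198.51.100.7' \
  'Nov 16 09:49:01 192.0.2.1' \
  | sort -uk 4
```

Two lines come out, one per unique address.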
So we have a list of all the times an IP address first made a request to the server.
Now, we need some way to visualise these data; there are many options but for this post, let's use Python.
So that Python can read the dates as the correct data type, we need to make them look more approachable. We could convert them to Unix time, but let's go for something ISO 8601-ish.
https://armaanb.net/iso8601.html
There are a few ways to do this; let's be (mostly) portable and go for sed. Our task is to turn something like "Nov 15" into "2021-11-15".
```
$ sed 's/Oct/10/;s/Nov/11/;s/Dec/12/;' < tmp.2
11 15 23:52:48 XXX.XXX.XXX.XXX
11 16 00:22:19 YYY.YYY.YYY.YYY
```
It's a bit messy, and as written it won't work after December this year. But that's a known limitation, and it's ok, since I'm only doing this for fun, in November.
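If we wanted all twelve months handled (with the zero padding thrown in for free), one alternative — a sketch, not what this post actually uses — is an associative month table in awk, applied here to two hypothetical placeholder lines:

```shell
printf '%s\n' 'Nov 15 23:52:48 192.0.2.1' 'Feb 3 08:00:00 198.51.100.7' | awk '
BEGIN {
    # Build a name -> zero-padded number table for every month.
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", m, " ")
    for (i = 1; i <= 12; i++) mon[m[i]] = sprintf("%02d", i)
}
{ printf("2021-%s-%02d %s %s\n", mon[$1], $2, $3, $4) }'
# → 2021-11-15 23:52:48 192.0.2.1
# → 2021-02-03 08:00:00 198.51.100.7
```

The %02d also pads single-digit days, so the separate padding step below wouldn't be needed.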
Now add in the year and change some spaces for hyphens:
```
$ sed 's/Oct/10/;s/Nov/11/;s/Dec/12/;s/^\(..\) /2021-\1-/' < tmp.2
2021-11-15 23:52:48 XXX.XXX.XXX.XXX
2021-11-16 00:22:19 YYY.YYY.YYY.YYY
```
Nice! One last thing: days before the 10th of the month appear in the log with a single digit, which isn't the ISO way.
Pad those:
```
$ sed 's/Oct/10/;s/Nov/11/;s/Dec/12/;s/^\(..\) /2021-\1-/;' < tmp.2 \
    | sed 's/-\([0-9]\) /-0\1 /'
2021-11-15 23:52:48 XXX.XXX.XXX.XXX
2021-11-16 00:22:19 YYY.YYY.YYY.YYY
```
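The padding substitution only fires when a single digit sits between a hyphen and a space, so two-digit days pass through untouched. On a hypothetical early-in-the-month line (placeholder address) it does its job:

```shell
printf '2021-11-5 06:49:01 192.0.2.1\n' | sed 's/-\([0-9]\) /-0\1 /'
# → 2021-11-05 06:49:01 192.0.2.1
```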
Good! Note that we could move some of these calls around to make the whole pipeline a bit quicker:
```
$ awk '/relay gemini/ { printf("2021-%s-%s %s %s\n", $1, $2, $3, $13) }' < /var/log/daemon \
    | sort -uk 3 \
    | sed 's/Nov/11/;s/Oct/10/;s/Dec/12/;s/-\([0-9]\) /-0\1 /' \
    | sort > tmp.3
```
(Note that the IP address is now field 3, not 4, since the date has been squashed into a single field.)
The ISO format is particularly useful for sorting correctly.
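That's because zero-padded year-month-day order makes plain lexicographic sort agree with chronological order:

```shell
# October sorts before November even though "11" < "31" as day numbers.
printf '%s\n' 2021-11-02 2021-10-31 2021-11-16 | sort
# → 2021-10-31
# → 2021-11-02
# → 2021-11-16
```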
Now, to make a graph, we don't care about which specific IP addresses are connecting when, so get rid of them:
```
$ cut -d' ' -f 1-2 < tmp.3 | tee tmp.4
2021-11-15 23:52:48
2021-11-16 00:22:19
```
Now Python makes it easy:
```
import numpy as np
import matplotlib.pyplot as plt

with open("tmp.4", "r") as f:
    data = f.read()

strtimes = data.split('\n')[:-1]
times = sorted(np.datetime64(t) for t in strtimes)

plt.hist(times, bins=23)
plt.savefig("../media/004-unique-visitors.pdf")
```
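As a rough cross-check on the Python numbers without leaving the shell, the daily counts of first visits can be tallied from tmp.4 with cut and uniq -c (shown here on a few inline sample lines standing in for the file):

```shell
# Keep only the date column, then count occurrences of each day.
printf '%s\n' '2021-11-15 23:52:48' '2021-11-16 00:22:19' '2021-11-16 09:49:01' \
  | cut -d' ' -f1 | sort | uniq -c
# one first visit on the 15th, two on the 16th
```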
It's not beautiful, but I can see what I wanted to. There was a wave of new people around the 31st of October, just after I added a link to the server from my website, and then a bigger peak two days ago when I added a link to the blog on ew0k's antenna. Between those two events there's some background noise, probably from random crawlers and me on various devices.
commit adding the gemini link to https://bvnf.space/
--
written 2021-11-19