💾 Archived View for senders.io › gemlog › 2021-04-15-capsule-stats.gmi captured on 2023-04-19 at 22:36:37. Gemini links have been rewritten to link to archived content
I was curious what the general traffic of my capsule was. For my webserver I had tinkered with the idea of setting up some sort of ELK (Elasticsearch, Logstash, Kibana) stack to get monitoring and metrics on the actual server. But for Gemini, where I actually HAVE traffic, I decided to just have a live look at it.
I am running my own server, whose access log uses the following syntax:
```
2021-04-15T02:41:04,899Z	IN	/67.86.nnn.nnn:33378	gemini://senders.io/feed/atom.xml	33
2021-04-15T02:41:04,907Z	OUT	20	application/xml; lang=en;	3452
2021-04-15T02:41:04,950Z	IN	/67.86.nnn.nnn:33380	gemini://senders.io/gemlog/feed/atom.xml	40
2021-04-15T02:41:04,951Z	OUT	20	application/xml; lang=en;	3467
```
These are tab-separated lines broken down into two categories: IN and OUT.
IN logs are requests:
```
[timestamp] [tab] IN [tab] [IP] [tab] [URI] [tab] [SIZE]
```
OUT logs are responses:
```
[timestamp] [tab] OUT [tab] [STATUS] [tab] [META] [tab] [SIZE]
```
Since the lines are tab-structured, it is pretty easy to calculate some basic stats on incoming and outgoing messages using the wonderful world of bash scripting.
```
#!/usr/bin/env bash
LOGFILE=$1
OUTFILE=$2
if [ $# -lt 2 ]; then
	echo "Usage: ./calc.sh logs/access.log gemini/stats.gmi"
	exit 1
fi

# Stats for today
TODAY=$(date -Id)
echo -e "Stats for day:\t$TODAY" > $OUTFILE
echo -e " Total Reqs:\t"$(grep 'OUT' ${LOGFILE} | grep "${TODAY}" | wc -l) >> $OUTFILE
echo -e " Gemlog Reads:\t"$(grep 'IN' ${LOGFILE} | grep "${TODAY}" | grep "gemlog" | grep "gmi" | wc -l) >> $OUTFILE
echo "Top 5 Gemlogs" >> $OUTFILE
echo "--------------" >> $OUTFILE
grep "IN" ${LOGFILE} | grep "${TODAY}" | cut -f4 | grep "gemlog" | grep ".gmi" | sort | uniq -c | sort -rn | head -n5 >> $OUTFILE

# Stats total
EARLIEST=$(head -n1 $LOGFILE | cut -f1)
echo "" >> $OUTFILE
echo -e " Stats since:\t$EARLIEST" >> $OUTFILE
echo -e " Total Reqs:\t"$(grep 'OUT' ${LOGFILE} | wc -l) >> $OUTFILE
echo -e " Gemlog Reads:\t"$(grep 'IN' ${LOGFILE} | grep "gemlog" | grep "gmi" | wc -l) >> $OUTFILE
echo "Top 5 Gemlogs" >> $OUTFILE
echo "--------------" >> $OUTFILE
grep "IN" ${LOGFILE} | cut -f4 | grep "gemlog" | grep ".gmi" | sort | uniq -c | sort -rn | head -n5 >> $OUTFILE

# print generating timestamp
echo -e "\n// generated $(date -u -Is)" >> $OUTFILE
```
This bash script is basically a combination of grep, cut, sort, and uniq. I know I could optimize it much further, but I wrote it in a way where I filter down in steps, to aid my understanding of what I am filtering and why.
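For the curious, here is a rough sketch of what that optimization could look like: one awk pass replacing the repeated grep | wc pipelines for the totals. The field numbers follow the tab layout above; the exact regexes and the default file name are my assumptions, not what my server runs.

```shell
#!/usr/bin/env bash
# Hypothetical one-pass version of the "stats total" counts: a single awk
# invocation instead of several grep | wc pipelines.
# Field 2 is IN/OUT; field 4 of an IN line is the URI.
LOGFILE=${1:-access.log}
awk -F'\t' '
  $2 == "OUT" { reqs++ }                                   # every response is one request served
  $2 == "IN" && $4 ~ /gemlog/ && $4 ~ /\.gmi/ { reads++ }  # gemlog page reads
  END { printf "Total Reqs:\t%d\nGemlog Reads:\t%d\n", reqs, reads }
' "$LOGFILE"
```

The Top 5 list would still want sort | uniq -c, since awk's associative arrays don't sort by value on their own.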
I also wrote the script so that I can change the input and output files, but that is a relic of this being something I ran locally rather than against a fixed location on my server.
I decided to break the information into two things: total requests, where I basically count all log lines; and "gemlog reads", since the homepage and atom.xml are things I don't really care about. It's pretty good to see what percent of the requests are page reads. I also decided to show "from the beginning of the file" stats as well (originally I was just calculating the stats for the day).
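That percentage is a one-liner if you ever want it in the output; the 155 and 301 here are just the single-day numbers from the sample stats:

```shell
# What percent of requests were gemlog reads?
# 155 reads out of 301 requests, from one day's sample stats.
awk -v reads=155 -v reqs=301 'BEGIN { printf "%.1f%%\n", 100 * reads / reqs }'
# prints 51.5%
```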
```
Stats for day:	2021-04-14
 Total Reqs:	301
 Gemlog Reads:	155
Top 5 Gemlogs
--------------
     53 gemini://senders.io/gemlog/2021-04-13-digital-hygiene-one-week-in.gmi
     14 gemini://senders.io/gemlog/2021-04-09-humans-first-words.gmi
     13 gemini://senders.io/gemlog/2021-04-12-girl-2020-land-before-time.gmi
      7 gemini://senders.io/gemlog/2021-04-10-floc.gmi
      7 gemini://senders.io/gemlog/2021-04-03-digital-hygiene.gmi

 Stats since:	2021-04-07T00:53:38,811Z
 Total Reqs:	3500
 Gemlog Reads:	1852
Top 5 Gemlogs
--------------
    239 gemini://senders.io/gemlog/2021-04-10-floc.gmi
    207 gemini://senders.io/gemlog/2021-04-13-digital-hygiene-one-week-in.gmi
    186 gemini://senders.io/gemlog/2021-04-07-devlog-4-deployed-in-production.gmi
    173 gemini://senders.io/gemlog/2021-04-09-humans-first-words.gmi
    138 gemini://senders.io/gemlog/2021-04-12-girl-2020-land-before-time.gmi

// generated 2021-04-15T02:56:01+00:00
```
I run this via a cronjob, because I don't have any CGI support on my server to generate these stats on demand. The calc script runs every minute and writes the output to a file on my capsule.
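The crontab entry for that looks something like this; the paths here are placeholders, not my real layout:

```
# m h dom mon dow  command -- runs every minute (paths are hypothetical)
* * * * * /home/gemini/bin/calc.sh /home/gemini/logs/access.log /var/gemini/senders.io/stats.gmi
```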
I found this a fun exercise to see how well a particular gemlog "was doing" - were people clicking into it? It's also interesting to see some traffic numbers on days (like today) where I haven't posted.
Upon writing this calc process, I realized I probably should do something about the fact that I am logging IPs onto a server outside of the EU, while I know some of you are IN the EU. I have a retention setup via cron that wipes my logs every month, which, if I recall correctly, should be compliant. But I might just remove the actual IP from the log and add a UUID to the IN and OUT lines, so I can still properly match them up. I really don't need your IP, nor would I want MY IP sitting on some random server somewhere (though since I am not subject to GDPR I probably have no recourse to ask you to remove it). I know some of you run larger sites/capsules - what do you do about access logs? If this were HTTP it would probably make sense to keep the IP logs to monitor for potentially malicious traffic and ban offenders, etc. So I'm just curious, so as to not reinvent the wheel here...
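As a sketch of that idea - and to be clear, this is not something my server does today, and it swaps the UUID for a hash - a post-processing pass could replace the IP:port field on IN lines with a truncated SHA-256 digest, so lines from the same client can still be matched without the address ever sitting in the stored log:

```shell
#!/usr/bin/env bash
# Hypothetical anonymization pass (not what my server currently does):
# replace the IP:port field on IN lines with a truncated SHA-256 digest.
# The same client still maps to the same token, but no address is stored.
awk -F'\t' 'BEGIN { OFS = FS }
  $2 == "IN" {
    cmd = "printf %s \"" $3 "\" | sha256sum"
    cmd | getline digest
    close(cmd)
    $3 = substr(digest, 1, 12)   # 12 hex chars is plenty for matching
  }
  { print }
' access.log > access.anon.log
```

One caveat: a plain hash of an IPv4 address is brute-forceable, so a keyed hash (or the per-request UUID written by the server itself) would be the stronger option.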
I thought it was neat to take a look at the general traffic on my server and share the script :)