💾 Archived View for perplexing.space › 2020 › monitoring-silliness.gmi captured on 2021-12-03 at 14:04:38. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
2020-12-12
I enjoyed reading about "a gemini monitoring service"⁰ and while reading about the motivation for fewer dependencies than those required by Nagios I got to wondering how small you might be able to make such a service.
0: announcing a gemini monitoring service
Of course, once I had the idea I had to try it out and see. I've been interested in cross-platform compatibility so this seemed like a nice opportunity to dive into that as well. At this point I've been able to test things out on Debian, NetBSD, and OpenBSD in both bash and ksh.
The title of this post refers to my own quixotic attempts at shaving down dependencies and testing obscure platforms. I think Jon's project is an interesting one and if you're going to run a service yourself you should definitely use his. Not only is it more robust, but it is more polished with documentation and a real project page. This capsule post is more an exploration of a silly filesystem hack and intended for discussion more than use.
The full list of dependencies:

- bash OR ksh
- openssl OR gnutls-cli
And that should be about it! These more fully-featured shells are necessary for `set -o pipefail`, which aborts a pipeline when any command in it fails. I've also used `read` to split input on a delimiter, which is nice to have but could conceivably be handled another way.
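The two shell features in question can be sketched in a few lines. This is an illustrative snippet, not code from the script itself; the `line` variable is a made-up example:

```shell
#!/bin/bash
# `pipefail` makes a pipeline's exit status reflect any failing
# command, not just the last one. Plain POSIX sh lacks this option.
set -o pipefail

# `cat` succeeds, but `false` fails, so with pipefail the whole
# pipeline is treated as a failure.
if false | cat; then
    echo "pipeline succeeded"
else
    echo "pipeline failed"
fi

# `read` splitting on a delimiter: one config line becomes a
# hostname and a contact address.
line="example.com admin@example.com"
IFS=" " read -r host contact <<< "$line"
echo "host=$host contact=$contact"
```

Without `pipefail`, the `if` above would take the success branch, silently masking the error.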
It wouldn't be fair to compare the two programs because I decided, somewhat arbitrarily, not to implement much of the configurability present in the original service. I have been on something of a kick lately, reducing the number of knobs and dials in favor of sane defaults. As a part of this, I removed configurable timeouts and check frequency. I decided to let my implementation more closely reflect my use and administration of gemini services — best effort. As a result, the number of attempts to reach a server is fixed at 3, and the timeouts are defined by the operating system (on Linux this would be `tcp_syn_retries`). In practice this means exponential back-off totalling approximately 180 seconds.
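The retry policy amounts to a small loop. Here is a hypothetical sketch of the shape of it — `retry` and `flaky_check` are names I've made up, and the stub stands in for a real Gemini request (which would be something like `printf 'gemini://%s/\r\n' "$host" | openssl s_client -quiet -connect "$host:1965"`, with the connect timeout left to the OS):

```shell
#!/bin/bash
# Hypothetical retry helper: run a check up to 3 times, succeeding
# as soon as one attempt does. Attempt count is fixed, timeouts are
# whatever the kernel's TCP stack provides.
retry() {
    for attempt in 1 2 3; do
        "$@" && return 0
    done
    return 1
}

# Stub check that fails twice, then succeeds, to exercise the loop
# without touching the network.
tries=0
flaky_check() {
    tries=$((tries + 1))
    [ "$tries" -ge 3 ]
}

if retry flaky_check; then
    echo "host reachable after $tries attempts"
else
    echo "host unreachable"
fi
```

Because the per-attempt timeout comes from the OS rather than the script, the worst case (three attempts against an unresponsive host) is bounded by the kernel's SYN retry schedule, not by anything configured here.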
Each host is attempted up to three times and rechecked after an hour. I think this might actually be too often, but it's a single line to change it. The "configuration" then is a single file of hosts to monitor and contact addresses to notify when errors occur.
```
perplexing.space mail@perplexing.space
example.com admin@example.com
```
One feature I'm on the fence about and haven't implemented is notification of "recovery" — that is, a service having been offline and later returning a successful response. This probably reflects my personality, but I receive those sorts of emails elsewhere and derive no value from them. If something is offline I either already know about it or I'm going to look into it, and another email is just noise.
Having said all of that, I am most pleased to have realized the one marginally clever idea I wanted to try. Instead of a database, the configuration file is read for a hostname and contact email, from which a temporary file is created: named after the hostname, with its contents set to the contact address. The check frequency is tracked through the `mtime` on the file. Initially each host's file is dated in the past to trigger a check; after a check is made, the mtime is set one hour into the future. A while loop polls each file, checking whether the current time has passed its `mtime` before performing the next check.
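The filesystem-as-database trick can be sketched like so. This is my own illustrative reconstruction, not the script itself; the function names are invented, and the `stat -c`/`touch -d` flags are the GNU spellings (BSD uses `stat -f %m` and different `touch` date syntax):

```shell
#!/bin/bash
# One file per monitored host: the filename is the host, the file's
# contents are the contact address, and the file's mtime is the time
# of the next scheduled check.
statedir=$(mktemp -d)

schedule() {  # schedule <host> <contact>: state file dated in the past
    echo "$2" > "$statedir/$1"
    touch -d '1 hour ago' "$statedir/$1"
}

due() {       # due <file>: has the file's mtime passed?
    [ "$(stat -c %Y "$1")" -le "$(date +%s)" ]
}

schedule example.com admin@example.com

for f in "$statedir"/*; do
    if due "$f"; then
        host=$(basename "$f")
        contact=$(cat "$f")
        echo "checking $host (notify $contact on failure)"
        # Push the next run an hour into the future.
        touch -d '+1 hour' "$f"
    fi
done
```

In the real script this `for` would sit inside the polling `while` loop, so each host file comes due again an hour after its last check.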
The full program is about 100 lines of shell script, and I think this is starting to push the limits of what shell is suited for. I've maintained much larger shell scripts in the past (10,000+ lines!) and there is a point, reached very early on, where continued development in the shell comes at the cost of maintainability. While I like the approach I've achieved here, I don't think it would translate well to a bigger program or another language.