If a system is unknown then it may need to be measured--what does normal look like? Computers can be pretty good at this. Maybe start with ad-hoc data collection, let's say the latency of ping to the local router.
$ ping -c 3600 192.168.42.1 | tee pl
...
3600 packets transmitted, 3600 packets received, 0.0% packet loss
round-trip min/avg/max/std-dev = 0.856/1.657/214.307/4.073 ms
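Each reply line also carries a per-packet time, which is what gets dug out later on; the exact fields vary a bit by platform, but a typical line looks something like

64 bytes from 192.168.42.1: icmp_seq=0 ttl=255 time=1.123 ms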
Measurement should be (but often is not) done before updating hardware or software; often one will simply update to the next version, and then vague reports may trickle in of things being slower. Maybe there were relevant changes in the network stack, or maybe something else broke. With before and after measurements you at least have something to compare, and may be spared the trouble of downgrading the system to the previous version--a downgrade the update may have ruled out anyway by changing the firmware on you.
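A before and after run need not be fancy; something like the following (the file names here are made up) keeps the two summaries around to compare:

$ ping -c 3600 192.168.42.1 > ping-before
$ # ...the upgrade happens...
$ ping -c 3600 192.168.42.1 > ping-after
$ tail -2 ping-before ping-after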
Not much in the way of statistics beyond min/avg/max/std-dev is necessary here, though you may want a more detailed plot: there could be some weird ledge of latency that does not show up in the overview. Chasing down the cause could be difficult, but evidence in hand--N percent of requests now take unexpectedly long--while not much, is much better than vague reports of things being slow, especially if there is a clear difference from a previous version of the software.
$ perl -nle 'print $1 if /time=(\S+)/' pl | r-fu summary
elements 3600 mean 1.657e+00 range 2.135e+02
min 8.560e-01 max 2.143e+02 sd 4.074e+00
      0%      25%      50%      75%     100%
  0.8560   1.1050   1.1255   1.1610 214.3070
$ perl -nle 'print $1 if /time=(\S+)/' pl | r-fu histogram - histo.png
$ perl -nle 'print $1 if /time=(\S+)/' pl | r-fu boxplot - boxplot.png
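Those r-fu calls are just leaning on R; without the wrapper, plain R can produce much the same numbers and plots--a rough sketch, assuming R is installed and with the times stashed in a file first:

$ perl -nle 'print $1 if /time=(\S+)/' pl > times
$ Rscript -e 'x <- scan("times"); print(summary(x)); print(sd(x))'
$ Rscript -e 'png("histo.png"); hist(scan("times")); dev.off()'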
Or the behavior could be completely normal--can't know without measuring. This ping test looks pretty normal, though, and I now have something to compare with if something does go bad, or when a new router is purchased, etc.
Formal data collection usually falls under the umbrella of "monitoring" or "metrics", where various systems automatically run ping and other such tools over time, store the results somewhere, and offer the ability to graph them. The metrics could be of anything--temperature probes, DNS requests, credit card request latency, whatever.
gemini://perso.pw/blog//articles/rrdtool-light-monitoring.gmi
This can get very complicated, though just RRDtool can be a good place to start. Some monitoring tools only check whether something is up (that a web server responds with a particular code, or that the page has particular content), so latency checks may need to be added. Also note that latency depends on everything the service uses: the system (are there CPU spikes?), network effects (is the network saturated and thus dropping packets?), DNS (is the usual server down, so requests take longer?), all as complicated by any caching along the way. Monitoring may amusingly show everything as "okay" because a dashboard has cached a good result while everything is actually on fire, including updates to the dashboard. Maybe the dashboard should include the time of the last update or some other indication of staleness?
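A minimal RRDtool sketch might look something like the following--the file names, the step, and the retention are made up, and the perl bit assumes ping output shaped like the transcript above:

$ rrdtool create ping.rrd --step 60 DS:latency:GAUGE:120:0:U RRA:AVERAGE:0.5:1:1440
$ # from cron or a loop, feed in the current average
$ avg=$(ping -c 5 192.168.42.1 | perl -ne 'print +(split "/")[4] if m{min/avg/max}')
$ rrdtool update ping.rrd N:$avg
$ # and graph on demand
$ rrdtool graph latency.png DEF:ms=ping.rrd:latency:AVERAGE LINE1:ms#0000FF:latency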
Also you probably want the system clock to be as right as possible, as monitoring tends to depend on good timekeeping. It's amazing how often NTP isn't set up right, or someone bought a cheap motherboard with a broken clock chip. Years ago there was a credit card latency check that was tied to US/Pacific, and therefore twice a year an alarm would fire about credit card latency being bad: a daylight saving time wobble had just happened, and the latency was suddenly ±3600 seconds out of bounds. Ideally your code will be better written.
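Checking the clock is cheap; which command to use depends on which NTP daemon is in play (the usual suspects, not a complete list):

$ ntpctl -s status        # OpenBSD ntpd
$ ntpq -p                 # the classic ntpd
$ chronyc tracking        # chrony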