Low uptime is good. To be more precise, there is a Goldilocks zone. One probably does not want to be rebooting every few seconds (remember when changing the network on Windows required a mandatory reboot for each distinct change? Cringeware), but too frequent reboots are just as bad as rebooting almost never. Some things you may want to know include:

* whether local state has accumulated that (maybe?) should not be there
* whether the system will actually come back up after a reboot
* whether the redundancy actually works when something breaks
The first point may be filed under "containerization", and frequent reboots will more quickly reveal any local state that (maybe?) should not be there. If the system is left to rot, who knows what will end up where it (maybe?) should not.
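What counts as state that should not be there is site-specific, but a minimal sketch of the idea, assuming a hypothetical manifest file and only walking /etc, might compare the filesystem against hashes saved on a previous pass. mtree(8) or a configuration management tool would be the grown-up version.

```python
#!/usr/bin/env python3
# Sketch: compare /etc against a manifest of SHA-256 hashes saved
# on the previous pass, printing files that are new, gone, or
# changed. The manifest path and the directory walked are
# assumptions; adjust to taste.
import hashlib
import json
import os

MANIFEST = "/var/db/etc-manifest.json"  # hypothetical location

def walk_hashes(root="/etc"):
    hashes = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as fh:
                    hashes[path] = hashlib.sha256(fh.read()).hexdigest()
            except OSError:
                pass  # sockets, unreadable files, races
    return hashes

current = walk_hashes()
if os.path.exists(MANIFEST):
    with open(MANIFEST) as fh:
        previous = json.load(fh)
    for path in sorted(set(current) | set(previous)):
        if path not in previous:
            print("new:", path)
        elif path not in current:
            print("gone:", path)
        elif current[path] != previous[path]:
            print("changed:", path)
# save the current state for the next pass
with open(MANIFEST, "w") as fh:
    json.dump(current, fh)
```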
The second point usually involves hardware failure (the system no longer boots) or a software issue (the software does not come up to a correct state). Either case is, perhaps, better to learn about during a scheduled outage than at 03:00 when the on-call has already had too little sleep and too much work. And, yeah, the plutocrats somehow managed to get you classified as overtime exempt. Have fun, and enjoy the wage theft!
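A scheduled reboot pairs well with a smoke test run right after the system comes back, so a service that fails to start pages someone during the window rather than at 03:00. A minimal sketch; the host and port list is an assumption, to be filled in from whatever the site actually runs:

```python
#!/usr/bin/env python3
# Sketch: post-reboot smoke test that confirms the expected
# services are accepting connections. The CHECKS list is a
# hypothetical example, not anything canonical.
import socket
import sys

CHECKS = [
    ("127.0.0.1", 22),   # sshd
    ("127.0.0.1", 25),   # smtpd
    ("127.0.0.1", 443),  # httpd
]

failed = 0
for host, port in CHECKS:
    try:
        with socket.create_connection((host, port), timeout=5):
            print("ok:", host, port)
    except OSError as err:
        print("FAIL:", host, port, err)
        failed += 1
sys.exit(1 if failed else 0)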
The final point delves into "chaos monkey" style testing, where random systems (or entire datacenters, if you're at that scale) are broken in random ways. This can be bad for the customer, as there might be lost orders or whatnot, but how else would you learn that the redundant routes are not actually redundant? Postmortem action items from one such discovery included: fixing the routes that were not actually redundant, adding monitoring to confirm that the redundant routes stay redundant, better documentation, etc. One can theoretically detect that there is a problem, but it is often quicker to see whether the ship actually floats, especially if the problem is one that nobody has thought to theorize about (this ship is unsinkable, but, whoops, icebergs!) or nobody paid close enough attention to the fact that all the payment processor connections were going to a limited set of remote IP addresses. How often do you pay really close attention, over time, to netstat on a cluster of systems?
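Probably nobody does, which is an argument for automating the attention. Below is a minimal sketch that flags remote ports whose established peers all sit in one /24. Parsing netstat -an is an assumption (the column layout varies by OS; ss(8) or /proc/net/tcp would also work), and a single snapshot of one host proves little: in practice you would aggregate over time and across the cluster.

```python
#!/usr/bin/env python3
# Sketch: scan established TCP sessions and warn when every peer
# for a given remote port sits in the same /24, which is how "all
# the payment processor connections go to one place" hides in
# plain sight. Assumes netstat -an output with the remote address
# in the fifth column and the state in the last.
import ipaddress
import subprocess
from collections import defaultdict

out = subprocess.run(["netstat", "-an"], capture_output=True,
                     text=True, check=True).stdout
peers = defaultdict(set)  # remote port -> set of /24 networks
for line in out.splitlines():
    fields = line.split()
    if len(fields) < 6 or fields[-1] != "ESTABLISHED":
        continue
    # remote address, e.g. "203.0.113.7.443" (BSD) or
    # "203.0.113.7:443" (Linux); both put the port last
    remote = fields[4].replace(":", ".")
    addr, _, port = remote.rpartition(".")
    try:
        net = ipaddress.ip_network(addr + "/24", strict=False)
    except ValueError:
        continue  # IPv6 or junk; skipped in this sketch
    peers[port].add(net)
for port, nets in sorted(peers.items()):
    if len(nets) == 1:
        print("port", port, "only ever talks to", next(iter(nets)))
```

A port with only one connection will trivially trip this check, so the output wants aggregation and some human judgment, but that is still cheaper than discovering the single point of failure during an outage.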
On the "reboot almost never" front a bank went like 10 years between outages. This meant they were running around like headless chickens when the system did fail. Maybe the guy who knew how to "start me up" has retired, and now cannot be found? Or did you forget that you probably need to contact and inform customers? Etc.
On the "reboot too often" front I once watched a Windows system reboot itself over and over during a too many hours layover in Gatwick, or maybe it was Heathrow. Life has been much improved by rejecting air travel (Windows, alas, still sometimes plagues me).