💾 Archived View for thrig.me › blog › 2024 › 11 › 14 › multiple-monitorings.gmi captured on 2024-12-17 at 10:41:17. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
Someone asked what one remote site should be pinged to verify that your host has internet access. One response to this is, why only one site? And another is, why only ping?
Pinging only one site may yield a false positive if that one remote site is down. Some may argue that 9.9.9.9 or whatever hardly ever goes down, except when their firewall hates you, or your ISP's firewall hates you, or your own badhost blacklist firewall thing hates the remote site you were monitoring. Maybe that corporation's IP address got added to a blacklist table, or maybe they put you into their blacklist table, or maybe a third-party security service they lean on now flags your address as something to blacklist. Or that highly reliable remote site could have a major outage, and because you have tightly coupled their availability to the question of whether you have internet access, your internet access is also reported as down, much as a sailor is dragged down for the crime of being too close to a sinking ship.

Pinging two (or more) remote systems gives more information, and different debugging options: is nothing reported down? Only one of N? All of N? The value for N need not be very large, on the assumption that multiple different redundant systems are unlikely to all fail at the same time, and that there are not many such systems available to ping. With only one remote system down it could be a problem with that site, or the route to that site; with several (or all!) remote systems down it is more likely to be a problem with your ISP, the uplink, or your local firewall. This may save debugging time by better directing the on-call to where the problem actually is, and "everything is down!" is a better story than "uh, I don't know" when someone asks what is wrong.†
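The one-of-N versus all-of-N reasoning above can be sketched in a few lines. This is a minimal illustration, not a monitoring system: the hosts, messages, and ping flags are assumptions (the -c/-W flags shown are the Linux iputils ones; other pings differ), and only the classification logic matters.

```python
import subprocess

def reachable(host, timeout=2):
    """One ICMP echo via the system ping(8). -c 1 sends a single
    probe; -W is the per-reply timeout on Linux iputils (flags vary
    on other systems, so adjust for your platform)."""
    return subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout), host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ).returncode == 0

def diagnose(results):
    """Map {host: up?} to a rough diagnosis: one of N down points at
    that site or its route; several or all of N down points closer to
    home, at your ISP, uplink, or local firewall."""
    down = sorted(h for h, up in results.items() if not up)
    if not down:
        return "all up"
    if len(down) == len(results):
        return "everything is down: suspect your uplink or firewall"
    if len(down) == 1:
        return f"only {down[0]} down: suspect that site or its route"
    return f"{len(down)} of {len(results)} down: suspect your uplink or firewall"

# Example (needs a ping binary and network; hosts are a hypothetical N=3):
#   hosts = ["9.9.9.9", "1.1.1.1", "8.8.8.8"]
#   print(diagnose({h: reachable(h) for h in hosts}))
```

The point of keeping diagnose separate from reachable is that the same classification applies whatever the probe is, ICMP or otherwise.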
Using only ping tells you merely whether ICMP echo requests and replies work. Maybe a firewall now hates pings, but everything else is okay? For variety one should also test DNS requests (and their latency) and do deeper checks on remote sites: if your users use Github then a check that the Github front page is up and responding quickly might be good to have, and similar for various other services. This way, you may already know of the problem if your monitoring spots the issue first. Or, if users report a problem you can take a peek at the monitoring to see if it confirms what the users are saying: yep, Github is down (or slow), and if not then it might be a partial outage where the front page is okay, but not some sub page users are having trouble with. If you already know that Github is dead, then you can switch to XKCD 303 mode. If the front page checks look good, then you can go directly to figuring out what issue the users are having.
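The two extra checks mentioned, DNS latency and a front-page fetch, can be sketched with the Python standard library. The URL and the three-second limit below are illustrative assumptions, not anything from the post; real monitoring would also record the latencies somewhere rather than just pass/fail.

```python
import socket
import time
import urllib.request

def dns_latency(name):
    """Time one name lookup (getaddrinfo) in seconds; raises on
    failure, which a caller should treat as "DNS is broken"."""
    start = time.monotonic()
    socket.getaddrinfo(name, None)
    return time.monotonic() - start

def front_page_ok(url, limit=3.0):
    """True when the page answers 200 within `limit` seconds.
    Any network error, timeout, or slow response counts as a fail."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=limit) as resp:
            resp.read(1)  # force at least one body byte over the wire
            return resp.status == 200 and time.monotonic() - start < limit
    except OSError:  # covers URLError, timeouts, refused connections
        return False

# Example (needs network access):
#   print("github.com DNS took", dns_latency("github.com"), "seconds")
#   print("front page ok:", front_page_ok("https://github.com/"))
```

A partial outage, front page fine but some sub page broken, is exactly what a single front-page check will miss, which is an argument for adding checks for whatever sub pages your users actually depend on.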
† Messengers are, however, prone to be thrown into pits or whatever punishment is popular these days, even if they have a good story to tell about the bad news.
To Athens and Sparta Xerxes sent no heralds to demand earth, and this he did for the following reason. When Darius had previously sent men with this same purpose, those who made the request were cast at the one city into the Pit and at the other into a well, and bidden to obtain their earth and water for the king from these locations.
— Hdt. 7.133.1