I used to work on a web site that got quite a bit of traffic¹, so it was load balanced across a number of servers in a few different data centres. One particular box, prod6, would occasionally become completely unresponsive: it stopped serving pages, and the load average and CPU usage rocketed. A bunch of monitoring tools would fire alerts, someone would bounce the service or the whole box, and everything would go back to normal.
Several people (including me) looked for the cause, but nothing out of the ordinary seemed to be happening. The weird part was that it only affected one server, and always at about the same time on Friday. What was different about that box? And about Friday? The answer, of course, was "nothing". If those symptoms had been significant, they would have led someone to the problem quickly, and it would have been fixed, and I would have forgotten about it by now².
The site had articles with topics. The topics had a hierarchy. The users could use the topic tree to navigate. But the tree data was derived from the article data, and it was possible (and sometimes necessary) to generate the tree afresh. That was a big job, so it was run offline and published to prod. But someone eventually figured out that the tree generation was getting triggered on prod6. There was a publicly available URL that triggered it (oops), and although we didn't have any links to that, there was a link to it on some Swedish web site (WTF?). The Google bot had of course found the link and tried to follow it. It was unsuccessful, so it retried every Friday.
Lessons learned:
a) A GET request should fetch something, not trigger a rebuild of anything, as it did in this case (see the sketch after this list).
b) The tree build was never needed in prod, but the person responsible thought "it might be".
c) Any URL that you don't want people to know about will eventually end up linked from a web site, possibly in Sweden.
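For illustration, here is a minimal sketch of lesson (a) in Python. The post doesn't say what the original stack was, and every name here (the path, the handler, the rebuild function) is hypothetical: the point is that the expensive job is only reachable via POST, so a crawler following a leaked link with GET gets a harmless 404 instead of kicking off the rebuild.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def rebuild_topic_tree():
    # Stand-in for the expensive tree regeneration job (hypothetical).
    pass

class TopicHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # GET must be safe: crawlers follow links with GET, so a GET
        # handler should never have side effects. Unknown paths get a
        # 404, which is what finally made the Google bot go away.
        # (Real page serving elided from this sketch.)
        self.send_error(404)

    def do_POST(self):
        # The mutating operation lives behind POST (hypothetical path),
        # so a stray link on some web site can't trigger it.
        if self.path == "/admin/rebuild-tree":
            rebuild_topic_tree()
            self.send_response(202)  # accepted
            self.end_headers()
        else:
            self.send_error(404)

if __name__ == "__main__":
    HTTPServer(("", 8000), TopicHandler).serve_forever()
```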
We removed the dodgy URL, the Google bot got a 404 and went away, and the problem stopped.
1 - I did some stats on the request logs. We had considerably more traffic from the various monitoring systems than we did from actual users. But we were paranoid that users' requests might fail, and they were paying for the service, hence all the monitoring.
2 - I keep notes at work so I can say what I've been doing. If I look back more than a couple of days, I remember so little about the stream of stuff to fix that the notes could have been written by someone else. My working life involves a few big memorable things, but metric tonnes of trivial fixes and tweaks that are lost in the mists of time. Sad, really.