Comment by jmking on 28/01/2025 at 07:20 UTC

1 upvotes, 0 direct replies (showing 0)

View submission: Massive Failure on the Product

I know you're asking because you're probably just curious (I was too).

But I'm piggybacking off your comment to say that, for this situation, "what was the issue?" is the wrong question. I'd ask "How long did it take to discover the issue?" and then "Once it was discovered, how long did it take to restore service?".
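To make those two questions concrete, here's a minimal sketch of how you might compute both numbers from an incident timeline. The function name and the timestamps are made up for illustration; nothing here comes from the actual incident.

```python
from datetime import datetime, timedelta


def incident_durations(
    started: datetime, detected: datetime, restored: datetime
) -> tuple[timedelta, timedelta]:
    """Return (time_to_detect, time_to_restore) for a single incident."""
    if not (started <= detected <= restored):
        raise ValueError("expected started <= detected <= restored")
    return detected - started, restored - detected


# Hypothetical timeline: bad deploy at 09:00, noticed at 09:45, fixed at 10:05.
time_to_detect, time_to_restore = incident_durations(
    started=datetime(2025, 1, 1, 9, 0),
    detected=datetime(2025, 1, 1, 9, 45),
    restored=datetime(2025, 1, 1, 10, 5),
)
print(f"time to detect:  {time_to_detect}")   # 0:45:00
print(f"time to restore: {time_to_restore}")  # 0:20:00
```

Tracking these two numbers per incident tells you whether to spend your effort on detection or on recovery.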

I personally probably wouldn't have thought to test this situation, especially if I didn't know about the whole "past guest users that aren't users" thing. The point is shit happens and you should expect that it will. Ideally you do what you can to catch as many issues as possible before they hit prod, but even 100% test coverage wouldn't have caught this. The system was technically working correctly: it was preventing duplicate signups.

What's often overlooked, however, is monitoring, telemetry, observability, alarms, etc. to proactively detect problems before you hear about them from your users. After that, time to remediation is the next most important thing. How fast is the rollback process? Do you even have the ability to roll back a bad deploy?
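As a rough illustration of the kind of alarm I mean, here's a sketch that watches the share of rejected signups over a sliding window and fires when it crosses a threshold. The class name, window size, and threshold are all assumptions for the example. In practice you'd more likely express this as an alert rule in whatever monitoring stack you already run, but the idea is the same: the code was "working", and a spike in rejections is still a signal worth paging on.

```python
from collections import deque


class RejectionRateAlarm:
    """Fires when the share of rejected signups in a sliding window is too high.

    All defaults are illustrative, not recommendations.
    """

    def __init__(self, window: int = 100, threshold: float = 0.2, min_samples: int = 20):
        self.outcomes = deque(maxlen=window)  # True means the signup was rejected
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, rejected: bool) -> bool:
        """Record one signup attempt and return True if the alarm should fire."""
        self.outcomes.append(rejected)
        if len(self.outcomes) < self.min_samples:
            return False  # not enough data yet to judge
        rate = sum(self.outcomes) / len(self.outcomes)
        return rate > self.threshold


# Hypothetical usage: call record() wherever signup attempts are handled.
alarm = RejectionRateAlarm(window=50, threshold=0.3, min_samples=10)
for rejected in [False] * 8 + [True] * 6:
    if alarm.record(rejected):
        print("signup rejection rate is unusually high, go look")
        break
```

The `min_samples` guard keeps a couple of early rejections from paging anyone before there's enough traffic to mean anything.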

The last place you want to find yourself is having to rush a fix because all you can really do is roll forward.
