58 upvotes, 7 direct replies (showing 7)
View submission: Massive Failure on the Product
What was the issue?
Comment by According-Ad1997 at 27/01/2025 at 01:10 UTC
90 upvotes, 3 direct replies
It seems they stored guest users and actual permanent users in the same table, and the table had a unique constraint on email. When returning guest users tried to sign up for an account, the database probably threw a unique constraint violation and rejected the sign-up because the email was already taken.
All in all, this is a bad thing to happen on rollout but not the worst, especially if the product is good. People will come back. It should be easy to fix if you can identify the guest users.
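To make the failure mode concrete, here is a minimal sketch using Python's built-in `sqlite3`. It is only an illustration of the theory above, not the product's actual schema: the `users` table, the `is_guest` and `password_hash` columns, and the `naive_sign_up` / `sign_up` helpers are all hypothetical names made up for the example.

```python
import sqlite3


def make_db():
    # Hypothetical schema: guests and permanent users share one table,
    # and email carries a UNIQUE constraint.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users ("
        " id INTEGER PRIMARY KEY,"
        " email TEXT NOT NULL UNIQUE,"
        " is_guest INTEGER NOT NULL DEFAULT 0,"
        " password_hash TEXT)"
    )
    return conn


def naive_sign_up(conn, email, password_hash):
    # Plain INSERT: raises sqlite3.IntegrityError if any row,
    # including a guest row, already holds this email.
    conn.execute(
        "INSERT INTO users (email, is_guest, password_hash) VALUES (?, 0, ?)",
        (email, password_hash),
    )


def sign_up(conn, email, password_hash):
    # One possible fix: promote an existing guest row instead of inserting.
    cur = conn.execute(
        "UPDATE users SET is_guest = 0, password_hash = ?"
        " WHERE email = ? AND is_guest = 1",
        (password_hash, email),
    )
    if cur.rowcount == 1:
        return "upgraded"
    # No guest row: a normal INSERT. A genuine duplicate (an existing
    # permanent user) still fails with IntegrityError, as it should.
    conn.execute(
        "INSERT INTO users (email, is_guest, password_hash) VALUES (?, 0, ?)",
        (email, password_hash),
    )
    return "created"
```

With a guest row already present for an email, `naive_sign_up` fails on the unique constraint, while `sign_up` turns that same row into a permanent account. Whether promoting guests in place is the right fix depends on details the thread doesn't give, such as whether the email should be verified first.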
Comment by canadian_webdev at 26/01/2025 at 23:33 UTC
62 upvotes, 2 direct replies
The backend was never built.
Comment by TLJGame at 27/01/2025 at 00:48 UTC
12 upvotes, 1 direct replies
https://www.reddit.com/r/webdev/s/eTQhPTsKrh
Found the issue
Comment by halfxdeveloper at 26/01/2025 at 23:15 UTC
30 upvotes, 0 direct replies
They didn’t load test.
Comment by ihopnavajo at 27/01/2025 at 00:21 UTC
6 upvotes, 0 direct replies
The limit DID exist
Comment by jmking at 28/01/2025 at 07:20 UTC
1 upvotes, 0 direct replies
I know you're asking because you're probably just curious (I was too).
But I'm piggybacking off your comment to say that, for this situation, "What was the issue?" is the wrong question. I'd ask "How long did it take to discover the issue?" and then "How long did it take to restore service once the issue was discovered?".
I personally probably wouldn't have thought to test this situation - especially if I didn't know about the past guest users who aren't really users. The point is shit happens and you should expect that it will. Ideally you do what you can to catch as many issues as possible before they hit prod, but even 100% test coverage wouldn't have caught this. The system was technically working correctly: it was preventing duplicate signups.
What's often overlooked, however, is monitoring, telemetry, observability, alarms, etc., which let you detect problems proactively, before you hear about them from your users. After that, time to remediation is the next most important thing. How fast is the rollback process? Do you even have the ability to roll back a bad deploy?
The last place you want to find yourself is having to rush a fix because all you can really do is roll forward.
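As a rough idea of what the "alarms" part could look like, here is a small sketch of a sliding-window failure-rate check for sign-ups. It's an assumption-heavy toy, not any particular monitoring product's API: the `SignupFailureAlarm` class, its parameters, and the default thresholds are all made up for illustration, and in practice you'd more likely configure this in whatever metrics/alerting stack you already run.

```python
from collections import deque


class SignupFailureAlarm:
    """Hypothetical alarm: fires when too many recent sign-ups fail."""

    def __init__(self, window=100, threshold=0.2, min_samples=20):
        # Keep only the most recent `window` outcomes (True = success).
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold      # failure ratio that triggers the alarm
        self.min_samples = min_samples  # avoid alerting on a handful of requests

    def failure_rate(self):
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def should_alert(self):
        return (
            len(self.outcomes) >= self.min_samples
            and self.failure_rate() > self.threshold
        )

    def record(self, succeeded):
        # Record one sign-up outcome and report whether the alarm is firing.
        self.outcomes.append(bool(succeeded))
        return self.should_alert()
```

You'd call `record(False)` wherever a sign-up is rejected (for example, on a unique constraint violation) and page someone when it returns `True`. The point is just that a spike in rejected sign-ups right after launch is detectable within minutes, even when every rejection is "correct" behavior as far as the code is concerned.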
Comment by Yan_LB at 27/01/2025 at 00:05 UTC
-4 upvotes, 2 direct replies
Look, my new comment