Comment by TScottFitzgerald on 26/01/2025 at 22:38 UTC

58 upvotes, 7 direct replies (showing 7)

View submission: Massive Failure on the Product

What was the issue?

Replies

Comment by According-Ad1997 at 27/01/2025 at 01:10 UTC

90 upvotes, 3 direct replies

It seems they stored guest users and actual permanent users in the same table, and the table had a unique constraint on email. When returning guest users tried to sign up for an account, the db probably threw a unique constraint violation error and rejected the signup since the email was already taken.

All in all, this is a bad thing to happen on rollout but not the worst, especially if the product is good. People will come back. It should be easily fixable if you can identify guest users.
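
To make the guess above concrete, here is a minimal sketch using Python's built-in `sqlite3`. The schema, the `is_guest` flag, and the function names are all assumptions made for illustration; nothing here is known about the actual product's database. It shows a plain INSERT failing for a returning guest, and one possible fix that promotes the guest row instead.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical schema: guests and permanent users share one table,
    # and email carries a UNIQUE constraint.
    conn.execute(
        """
        CREATE TABLE users (
            id INTEGER PRIMARY KEY,
            email TEXT NOT NULL UNIQUE,
            is_guest INTEGER NOT NULL DEFAULT 0,
            password_hash TEXT
        )
        """
    )


def naive_sign_up(conn, email, password_hash):
    # Plain INSERT: raises sqlite3.IntegrityError if any row,
    # including a guest row, already holds this email.
    conn.execute(
        "INSERT INTO users (email, is_guest, password_hash) VALUES (?, 0, ?)",
        (email, password_hash),
    )


def sign_up(conn, email, password_hash):
    """Return 'upgraded', 'created', or 'taken'."""
    # First try to promote an existing guest row to a permanent account.
    cur = conn.execute(
        "UPDATE users SET is_guest = 0, password_hash = ? "
        "WHERE email = ? AND is_guest = 1",
        (password_hash, email),
    )
    if cur.rowcount == 1:
        return "upgraded"
    # No guest row matched, so insert; a violation now means a real duplicate.
    try:
        naive_sign_up(conn, email, password_hash)
    except sqlite3.IntegrityError:
        return "taken"
    return "created"
```

With a guest row already present for an email, `naive_sign_up` raises `sqlite3.IntegrityError`, while `sign_up` returns `"upgraded"`. This only works if guest rows can be told apart from permanent ones, which is the "if you can identify guest users" caveat.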

Comment by canadian_webdev at 26/01/2025 at 23:33 UTC

62 upvotes, 2 direct replies

The backend was never built.

Comment by TLJGame at 27/01/2025 at 00:48 UTC

12 upvotes, 1 direct replies

https://www.reddit.com/r/webdev/s/eTQhPTsKrh

Found the issue

Comment by halfxdeveloper at 26/01/2025 at 23:15 UTC

30 upvotes, 0 direct replies

They didn’t load test.

Comment by ihopnavajo at 27/01/2025 at 00:21 UTC

6 upvotes, 0 direct replies

The limit DID exist

Comment by jmking at 28/01/2025 at 07:20 UTC

1 upvotes, 0 direct replies

I know you're asking because you're probably just curious (I was too).

But I'm piggybacking off your comment to say that for this situation, "what was the issue?" is the wrong question. I'd ask "How long did it take to discover the issue?" and then "How long from discovering the issue did it take to restore service?".

I personally probably wouldn't have thought to test this situation, especially if I didn't know about this whole "past guest users who aren't really users" thing. The point is shit happens and you should expect it will happen. Ideally you do what you can to catch as many issues as possible before they hit prod, but even 100% test coverage wouldn't have caught this. The system was technically working correctly. It was preventing duplicate signups.

What's often overlooked, however, is monitoring, telemetry, observability, alarms, etc. to proactively detect problems before you hear about them from your users. Then the time to remediation is the next most important thing. How fast is the rollback process? Do you even have the ability to roll back a bad deploy?

The last place you want to find yourself is having to rush a fix because all you can really do is roll forward.
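
As a rough illustration of the alarm idea above, here is a minimal sketch of a sliding-window error-rate check on signup attempts. The class name, window size, and thresholds are all hypothetical; a real setup would more likely lean on an existing monitoring stack than on hand-rolled code like this.

```python
from collections import deque


class ErrorRateAlarm:
    """Tracks the last N signup outcomes and fires when too many failed."""

    def __init__(self, window_size=100, threshold=0.2, min_samples=20):
        # deque with maxlen drops the oldest outcome automatically.
        self.outcomes = deque(maxlen=window_size)
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, succeeded):
        # Store one outcome and report whether the alarm is now firing.
        self.outcomes.append(bool(succeeded))
        return self.is_firing()

    def error_rate(self):
        if not self.outcomes:
            return 0.0
        failures = sum(1 for ok in self.outcomes if not ok)
        return failures / len(self.outcomes)

    def is_firing(self):
        # Require a minimum number of samples so a single early
        # failure does not page anyone.
        enough_data = len(self.outcomes) >= self.min_samples
        return enough_data and self.error_rate() >= self.threshold
```

The signup handler would call `alarm.record(False)` whenever it rejects a signup, so a spike in rejections trips the alarm within a few dozen requests rather than when users start complaining.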

Comment by Yan_LB at 27/01/2025 at 00:05 UTC

-4 upvotes, 2 direct replies

Look, my new comment