A few days to change what amounts to about three lines of code
I spent a few days tracking down a tough bug in “Project: Sippy-Cup [1]”. The bug wasn't in the happy path [2] (explained here [3]) but the “unhappy path” that happens when a timeout occurs [4] [memo to self: check the blog before spending a few hours rewriting what you previously wrote].
We were able to reproduce the error rather quickly. What took the time was trying to isolate the root cause, and what I should have done [memo to self: reread my previous memo to self] was enable the statement logging [5] I did a few years ago. Doing so allowed me to find the root rather quickly.
Each request is handled in a coroutine (think “very cheap thread”). The coroutine will acknowledge the incoming SIP (Session Initiation Protocol) message, extract the data we need from said message and send it to the “Business Logic Unit.” But because we have a limited amount of time to do that voodoo that we do (else we exceed our SLA (Service Level Agreement) and have to pay large amounts of money to The Monopolistic Phone Company), we will only wait for a limited period of time for the reply from the “Business Logic Unit.” In either case, we then send out a SIP message and wait for an acknowledgement that it was received.
The happy path—nothing times out and everybody is happy. The unhappy path is when the reply from the “Business Logic Unit” is late, and that late reply was somehow showing up as the “acknowledgement,” which my code didn't handle gracefully at all!
So normally the program does:
- coroutine: send SIP acknowledgement;
- coroutine: send data to “Business Logic Unit”
- coroutine: yield CPU (Central Processing Unit)
- main: receive “Business Logic Unit” reply
- main: parse “Business Logic Unit” reply
- main: schedule coroutine to run with parsed “Business Logic Unit” data
- main: resume coroutine
- coroutine: create new SIP message
- coroutine: send SIP message
- coroutine: yield CPU
- main: receive acknowledgement for SIP message
- main: schedule coroutine to run with parsed acknowledgement
- coroutine: exit satisfied in doing a job well done
And the unhappy path was supposed to go like:
- coroutine: send SIP acknowledgement;
- coroutine: send data to “Business Logic Unit”
- coroutine: yield CPU
- main: determines that coroutine has timed out waiting for “Business Logic Unit” data
- main: schedule coroutine to run with timedout error
- main: receive “Business Logic Unit” reply
- main: parse “Business Logic Unit” reply
- main: **ignore “Business Logic Unit” reply since it's too late**
- main: resume coroutine
- coroutine: create a new SIP message with “Data Unavailable”
- coroutine: send SIP message
- coroutine: yield CPU
- main: receive acknowledgement for SIP message
- main: schedule coroutine to run with parsed acknowledgement
- coroutine: exit satisfied in doing a job well done
So really, not much difference. But in reality things were different:
- coroutine: send SIP acknowledgement;
- coroutine: send data to “Business Logic Unit”
- coroutine: yield CPU
- main: determines that coroutine has timed out waiting for “Business Logic Unit” data
- main: schedule coroutine to run with timedout error
- main: receive “Business Logic Unit” reply
- main: parse “Business Logic Unit” reply
- main: **schedule coroutine to run with parsed “Business Logic Unit” data**
- main: resume coroutine
- coroutine: create a new SIP message with “Data Unavailable”
- coroutine: send SIP message
- coroutine: yield CPU
- main: resume coroutine (because of the second scheduling call)
- coroutine: expect acknowledgement, get something unexpected instead
- coroutine: CRASH!
The fix, as always, was easy—if a coroutine is scheduled to run, don't reschedule it. What's not so easy is finding the root cause.
[1] /boston/2014/03/05.1
[2] https://en.wikipedia.org/wiki/Happy_path
[3] /boston/2015/02/05.1
[4] /boston/2015/02/16.1
[5] /boston/2015/02/12.1
Gemini Mention this post
Contact the author