I use coroutines [8] in Lua [9] and I do miss having them in C. Sure, there are a number of coroutine implementations for C, but they all generally fall into two camps:
But really, I just wanted to write my own implementation [10], because.
So the idea is to write some code that works like:
    extern int       coroutine_create(coroutine__s **pcp,coroutine_function__f fun);
    extern uintptr_t coroutine_yield (coroutine__s *co,uintptr_t value);
    extern int       coroutine_free  (coroutine__s *co);

    uintptr_t sub_task(coroutine__s *self,uintptr_t value)  /* [1] */
    {
      value = coroutine_yield(self,value);    /* [2] */
                                              /* [3] */
      value = coroutine_yield(self,value);    /* [4] */
                                              /* [5] */
      value = coroutine_yield(self,value);    /* [6] */
                                              /* [7] */
      return value;
    }

    void main_task(void)
    {
      coroutine__s *co;
      uintptr_t     v = 0;    /* initialized so the first yield passes a defined value */

      coroutine_create(&co,sub_task);         /* [8] magic here! */
                                              /* [ 9] */
      v = coroutine_yield(co,v);              /* [10] */
                                              /* [11] */
      v = coroutine_yield(co,v);              /* [12] */
                                              /* [13] */
      v = coroutine_yield(co,v);              /* [14] */
                                              /* [15] */
      v = coroutine_yield(co,v);              /* [16] */
                                              /* [17] */
      v = coroutine_yield(co,v);              /* [18] */
                                              /* [19] */
      coroutine_free(co);
    }
It's a contrived example that one would not use coroutines for, but it does serve to illustrate the issue that popped up while developing the code for this. And I'm going to start with coroutine_yield(), as that does the actual switching of the stack to another “unit of execution” (note: this code is for the Intel 32-bit x86 architecture):
    %assign P_param     8 + 16
    %assign P_co        4 + 16

    coroutine_yield32:
            push    ebp                     ; save callee saved registers
            push    ebx
            push    esi
            push    edi
            mov     eax,[esp + P_param]     ; return parameter
            mov     edx,[esp + P_co]        ; get stack to yield to
            xchg    esp,[edx]               ; YIELD!
            pop     edi                     ; restore registers
            pop     esi
            pop     ebx
            pop     ebp
            ret
Since this is interfacing with C, I have to use the x86 32-bit calling convention [11] (and for the record, I'm using the Intel syntax, not the AT&T syntax). Parameters are passed on the stack, and the callee (in this case, coroutine_yield32()) needs to save certain registers.
Normally, when switching a “unit of execution” such as a thread or process, one needs to save the entire CPU state. But I can cheat here: I'm calling a function, so I can skip saving the registers the callee is allowed to use (read: trash), which saves a bit of time in the switch. So that's what's going on here. I save the registers that the C calling convention requires saving, put P_param into EAX to return it, get the pointer to the stack we're switching to, and at the line that states “YIELD!” we switch the “units of execution.” The final five instructions run under the coroutine, where we restore the saved registers and return into our now running coroutine.
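To make the offsets a bit more concrete, here is a small C sketch of what the yielding side's stack looks like right after the four pushes. The struct and its field names are purely illustrative (the real code never declares anything like it); the only facts taken from the assembly are the push order and the `4 + 16` and `8 + 16` offsets.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the stack just after the four PUSH instructions in
 * coroutine_yield32, lowest address (where ESP points) first.  Field names
 * are made up for illustration. */
struct yield_frame32 {
    uint32_t edi;       /* pushed last, so it sits at [esp + 0]           */
    uint32_t esi;       /* [esp + 4]                                      */
    uint32_t ebx;       /* [esp + 8]                                      */
    uint32_t ebp;       /* pushed first, so it sits at [esp + 12]         */
    uint32_t ret_addr;  /* left by the caller's CALL, [esp + 16]          */
    uint32_t co;        /* first C argument,  [esp + 4 + 16], i.e. P_co   */
    uint32_t param;     /* second C argument, [esp + 8 + 16], i.e. P_param */
};

/* The byte offsets the assembly spells as "4 + 16" and "8 + 16". */
enum {
    SKETCH_P_CO    = offsetof(struct yield_frame32, co),
    SKETCH_P_PARAM = offsetof(struct yield_frame32, param)
};
```

The first five words of this picture (four registers plus a return address) are exactly what the other side expects to find when it pops and returns, which is why a freshly created coroutine needs its stack pre-populated in the same shape.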
But now here's the problem: this assumes the stack for the coroutine is properly initialized. Referring back to the C code, line 12 will yield back to line 3 and it works there because everything has been set up. But line 10 is problematic. That's the first switching of execution, and we haven't actually started sub_task(), which is expecting arguments already existing on the stack. Furthermore, for the C calling convention to work, we need to actually call sub_task(). I really don't want to mess up coroutine_yield() with special code to handle that case (that's just … ugly). I want to handle this cleanly.
So the first coroutine_yield() needs to call into (as per our example) sub_task(). The code for that looks like:
            push    eax                     ; return from coroutine_yield
            push    <coroutine self parameter>
            call    <our function>
Setting aside where we'll get the coroutine self parameter and the address for the function, we just need to ensure that our first call to coroutine_yield() resumes to this code fragment. And we can do that in coroutine_create(): initialize the stack of the coroutine properly such that that happens. So let's name our fragment:
    start_it_up:
            push    eax                     ; return from coroutine_yield
            push    <coroutine self parameter>
            call    <our function>
and we can initialize the coroutine stack:
            mov     dword [ecx + 16],start_it_up
            xor     eax,eax
            mov     [ecx + 8],eax           ; "saved" EBX
            mov     [ecx + 4],eax           ; "saved" ESI
            mov     [ecx + 0],eax           ; "saved" EDI
            mov     [edx],ecx
For now, just accept that we have the new coroutine stack pointer in ECX (the final version uses ECX but I don't want to spoil things too much at this point). This populates the stack with the values needed for coroutine_yield() to fall into our code fragment, which is technically a thunk [12]. Now we turn our attention to saving the data required for our new thunk to call our function.
Now, on the 32-bit x86, a classical stack frame will look something like this:
Table: Typical stack frame

    offset from EBP   contents
    ------------------------------
     12               parameter 2
      8               parameter 1
      4               return address
      0               previous stack frame address (previous EBP)
     -4               local variable 1
     -8               local variable 2
The thunk doesn't need parameters, nor does it need the return address or even a previous stack frame. We just need some local variables. So set up the stack like:
Table: Coroutine stack contents

    EBP of coroutine ->
                          coroutine pointer
                          address of sub_task()
                          address of start_it_up
                          stack frame for start_it_up
                          “saved” EBX
                          “saved” ESI
    ESP of coroutine ->   “saved” EDI
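The same layout can be modelled in plain C by treating the top of the new stack as an array of machine words. This is only a sketch under a few assumptions: every name in it is hypothetical, slots are counted in words rather than bytes (one word is 4 bytes on 32-bit x86), and I am assuming the "stack frame for start_it_up" slot holds the value the coroutine's EBP should have once the thunk starts running.

```c
#include <stddef.h>
#include <stdint.h>

/* Slot numbers counted in words below the coroutine's EBP.
 * All of these names are illustrative, not taken from the real code. */
enum {
    SLOT_CO = 1,    /* [ebp - 4]: coroutine pointer                   */
    SLOT_FUN,       /* [ebp - 8]: address of the coroutine function   */
    SLOT_THUNK,     /* return address that RET will jump to           */
    SLOT_EBP,       /* "saved" EBP: the frame the thunk will use      */
    SLOT_EBX,       /* "saved" EBX                                    */
    SLOT_ESI,       /* "saved" ESI                                    */
    SLOT_EDI,       /* "saved" EDI: the initial ESP points here       */
    SLOT_COUNT = SLOT_EDI
};

/* Fill in the top of a fresh stack.  'top' is one word past the highest
 * usable word, which is where the coroutine's EBP will point.  Returns
 * the value to store as the coroutine's initial stack pointer. */
uintptr_t *sketch_init_stack(uintptr_t *top, uintptr_t co,
                             uintptr_t fun, uintptr_t thunk)
{
    top[-SLOT_CO]    = co;
    top[-SLOT_FUN]   = fun;
    top[-SLOT_THUNK] = thunk;
    top[-SLOT_EBP]   = (uintptr_t)top;  /* assumption: thunk's frame pointer */
    top[-SLOT_EBX]   = 0;
    top[-SLOT_ESI]   = 0;
    top[-SLOT_EDI]   = 0;
    return top - SLOT_COUNT;
}
```

Reading upward from the returned pointer gives EDI, ESI, EBX, EBP and then the thunk's address, which is the order the tail of coroutine_yield32 pops and returns through.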
We can fix start_it_up:
    %assign L_co        -4
    %assign L_fun       -8

    start_it_up:
            push    eax
            push    dword [ebp + L_co]
            call    [ebp + L_fun]
And with our C example, this will get us through line 15. At line 16 we have an issue: we resume at line 7 and our coroutine now returns. We did call it, so its return value comes back to our thunk, and the easy thing to do is just yield it back. And since we already have the stack set up for a call, we can save some instructions:
    %assign L_co        -4
    %assign L_fun       -8
    %assign C_param     -12

    start_it_up:
            push    eax
            push    dword [ebp + L_co]
            call    [ebp + L_fun]
            mov     [ebp + C_param],eax
            call    coroutine_yield32
And that will get us to line 18. But now we no longer have a running coroutine and we've run off the bottom of our thunk. There are two options here:
Both are valid responses, but I like the second one better as you might not know if a coroutine has finished or not. And that just requires one more instruction to start_it_up:
    %assign L_co        -4
    %assign L_fun       -8
    %assign C_param     -12

    start_it_up:
            push    eax
            push    dword [ebp + L_co]
            call    [ebp + L_fun]
    do_it_again:
            mov     [ebp + C_param],eax
            call    coroutine_yield32
            jmp     do_it_again
And there you go—coroutines for C [13].
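For readers who don't speak x86, here is a rough C analogue of the whole flow. It is not how the code above works: it swaps in the POSIX ucontext calls (getcontext(), makecontext(), swapcontext()) for the hand-written stack switch, and every sketch_ name is made up for this illustration. What it does try to mirror is the thunk: call the function once, then keep yielding forever.

```c
#include <stdint.h>
#include <stdlib.h>
#include <ucontext.h>

#define SKETCH_STACK_SIZE (64 * 1024)   /* arbitrary size for this sketch */

typedef struct sketch_co sketch_co;
typedef uintptr_t (*sketch_fn)(sketch_co *, uintptr_t);

struct sketch_co {
    ucontext_t caller;    /* saved state of whoever resumed the coroutine */
    ucontext_t callee;    /* saved state of the coroutine itself          */
    int        inside;    /* nonzero while the coroutine is running       */
    uintptr_t  transfer;  /* value handed across the switch (EAX's job)   */
    sketch_fn  fun;
    void      *stack;
};

/* One function for both directions, like coroutine_yield(). */
uintptr_t sketch_yield(sketch_co *co, uintptr_t value)
{
    co->transfer = value;
    if (co->inside) {
        co->inside = 0;
        swapcontext(&co->callee, &co->caller);
    } else {
        co->inside = 1;
        swapcontext(&co->caller, &co->callee);
    }
    return co->transfer;   /* whatever the other side passed in */
}

/* Plays the role of start_it_up.  makecontext() only passes ints, so the
 * coroutine pointer arrives split into two halves. */
static void sketch_start_it_up(unsigned lo, unsigned hi)
{
    sketch_co *co = (sketch_co *)(uintptr_t)(((uint64_t)hi << 32) | lo);
    uintptr_t  v  = co->fun(co, co->transfer);

    for (;;)                       /* the "jmp do_it_again" loop:        */
        v = sketch_yield(co, v);   /* yield, then reuse what came back   */
}

int sketch_create(sketch_co **pco, sketch_fn fun)
{
    sketch_co *co = calloc(1, sizeof *co);
    if (co == NULL)
        return -1;
    co->stack = malloc(SKETCH_STACK_SIZE);
    if (co->stack == NULL || getcontext(&co->callee) != 0) {
        free(co->stack);
        free(co);
        return -1;
    }
    co->fun                     = fun;
    co->callee.uc_stack.ss_sp   = co->stack;
    co->callee.uc_stack.ss_size = SKETCH_STACK_SIZE;
    co->callee.uc_link          = NULL;   /* never used: the thunk never returns */

    uint64_t bits = (uint64_t)(uintptr_t)co;
    makecontext(&co->callee, (void (*)(void))sketch_start_it_up, 2,
                (unsigned)(bits & 0xffffffffu), (unsigned)(bits >> 32));
    *pco = co;
    return 0;
}

void sketch_free(sketch_co *co)
{
    free(co->stack);
    free(co);
}

/* Same shape as sub_task() above, but it bumps the value so the hand-offs
 * are visible: three yields, then a return. */
uintptr_t sketch_sub_task(sketch_co *self, uintptr_t value)
{
    value = sketch_yield(self, value + 1);
    value = sketch_yield(self, value + 1);
    value = sketch_yield(self, value + 1);
    return value + 1;
}
```

Driving sketch_sub_task() the way main_task() drives sub_task(), the fourth yield from the caller receives the function's return value, and every yield after that simply gets back whatever was passed in, because the loop (like the assembly) reuses the value left over from the previous switch. Note that swapcontext() also saves and restores the signal mask, so it does noticeably more work per switch than the dozen instructions above.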
The 64-bit version [14] is pretty much the same. The registers that need to be saved are different, and the parameters are passed in registers instead of on the stack, but overall, it's the same approach.
Should this code be used in production? I don't know. It works for Linux (both 32 and 64 bit versions) and for Mac OS-X (64 bit version). And while you can use setjmp()/longjmp(), you **CANNOT** do so across coroutine stacks (within the same coroutine—fine). And this has only been tested for C, **NOT** for C++. I don't know enough about C++ (or its calling conventions or exception handling) to recommend this for that.
But really that's all there is to it for coroutines in C.
And the final question—what are coroutines good for? That's for another post.
[1] http://www.1024cores.net/home/lock-free-algorithms/tricks/fibers
[2] http://fanf.livejournal.com/105413.html
[3] https://github.com/stevedekorte/coroutine
[4] https://swtch.com/libtask/
[5] http://yosefk.com/blog/coroutines-in-one-page-of-c.html
[6] http://www.chiark.greenend.org.uk/~sgtatham/coroutines.html
[7] https://github.com/mpu/gthreads/tree/code0
[10] https://github.com/spc476/C-Coroutines
[11] https://idea.popcount.org/2013-07-16-baby-steps-in-x86-assembly/
[12] https://en.wikipedia.org/wiki/Thunk
[13] https://github.com/spc476/C-Coroutines/blob/master/coroutine_yield-x86-32.asm
[14] https://github.com/spc476/C-Coroutines/blob/master/coroutine_yield-x86-64.asm