An implementation of coroutines for C

Obligatory Sidebar Links

I use coroutines [8] in Lua [9] and I do miss having them in C. Sure, there are a number of coroutine implementations for C, but they all generally fall into two camps:

But really, I just wanted to write my own implementation [10], because.

So the idea is to write some code that works like:

extern int       coroutine_create(coroutine__s **pcp,coroutine_function__f fun);
extern uintptr_t coroutine_yield (coroutine__s *co,uintptr_t value);
extern int       coroutine_free  (coroutine__s *co);

uintptr_t sub_task(coroutine__s *self,uintptr_t value) /* [1] */
{
  value = coroutine_yield(self,value); /* [2] */
	/* [3] */
  value = coroutine_yield(self,value); /* [4] */
	/* [5] */
  value = coroutine_yield(self,value); /* [6] */
	/* [7] */
  return value;
}

void main_task(void)
{
  coroutine__s *co;
  uintptr_t     v;

  coroutine_create(&co,sub_task); /* [8] magic here! */
	/* [ 9] */
  v = coroutine_yield(co,v); /* [10] */
	/* [11] */
  v = coroutine_yield(co,v); /* [12] */
	/* [13] */
  v = coroutine_yield(co,v); /* [14] */
	/* [15] */
  v = coroutine_yield(co,v); /* [16] */
	/* [17] */
  v = coroutine_yield(co,v); /* [18] */
	/* [19] */
  coroutine_free(co);
}

It's a contrived example that one would not use coroutines for, but it does serve to illustrate the issue that popped up while developing the code for this. And I"m going to start coroutine_yield(), as that does the actual switching of the stack to another “unit of execution” (note: this code is for the Intel 32-bit x86 architecture):

%assign	P_param		8 + 16
%assign	P_co		4 + 16

coroutine_yield32:
		push	ebp			; save callee saved registers
		push	ebx
		push	esi
		push	edi

		mov	eax,[esp + P_param]	; return parameter
		mov	edx,[esp + P_co]	; get stack to yield to
		xchg	esp,[edx]		; YIELD!

		pop	edi			; retore registers
		pop	esi
		pop	ebx
		pop	ebp
		ret

Since this is interfacing with C, I have to use the x86 32-bit calling convention [11] (and for the record, I'm using the Intel syntax, not the AT&T syntax). Parameters are passed on the stack, and the callee (in this case, coroutine_yield32()) needs to save certain registers.

Normally, when switching a “unit of execution” such as a thread or process, one needs to save the entire CPU state. But I can cheat here—I'm calling a function, so I can skip saving registers the callee can use (read: trash), which saves a bit of time in the switching. So that's what's going on here. I have the registers that the C calling convention require saving, putting P_param into EAX to return it, get the pointer to the stack we're switching to and at the line that states “YIELD!” we switch the “units of execution.” The final five instructions are running under the coroutine, where we pull the registers saved and return into our now running coroutine.

But now here's the problem—this assumes the stack for the coroutine is properly initialized. Refering back to the C code, line 12 will yield back to line 3 and it works there because everything has been set up. But line 10 is problematic—that's the first switching of execution, and we haven't actually started sub_task(), which is expecting arguments already existing on the stack. Furthermore, for the C calling convention to work, we need to actually call sub_task(). I really don't want to mess up coroutine_yield() with special code to handle that case (that's just … ugly). I want to handle this cleanly.

So the first coroutine_yield() needs to call into (as per our example) sub_task(). The code for that looks like:

		push	eax		; return from coroutine_yield
		push	<coroutine self parameter>
		call	<our function>

Setting aside where we'll get the coroutine self paramter and the address for the function, we just need to ensure that our first call to coroutine_yield() resumes to this code fragment. And we can do that in the coroutine_create()—initialize the stack of the coroutine properly such that that happens. So let's name our fragment:

start_it_up:	push	eax		; return from coroutine_yield
		push	<coroutine self parameter>
		call	<our function>

and we can initialize the coroutine stack:

		mov	dword [ecx + 16],start_it_up
		xor	eax,eax
		mov	[ecx + 8],eax		; "saved" EBX
		mov	[ecx + 4],eax		; "saved" ESI
		mov	[ecx + 0],eax		; "saved" EDI
		mov	[edx],ecx		

For now, just accept that we have the new coroutine stack pointer in ECX (the final version uses ECX but I don't want to spoil things too much at this point). This populates the stack with the values needed for coroutine_yield() to fall into our code fragment, which is techincally a thunk [12]. Now we turn our attention to saving the data required for our new thunk to call our function.

Now, on the 32-bit x86, a classical stack frame will look something like this:

Table: Typical stack frame
offset from EBP	contents
------------------------------
12	parameter 2
8	parameter 1
4	return address
0	previous stack frame address (previous EBP)
-4	local variable 1
-8	local variable 2

The thunk doesn't need paramaters, nor does it need the return address or even a previous stack frame. We just need some local variables. So set up the stack like:

Table: Coroutine stack contents
EBP of coroutine	
	coroutine pointer
	address of sub_task()
	address of start_it_up
	stack frame for start_it_up
	“saved” EBX
	“saved” ESI
ESP of coroutine	“saved” EDI

We can fix start_it_up:

%assign L_co		-4
%assign L_fun		-8

start_it_up:	push	eax
		push	dword [ebp + L_co]
		call	[ebp + L_fun]

And with our C example, this will get us to through line 15. At line 16 we have an issue, where we resume at line 7 and our coroutine now returns. Well, we did call it, so we get its return value back to our thunk. Well, the easy thing here is to just yield it back. And since we have the stack set for a call, we can save some instructions:

%assign L_co		-4
%assign L_fun		-8
%assign C_param		-12

start_it_up:	push	eax
		push	dword [ebp + L_co]
		call	[ebp + L_fun]

		mov	ebp + C_param],eax
		call	coroutine_yield32

And that will get us to line 18. But now we no longer have a running coroutine and we've run off the bottom of our thunk. There are two options here:

Both are valid responses, but I like the second one better as you might not know if a coroutine has finished or not. And that just requires one more instruction to start_it_up:

%assign L_co		-4
%assign L_fun		-8
%assign C_param		-12

start_it_up:	push	eax
		push	dword [ebp + L_co]
		call	[ebp + L_fun]

do_it_again:	mov	ebp + C_param],eax
		call	coroutine_yield32
		jmp	do_it_again

And there you go—coroutines for C [13].

The 64-bit version [14] is pretty much the same—just that the registers needed to be saves are different, and the parameters are passed in registers instead of the stack, but overall, it's the same approach.

Should this code be used in production? I don't know. It works for Linux (both 32 and 64 bit versions) and for Mac OS-X (64 bit version). And while you can use setjmp()/longjmp(), you **CANNOT** do so across coroutine stacks (within the same coroutine—fine). And this has only been tested for C, **NOT** for C++. I don't know enough about C++ (or its calling conventions or exception handling) to recommend this for that.

But really that's all there is to it for coroutines in C.

And the final question—what are coroutines good for? That's for another post.

[1] http://www.1024cores.net/home/lock-free-algorithms/tricks/fibers

[2] http://fanf.livejournal.com/105413.html

[3] https://github.com/stevedekorte/coroutine

[4] https://swtch.com/libtask/

[5] http://yosefk.com/blog/coroutines-in-one-page-of-c.html

[6] http://www.chiark.greenend.org.uk/~sgtatham/coroutines.html

[7] https://github.com/mpu/gthreads/tree/code0

[8] /boston/2007/01/14.1

[9] http://www.lua.org/

[10] https://github.com/spc476/C-Coroutines

[11] https://idea.popcount.org/2013-07-16-baby-steps-in-x86-assembly/

[12] https://en.wikipedia.org/wiki/Thunk

[13] https://github.com/spc476/C-Coroutines/blob/master/coroutine_yield-x86-32.asm

[14] https://github.com/spc476/C-Coroutines/blob/master/coroutine_yield-x86-64.asm

Gemini Mention this post

Contact the author