I'm still obsessing over stupid benchmarks. This time, I was curious about the overhead of function calls versus system calls, the kind you'd see in a typical locking scenario like:
```
int gv;

void t(void)
{
  while(gv < 1000000000)
  {
    lock();
    gv++;
    unlock();
  }
}
```
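Here, lock() and unlock() are just stand-ins for whatever locking primitive you'd actually use; with POSIX threads, for instance, the loop would look something like this (a sketch only, and the mutex name lk is made up for the example):
>
```
#include <pthread.h>

/* Sketch: the same loop with a pthread mutex standing in for the
   lock()/unlock() placeholders above. */
static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;
int gv;

void t(void)
{
  while(gv < 1000000000)
  {
    pthread_mutex_lock(&lk);
    gv++;
    pthread_mutex_unlock(&lk);
  }
}
```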
So, I wrote some test code:
```
#include <sys/types.h>
#include <signal.h>
#include <unistd.h>

int gv;

/* Test 1: function call overhead only; null() does nothing at all. */
void null(void)
{
}

void t1(void)
{
  while(gv < 1000000000)
  {
    null();
    gv++;
    null();
  }
}

/* Test 2: system call overhead, using kill(pid,0) as a do-nothing call. */
void t2(void)
{
  pid_t me;

  me = getpid();
  while(gv < 1000000000)
  {
    kill(me,0);
    gv++;
    kill(me,0);
  }
}
```
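If you want to reproduce the numbers below, a harness along these lines will do; this is just a sketch (running the whole program under time(1) works just as well):
>
```
#include <stdio.h>
#include <sys/time.h>

extern void t1(void);
extern void t2(void);

/* Sketch of a timing harness: wall clock time is all we care about here. */
static double now(void)
{
  struct timeval tv;

  gettimeofday(&tv,NULL);
  return tv.tv_sec + tv.tv_usec / 1000000.0;
}

int main(void)
{
  double start;

  start = now();
  t1();        /* or t2(), etc. */
  printf("%.1f seconds\n",now() - start);
  return 0;
}
```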
The first test is pretty simple: it just measures the function call overhead, and that particular piece of code takes 9 seconds (give or take a few tenths of a second) to run. The second test, though, is where this tale gets interesting.
I wanted to test the system call overhead, and thus, I needed a system call that did very little. I found this bit in the kill() manpage:
> If sig is 0, then no signal is sent, but error checking is still performed.
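In other words, kill(pid,0) just validates the pid and checks whether you're allowed to signal it, without delivering anything. That's why it's the classic "does this process exist?" trick; a sketch of that use, purely for illustration and not part of the benchmark:
>
```
#include <sys/types.h>
#include <signal.h>
#include <errno.h>

/* Sketch: nonzero if pid refers to a live process. */
int process_exists(pid_t pid)
{
  if (kill(pid,0) == 0)
    return 1;               /* it exists and we can signal it */
  return errno == EPERM;    /* it exists but belongs to someone else;
                               ESRCH (or anything else) means no such process */
}
```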
That's about as close to a null system call as I could find. No real bad side effects, and I don't particularly care about the return code (since this is just measuring overhead). So I ran that bit of code. Fifteen minutes later it finished.
Fifteen minutes.
Either system calls are slow, or that particular system call does a vast amount of nothing. Curious, I delved into that part of the kernel source (I'm running all this under Linux, by the way) and came across this bit of code:
```
asmlinkage long sys_getpid(void)
{
        return current->tgid;
}
```
Now, I knew that getpid() was a system call, but I didn't realize it was a real system call. What I mean is, even though it's documented in section 2 of the manual (the section usually reserved for system calls), that doesn't guarantee it actually traps into the kernel. For instance, alarm() is documented in section 2, but it's a library routine that ends up calling other system calls. And I had assumed getpid() was a library call that just returned a value (the current process ID) stored in userspace, which makes sense to me, since the PID doesn't change as the process runs. I didn't expect it to be an actual system call.
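(What I had pictured was something along these lines; this is just a sketch of my assumption, not actual glibc code, and assumed_getpid() is a made-up name:)
>
```
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sketch of what I assumed the library did: ask the kernel once,
   then hand back a cached copy from userspace forever after. */
static pid_t cached_pid;

pid_t assumed_getpid(void)
{
  if (cached_pid == 0)
    cached_pid = (pid_t)syscall(SYS_getpid);  /* the one real kernel call */
  return cached_pid;
}
```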
Well then.
And it's about as low-overhead a system call as you can get.
So I rewrote t2():
```
void t2(void)
{
  while(gv < 1000000000)
  {
    getpid();
    gv++;
    getpid();
  }
}
```
And the runtime dropped to 17 seconds.
Okay, much better. Not quite twice the function call overhead. But what the heck was kill() doing that warranted fifteen minutes? I checked the Linux kernel source for kill() and it didn't appear to be doing all that much. So what was going on?
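(A quick back-of-the-envelope check: each loop makes two calls per increment, so that's about two billion calls per run. Nine seconds works out to roughly 4.5 nanoseconds per null() call, 17 seconds to roughly 8.5 nanoseconds per getpid() call, and fifteen minutes to something like 450 nanoseconds per kill() call.)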
So what does the actual code for getpid() look like? Was it perhaps making the system call once and caching the result? The easiest way to check was to statically compile the test program and disassemble it with objdump:
```
0804e380 <__getpid>:
804e380: b8 14 00 00 00 mov $0x14,%eax
804e385: cd 80 int $0x80
804e387: c3 ret
```
Okay. Not much to it. What if I make the system call directly then? And here, I resorted to assembly (since the inline assembly directives for gcc make no sense to me whatsoever):
```
t41:    mov   eax,20          ; system call 20 is getpid on 32-bit Linux
        int   $80             ; $80 is NASM's way of writing hex 0x80
        mov   eax,[gv]        ; gv is the global counter defined in the C code
        inc   eax
        mov   [gv],eax
        mov   eax,20          ; and a second getpid, as in the C tests
        int   $80
        mov   eax,[gv]
        cmp   eax,1000000000
        jl    t41
        ret
```
(Oh, the reason the code looks a bit different? objdump uses AT&T syntax, while nasm uses Intel syntax.)
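For comparison, the same raw call written as gcc inline assembly would look something like the sketch below. I didn't use this (the constraint syntax is exactly the part that makes no sense to me), so take it as illustrative only:
>
```
/* Sketch: gcc inline-assembly version of the raw int 0x80 getpid call
   (32-bit x86 Linux, system call number 20). */
static inline long raw_getpid(void)
{
  long ret;

  __asm__ volatile ("int $0x80"
                    : "=a" (ret)   /* the result comes back in eax */
                    : "0" (20));   /* the call number goes into eax */
  return ret;
}
```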
It was a bit surprising to find myself waiting 11 minutes for that NASM loop to run.
Okay, that's more on par with using the kill() system call, but it doesn't explain why my version took 66 times longer. Perhaps it's a CPU pipeline issue? Perhaps it's faster to call a function that invokes a system call than it is to inline the system call itself. So I called the C library's getpid() function from the assembly loop:
```
        extern __getpid            ; the C library's getpid() routine

t42:    call  __getpid
        inc   dword [gv]
        call  __getpid
        cmp   dword [gv],1000000000
        jl    t42
        ret
```
Back to 17 seconds. Okay, what if I supply my own wrapper function around the getpid() system call?
```
t43:    call  getpid               ; my own wrapper, defined just below
        inc   dword [gv]
        call  getpid
        cmp   dword [gv],1000000000
        jl    t43
        ret

getpid: mov   eax,20               ; the same two instructions the C library uses
        int   $80
        ret
```
And just to make sure, I checked both with objdump:
```
080484b0 <getpid>:
80484b0: b8 14 00 00 00 mov $0x14,%eax
80484b5: cd 80 int $0x80
80484b7: c3 ret
...
0804e420 <__getpid>:
804e420: b8 14 00 00 00 mov $0x14,%eax
804e425: cd 80 int $0x80
804e427: c3 ret
```
So, other than where the routine sits in memory (and the name, but the name is just a symbol table entry, not something that gets executed), they're identical. So why does my version of the getpid() function take 11 minutes to run? Something odd is going on somewhere.
I then wrote a version of getpid() that caches the return value from the first call (and this time, I made it callable by C functions):
```
                global  my_getpid

my_getpid:      mov     eax,[pid]       ; cached PID, 0 until the first call
                or      eax,eax
                jz      .10
                ret
.10:            mov     eax,20          ; getpid system call
                int     $80
                mov     [pid],eax       ; cache it for next time
                ret

                section .bss
pid:            resd    1
```
I ran two tests: one calling this routine from C, the other calling it from another assembly routine. Both take around 10 seconds to run, which is just a shade slower than calling a null routine.
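The C side of the first of those tests is nothing special; roughly this (a sketch, where the function name t5 is made up and gv is the same global counter as before):
>
```
extern int my_getpid(void);   /* the assembly routine above */
extern int gv;

void t5(void)
{
  while(gv < 1000000000)
  {
    my_getpid();
    gv++;
    my_getpid();
  }
}
```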
But calling __getpid() (the system-supplied C library routine that makes the system call) is faster than calling getpid() (the function I wrote to make the same system call), even though the two functions are identical.
Is there something else going on that I'm not aware of?
Weird …