Mark brought up (in email) an interesting optimization technique using GCC 3:
I came across an interesting optimization that is GCC-specific but quite clever.
In lots of places in the Linux kernel you will see something like:
> ```
> p = get_some_object();
> if (unlikely(p == NULL))
> {
>   kill_random_process();
>   return (ESOMETHING);
> }
>
> do_stuff(p);
>
> ```
>
The conditional is clearly an error path and as such is rarely taken. unlikely() is actually a macro, defined like this:
> ```
> #define unlikely(b) __builtin_expect(b, 0)
>
> ```
>
On newer versions of GCC this tells the compiler to expect the condition to be false, meaning the branch is rarely taken. You could also tell the compiler that a branch is likely to be taken:
> ```
> #define likely(b) __builtin_expect(b, 1)
>
> ```
>
So how does this help GCC anyhow? Well, on some architectures (PowerPC, for example) there is actually a bit in the branch instruction that tells the CPU's speculative execution unit whether the branch is likely to be taken. On other architectures, GCC uses the hint (with -freorder-blocks) to lay the code out so the “fast path” is branch-free.
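To make this concrete, here's a minimal, self-contained sketch of how the macros might be used outside the kernel. The version check and the malloc() example are my own invention for illustration; older GCCs lack __builtin_expect(), so the macros degrade to no-ops there:

```
/* sketch: branch hints via __builtin_expect()
   compile with something like:  gcc -O2 hint.c -o hint
   (-O2 should pull in -freorder-blocks on GCC 3, but check your
   version's docs) */

#include <stdio.h>
#include <stdlib.h>

#if defined(__GNUC__) && (__GNUC__ >= 3)
  /* the "!= 0" forces the expression to 0 or 1, since
     __builtin_expect() compares against the expected value exactly */
#  define likely(b)   __builtin_expect((b) != 0,1)
#  define unlikely(b) __builtin_expect((b) != 0,0)
#else
#  define likely(b)   (b)
#  define unlikely(b) (b)
#endif

int main(void)
{
  char *p;

  p = malloc(100);
  if (unlikely(p == NULL))
  {
    /* the cold error path; GCC can move this block out of the
       straight-line code */
    fprintf(stderr,"out of memory\n");
    return EXIT_FAILURE;
  }

  /* the hot path falls straight through, with no taken branch */
  p[0] = '\0';
  free(p);
  return EXIT_SUCCESS;
}
```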
I was curious to see if this would actually help any, so I found a machine that had GCC 3 installed (swift), compiled a version of mod_blog [1] with profiling information, ran it, found a function that looked like a good candidate to speed up, added some calls to __builtin_expect(), reran the code and got a rather encouraging result.
I then reran the code, and got a completely different result.
In fact, each time I run the code, the profiling information I get is nearly useless—well, to a degree. For instance one run:
```
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
100.00      0.01     0.01   119529     0.00     0.00  line_ioreq
  0.00      0.01     0.00   141779     0.00     0.00  BufferIOCtl
  0.00      0.01     0.00    60991     0.00     0.00  line_readchar
  0.00      0.01     0.00    59747     0.00     0.00  ht_readchar
```
Then another run:
```
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 33.33      0.01     0.01   119529     0.00     0.00  line_ioreq
 33.33      0.02     0.01    60991     0.00     0.00  line_readchar
 33.33      0.03     0.01    21200     0.00     0.00  ufh_write
  0.00      0.03     0.00   141779     0.00     0.00  BufferIOCtl
```
Yet another run:
```
Each sample counts as 0.01 seconds.
 no time accumulated

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
  0.00      0.00     0.00   141779     0.00     0.00  BufferIOCtl
  0.00      0.00     0.00   119529     0.00     0.00  line_ioreq
  0.00      0.00     0.00    60991     0.00     0.00  line_readchar
  0.00      0.00     0.00    59747     0.00     0.00  ht_readchar
```
And still another one:
```
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 50.00      0.01     0.01    60991     0.00     0.00  line_readchar
 50.00      0.02     0.01     1990     0.01     0.01  HtmlParseNext
  0.00      0.02     0.00   141779     0.00     0.00  BufferIOCtl
  0.00      0.02     0.00   119529     0.00     0.00  line_ioreq
```
Like I said, nearly useless. Sure, the usual suspects, like BufferIOCtl() and line_ioreq(), show up, but it's impossible to say what improvement, if any, I'm getting from this. Look at the numbers: each sample covers 0.01 seconds, and an entire run accumulates only two or three samples (sometimes none at all), so which functions get charged is largely a matter of luck. And by today's standards, swift isn't a fast machine, being only (only!) a 1.3GHz (gigaHertz) Pentium III with half a gig of RAM (Random Access Memory). I can only imagine how hopeless profiling would be on an even faster machine.
I have to wonder what the Linux guys are smoking to think that, in the grand scheme of things, __builtin_expect() will improve things all that much.
Unless they have access to better profiling tools than I do.
Looks like I might have to find a slower machine to get a better feel for how to improve the speed of the program.
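Or, instead of a slower machine, I could make each run do enough work that the 0.01 second sample resolution stops mattering. Here's a minimal sketch of the idea using wall-clock timing instead of gprof; run_fast_path() is a hypothetical stand-in, not actual mod_blog code:

```
/* sketch: time the code under test directly over many iterations,
   so the measurement dwarfs the clock's resolution */

#include <stdio.h>
#include <time.h>

#define ITERATIONS 1000000L

static volatile long sink;  /* keeps the compiler from removing the loop */

static void run_fast_path(void)
{
  sink++;  /* stand-in for the function being tuned */
}

int main(void)
{
  clock_t start;
  clock_t stop;
  long    i;

  start = clock();
  for (i = 0 ; i < ITERATIONS ; i++)
    run_fast_path();
  stop = clock();

  printf("%.6f seconds for %ld iterations\n",
         (double)(stop - start) / CLOCKS_PER_SEC,
         ITERATIONS);
  return 0;
}
```

Running it once with the __builtin_expect() calls and once without would at least give numbers big enough to compare, although a microbenchmark like this still may not reflect how the branches behave in the full program.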