💾 Archived View for gemini.ctrl-c.club › ~stack › gemlog › 2022-12-20.gcc_horrors.gmi captured on 2023-01-29 at 03:25:37. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

GCC code generation quirks

TLDR: scroll to the end for some spectacular examples...

I've been stuck upside down in a peculiar pit of hell, a strange combination of pure joy and absolute frustration. Some of you know I've succumbed to a monumental bucket list project, trying to harness and integrate a C compiler (gcc for now) into a Forth/Lisp/Smalltalk-like interactive environment, with function-level compilation and immediate linking, image, sources integration, and all the good stuff.

Now I am crazy, but not that crazy! I've been doing this kind of stuff my whole life, and am uniquely equipped for it... And I don't want a standards-compilent C environment -- I just want to compile functions and have them be instantly available in the image... And making changes one function at a time.

I am not really a C fan, but it's uniquely right for this. It's low level, linkage and abis are well established, etc. And equally absurdly awful for this - syntactically it's a nightmare. I kind of wish there was a low-level Lisp or something that did not require a hundred megabytes of mystery code.

I've actually been writing some assembly, and really, it is a polyglot environment -- as long as you comply with the ABI, which is not as bad as the Forth devil on my shoulder thinks.

About code generation

So with my head deeply inside the compiler's butt, I've been learning interesting things. Again, amazing and horrible things. I was truly horrified at the quality of code generated -- it seems like they want to make 100-megabyte applications normal. I object. Memory may be cheap, but caches are limited, and really why am I even arguing this.

So yes, after days of figuring out hundreds of pragmas and command-line switches (and just scratching the surface), I turned off much fluff around the code (I am not concerned about security right now). I just want tight code that can be ripped out and stuck inside my system.

First, the calling convention. Jesus. Stack parameter passing seems like a distant dream. We have many registers, and need complicated register allocators. GCC does an OK job, although there are a lot of WTF moments, especially when something compiles differently every time. Thanks, guys.

Register noise

The code generally looks like occasional function calls with an amazing amount of register shuffling, and eventually sticking registers on the stack anyway. I kind of suspect that a datastack, as much as it's maligned, is not so bad - it's pretty localized, and compared to the junk I've been looking at, is much more compact, even on x86-64.

If they spent as much effort making a stack machine with a really fast cache, I suspect it would beat the crap out of Intel and Arm...

Parameter Sequencing

I started out with absurdly large code until I realized that a lot of what I do is write layers of code that calls other code. And it turns out that if you spend a bit of time organizing parameters in a standard way, you can shrink your code by an order of magnitude. Consider a foo(a,b,c,d,e,f) calling bar(e,c,d,a,b). The compiler needs to move every register. But order the parameters in the same sequence, and two hundred bytes of code turns into a single 5-byte call! Well, the compiler will probably save f somewhere and restore it after, but it is an amazing how much smaller my code got.

High-level Coding.

A lot of what I do is basically iterating on the same data - symbols in packages, or relocatable references, and doing all kinds of different things. So there are a lot of stupid loops that look the same to get at every little piece of data.

I've been mostly a lisper for the last decade, so naturally, I want to map, make iterators, and pass lambdas around to functions that access data. Well, there are no lambdas in C. And it's even worse - every attempt to make an iterator that can take a function that operates on every piece of data resulted in really awful, slow, bizzare and enormous code. A simple tight loop searching symbol tables, for instance, can be like 20 bytes...

I've gotten good at making these, but then I wound up with like 30 of slightly different functions, that do something slightly weird at different depths, so factoring did not work. And due to the above parameter things, sometimes trying to factor a common function results in a dramatic increase in bloat...

Nested Functions in GCC

I was really excited, until again, I looked at the code. So you can create a function inside a function, and it is lexically scoped. Sounds good? Well, here is how GCC implemented it. The inside function is compiled separately, in the beginning of the object file, and the name is mangled (so you can use the same names - the scope is limited to inside the outer function. OK.

If you don't use lexical scope to access data in the outer functions, things are pretty good. There is pretty much no difference between regular functions and these hidden functions, except for the hidden part, which is sometimes nice. But, if you do access outer data, all hell breaks loose, and the outer function becomes stupidly enormous. The inner function somehow uses references via R10 and strange offsets, sometimes hundreds of bytes into the stack.

The outer function creates a page of dense code moving registers anround for no apparent reason and putting shit you never know existed onto the stack. It is absolutely incredible to see the garbage - I think they actually put random sequences and magic numbers dozens and dozens of bytes long to somehow mark up the stack in a way that makes this kind of access across function calls possible.

So I quickly gave up on the idea of 'named lambdas' to do specific tasks in different places. Bah.

Another Attempt

For a while I tried to just code up functions and pass them around. I had some limited success with it. But it is insanely tricky to get it right. I made some generator or iterator-like constructs that can invoke functins to actually operate on items... But in the end it was still unwieldy; and the stupid type system gets in the way. Ugh.

Assembler to the Rescue...

I am extremely good at assembly, and actually enjoy it. After a few days of looking at stupid disassembly I got really sick of C and was ready to do some low-level coding...

So here is an good example. I really need go through the symbol table - not a table, but a segment with lots of variable-size cross-linked data that has a tendency to move around. But basically, I have symbols in a linked list, and these lists (call them packages, why not), are grouped into a higher level list that connects package heads for global searching and such.

I am picking a particularly small example (after much struggling, factoring and decomposition)- just so you can see what I mean without looking at pages of code. And don't judge me - I've been rewriting everything over and over and most of this code is really just to see the lay of the land. The system is changing radically, and am actually amazed that it does not completely break whent I make absurdly enormous changes to dozens of subsystems simultaneously...

EXHIBIT 1

// The goal is to scan packages for a symbol with a matching 32-bit hash,
// but return the previous symbol, one that links to the one that matched,
// so I can unlink the symbol of interest.  Note that this function scans
// the package heads, and dispatches to similar functions that actually scan
// the chain of symbols inside each package...  Simple, really.

sSym* pks_find_prev_hash(sSym*pk, U32 hash){
  sSym* ret=0;
  do {
    if((ret = pk_find_prev_hash(pk,hash))) //dispatch to search inside the package
      break;
  } while ((pk = PTR(sSym*,pk->art)));   // get next package
  return ret;
}
And here is what it compiles to (when I started it was many times larger...)


  dc:	55                   	push   %rbp
  dd:	89 f5                	mov    %esi,%ebp
  df:	53                   	push   %rbx
  e0:	48 89 fb             	mov    %rdi,%rbx
  e3:	51                   	push   %rcx
  e4:	89 ee                	mov    %ebp,%esi
  e6:	48 89 df             	mov    %rbx,%rdi
  e9:	e8 00 00 00 00       	call   ee <pks_find_prev_hash+0x12>
  ee:	48 85 c0             	test   %rax,%rax
  f1:	75 07                	jne    fa <pks_find_prev_hash+0x1e>
  f3:	8b 5b 08             	mov    0x8(%rbx),%ebx
  f6:	85 db                	test   %ebx,%ebx
  f8:	75 ea                	jne    e4 <pks_find_prev_hash+0x8>
  fa:	5a                   	pop    %rdx
  fb:	5b                   	pop    %rbx
  fc:	5d                   	pop    %rbp
  fd:	c3                   	ret    

See what I mean?  And this is really clean, compared to what I had to deal with...

Assembler iterators

So while the code above is not too terrible (although it does very little), the problem is that I have many, many versions of similar or worse code, because decoupling iteration made everyting several times larger and stupider. So I coded up a tiny assembler version of the above, which is GENERIC - it can call different function to do different things

;;; rdi = ptr - a package pointer
;;; rsi = parameter to pass around
;;; rdx = function to executee on every symbol

            <pkgs_walk>:
  31:	97                   	xchg   %eax,%edi
  32:	97       <.loop>:      	xchg   %eax,%edi     ; move package head to first parameter
  33:	56                   	push   %rsi
  34:	52                   	push   %rdx
  35:	57                   	push   %rdi
  36:	e8 da ff ff ff       	call   15 <pkg_walk>  ; invoke a similar iterator to search thisp 
  3b:	5f                   	pop    %rdi           ; package...
  3c:	5a                   	pop    %rdx
  3d:	5e                   	pop    %rsi
  3e:	48 85 c0             	test   %rax,%rax       ; test result
  41:	75 06                	jne    49 <.done>
  43:	67 03 47 08          	add    0x8(%edi),%eax  ; fetch next package, set Z on fail
  47:	75 e9                	jne    32 <.loop>
  49:	c3        <.done>:     	ret    


From C it looks like this:

  sSym* pkgs_walk(sSym* pk, U64 param, funcptr proc);

So I can call it with a function which will execute on every symbol.

Now, that is much cleaner and smaller. But it is also completely generic - I will never write another loop again! I can just call it with a function and an optional parameter...

So now I can replace all the loops with a three-parameter call to this. Watch this:

This does exactly what the original EXHIBIT 1 did!

//------------------------------------------------------------------

sSym* pks_find_prev_hash(sSym* pk, U32 hash){

  return pkgs_walk( pk, hash, &pk_find_prev_hash_proc );
}
//------------------------------------------------------------------

And here is what it compiles to, by same gcc:

 <pks_find_prev_hash>:
 
  fe:	48 8d 15 00 00 00 00 	lea    0x0(%rip),%rdx        # 105 <pks_find_prev_hash+0x7>
 105:	e9 00 00 00 00       	jmp    10a <pkgs_walk>

No shit! That's what I was talking about earlier - modularity and parameter organization makes code disappear. This function is called with the initial package and hash to look for, and all it does is add one more parameter to pass to the generic walker - the function that is called on each symbol...

Oh, it looks like I am cheating by not including that function - but that is because it is generic - it is used in several places. Oh, why not, here it is:

//------------------------------------------------------------------

sSym* pk_find_prev_hash_proc(sSym* s, U32 hash, sSym* prev){

    return (s->hash == hash) ? prev : 0;
}
//------------------------------------------------------------------

So this function is called by the iterator on every symbol. Note that the generic walker does all the heavy lifting (and it is radically reusable) -- it actually provides a previous symbol, so all the function does is compare the hash of the visited symbol with the provided one, and when found, returns the previous symbol - or 0, which continues the iteration.

Want to see the code gcc came up with?

<pk_find_prev_hash_proc>:

   0:	89 d0                	mov    %edx,%eax
   2:	31 d2                	xor    %edx,%edx
   4:	39 37                	cmp    %esi,(%rdi)
   6:	0f 45 c2             	cmovne %edx,%eax
   9:	c3                   	ret

Nicely branch free, too! And generic and reusable.

OK, that is enough for now, don't you think?

I will work on this:

00000000000002b5 <pk_rebind>:
 2b5:	41 56                	push   %r14
 2b7:	41 55                	push   %r13
 2b9:	4c 8d 2d 00 00 00 00 	lea    0x0(%rip),%r13        # 2c0 <pk_rebind+0xb>
 2c0:	41 54                	push   %r12
 2c2:	55                   	push   %rbp
 2c3:	53                   	push   %rbx
 2c4:	48 89 fb             	mov    %rdi,%rbx
 2c7:	e8 00 00 00 00       	call   2cc <pk_rebind+0x17>
 2cc:	be 02 00 00 00       	mov    $0x2,%esi
 2d1:	48 89 c7             	mov    %rax,%rdi
 2d4:	49 89 c4             	mov    %rax,%r12
 2d7:	e8 00 00 00 00       	call   2dc <pk_rebind+0x27>
 2dc:	48 89 c5             	mov    %rax,%rbp
 2df:	48 85 c0             	test   %rax,%rax
 2e2:	75 56                	jne    33a <pk_rebind+0x85>
 2e4:	48 8b 3d 00 00 00 00 	mov    0x0(%rip),%rdi        # 2eb <pk_rebind+0x36>
 2eb:	4c 89 e2             	mov    %r12,%rdx
 2ee:	48 8d 35 00 00 00 00 	lea    0x0(%rip),%rsi        # 2f5 <pk_rebind+0x40>
 2f5:	31 c0                	xor    %eax,%eax
 2f7:	e8 00 00 00 00       	call   2fc <pk_rebind+0x47>
 2fc:	bf 01 00 00 00       	mov    $0x1,%edi
 301:	e8 00 00 00 00       	call   306 <pk_rebind+0x51>
 306:	4c 8d 73 16          	lea    0x16(%rbx),%r14
 30a:	48 89 ef             	mov    %rbp,%rdi
 30d:	4c 89 f6             	mov    %r14,%rsi
 310:	e8 00 00 00 00       	call   315 <pk_rebind+0x60>
 315:	49 89 c4             	mov    %rax,%r12
 318:	48 85 c0             	test   %rax,%rax
 31b:	75 0d                	jne    32a <pk_rebind+0x75>
 31d:	4c 89 f6             	mov    %r14,%rsi
 320:	4c 89 ef             	mov    %r13,%rdi
 323:	31 c0                	xor    %eax,%eax
 325:	e8 00 00 00 00       	call   32a <pk_rebind+0x75>
 32a:	8b 43 08             	mov    0x8(%rbx),%eax
 32d:	48 89 ef             	mov    %rbp,%rdi
 330:	67 4c 89 60 02       	mov    %r12,0x2(%eax)
 335:	e8 00 00 00 00       	call   33a <pk_rebind+0x85>
 33a:	8b 5b 04             	mov    0x4(%rbx),%ebx
 33d:	85 db                	test   %ebx,%ebx
 33f:	75 c5                	jne    306 <pk_rebind+0x51>
 341:	5b                   	pop    %rbx
 342:	5d                   	pop    %rbp
 343:	41 5c                	pop    %r12
 345:	41 5d                	pop    %r13
 347:	41 5e                	pop    %r14
 349:	c3                   	ret

index

home