💾 Archived View for stack.tilde.cafe › gemlog › 2022-12-20.gcc_horrors.gmi captured on 2024-07-08 at 23:39:43. Gemini links have been rewritten to link to archived content
⬅️ Previous capture (2023-09-08)
-=-=-=-=-=-=-
I've been stuck upside down in a peculiar pit of hell, a strange combination of pure joy and absolute frustration. Some of you know I've succumbed to a monumental bucket list project, trying to harness and integrate a C compiler (gcc for now) into a Forth/Lisp/Smalltalk-like interactive environment, with function-level compilation and immediate linking, image, sources integration, and all the good stuff.
Now I am crazy, but not that crazy! I've been doing this kind of stuff my whole life, and am uniquely equipped for it... And I don't want a standards-compilent C environment -- I just want to compile functions and have them be instantly available in the image... And making changes one function at a time.
I am not really a C fan, but it's uniquely right for this. It's low level, linkage and abis are well established, etc. And equally absurdly awful for this - syntactically it's a nightmare. I kind of wish there was a low-level Lisp or something that did not require a hundred megabytes of mystery code.
I've actually been writing some assembly, and really, it is a polyglot environment -- as long as you comply with the ABI, which is not as bad as the Forth devil on my shoulder thinks.
So with my head deeply inside the compiler's butt, I've been learning interesting things. Again, amazing and horrible things. I was truly horrified at the quality of code generated -- it seems like they want to make 100-megabyte applications normal. I object. Memory may be cheap, but caches are limited, and really why am I even arguing this.
So yes, after days of figuring out hundreds of pragmas and command-line switches (and just scratching the surface), I turned off much fluff around the code (I am not concerned about security right now). I just want tight code that can be ripped out and stuck inside my system.
First, the calling convention. Jesus. Stack parameter passing seems like a distant dream. We have many registers, and need complicated register allocators. GCC does an OK job, although there are a lot of WTF moments, especially when something compiles differently every time. Thanks, guys.
The code generally looks like occasional function calls with an amazing amount of register shuffling, and eventually sticking registers on the stack anyway. I kind of suspect that a datastack, as much as it's maligned, is not so bad - it's pretty localized, and compared to the junk I've been looking at, is much more compact, even on x86-64.
If they spent as much effort making a stack machine with a really fast cache, I suspect it would beat the crap out of Intel and Arm...
I started out with absurdly large code until I realized that a lot of what I do is write layers of code that calls other code. And it turns out that if you spend a bit of time organizing parameters in a standard way, you can shrink your code by an order of magnitude. Consider a foo(a,b,c,d,e,f) calling bar(e,c,d,a,b). The compiler needs to move every register. But order the parameters in the same sequence, and two hundred bytes of code turns into a single 5-byte call! Well, the compiler will probably save f somewhere and restore it after, but it is an amazing how much smaller my code got.
A lot of what I do is basically iterating on the same data - symbols in packages, or relocatable references, and doing all kinds of different things. So there are a lot of stupid loops that look the same to get at every little piece of data.
I've been mostly a lisper for the last decade, so naturally, I want to map, make iterators, and pass lambdas around to functions that access data. Well, there are no lambdas in C. And it's even worse - every attempt to make an iterator that can take a function that operates on every piece of data resulted in really awful, slow, bizzare and enormous code. A simple tight loop searching symbol tables, for instance, can be like 20 bytes...
I've gotten good at making these, but then I wound up with like 30 of slightly different functions, that do something slightly weird at different depths, so factoring did not work. And due to the above parameter things, sometimes trying to factor a common function results in a dramatic increase in bloat...
I was really excited, until again, I looked at the code. So you can create a function inside a function, and it is lexically scoped. Sounds good? Well, here is how GCC implemented it. The inside function is compiled separately, in the beginning of the object file, and the name is mangled (so you can use the same names - the scope is limited to inside the outer function. OK.
If you don't use lexical scope to access data in the outer functions, things are pretty good. There is pretty much no difference between regular functions and these hidden functions, except for the hidden part, which is sometimes nice. But, if you do access outer data, all hell breaks loose, and the outer function becomes stupidly enormous. The inner function somehow uses references via R10 and strange offsets, sometimes hundreds of bytes into the stack.
The outer function creates a page of dense code moving registers anround for no apparent reason and putting shit you never know existed onto the stack. It is absolutely incredible to see the garbage - I think they actually put random sequences and magic numbers dozens and dozens of bytes long to somehow mark up the stack in a way that makes this kind of access across function calls possible.
So I quickly gave up on the idea of 'named lambdas' to do specific tasks in different places. Bah.
For a while I tried to just code up functions and pass them around. I had some limited success with it. But it is insanely tricky to get it right. I made some generator or iterator-like constructs that can invoke functins to actually operate on items... But in the end it was still unwieldy; and the stupid type system gets in the way. Ugh.
I am extremely good at assembly, and actually enjoy it. After a few days of looking at stupid disassembly I got really sick of C and was ready to do some low-level coding...
So here is an good example. I really need go through the symbol table - not a table, but a segment with lots of variable-size cross-linked data that has a tendency to move around. But basically, I have symbols in a linked list, and these lists (call them packages, why not), are grouped into a higher level list that connects package heads for global searching and such.
I am picking a particularly small example (after much struggling, factoring and decomposition)- just so you can see what I mean without looking at pages of code. And don't judge me - I've been rewriting everything over and over and most of this code is really just to see the lay of the land. The system is changing radically, and am actually amazed that it does not completely break whent I make absurdly enormous changes to dozens of subsystems simultaneously...
EXHIBIT 1 // The goal is to scan packages for a symbol with a matching 32-bit hash, // but return the previous symbol, one that links to the one that matched, // so I can unlink the symbol of interest. Note that this function scans // the package heads, and dispatches to similar functions that actually scan // the chain of symbols inside each package... Simple, really. sSym* pks_find_prev_hash(sSym*pk, U32 hash){ sSym* ret=0; do { if((ret = pk_find_prev_hash(pk,hash))) //dispatch to search inside the package break; } while ((pk = PTR(sSym*,pk->art))); // get next package return ret; } And here is what it compiles to (when I started it was many times larger...) dc: 55 push %rbp dd: 89 f5 mov %esi,%ebp df: 53 push %rbx e0: 48 89 fb mov %rdi,%rbx e3: 51 push %rcx e4: 89 ee mov %ebp,%esi e6: 48 89 df mov %rbx,%rdi e9: e8 00 00 00 00 call ee <pks_find_prev_hash+0x12> ee: 48 85 c0 test %rax,%rax f1: 75 07 jne fa <pks_find_prev_hash+0x1e> f3: 8b 5b 08 mov 0x8(%rbx),%ebx f6: 85 db test %ebx,%ebx f8: 75 ea jne e4 <pks_find_prev_hash+0x8> fa: 5a pop %rdx fb: 5b pop %rbx fc: 5d pop %rbp fd: c3 ret See what I mean? And this is really clean, compared to what I had to deal with...
So while the code above is not too terrible (although it does very little), the problem is that I have many, many versions of similar or worse code, because decoupling iteration made everyting several times larger and stupider. So I coded up a tiny assembler version of the above, which is GENERIC - it can call different function to do different things
;;; rdi = ptr - a package pointer ;;; rsi = parameter to pass around ;;; rdx = function to executee on every symbol <pkgs_walk>: 31: 97 xchg %eax,%edi 32: 97 <.loop>: xchg %eax,%edi ; move package head to first parameter 33: 56 push %rsi 34: 52 push %rdx 35: 57 push %rdi 36: e8 da ff ff ff call 15 <pkg_walk> ; invoke a similar iterator to search thisp 3b: 5f pop %rdi ; package... 3c: 5a pop %rdx 3d: 5e pop %rsi 3e: 48 85 c0 test %rax,%rax ; test result 41: 75 06 jne 49 <.done> 43: 67 03 47 08 add 0x8(%edi),%eax ; fetch next package, set Z on fail 47: 75 e9 jne 32 <.loop> 49: c3 <.done>: ret From C it looks like this: sSym* pkgs_walk(sSym* pk, U64 param, funcptr proc); So I can call it with a function which will execute on every symbol.
Now, that is much cleaner and smaller. But it is also completely generic - I will never write another loop again! I can just call it with a function and an optional parameter...
So now I can replace all the loops with a three-parameter call to this. Watch this:
This does exactly what the original EXHIBIT 1 did! //------------------------------------------------------------------ sSym* pks_find_prev_hash(sSym* pk, U32 hash){ return pkgs_walk( pk, hash, &pk_find_prev_hash_proc ); } //------------------------------------------------------------------ And here is what it compiles to, by same gcc: <pks_find_prev_hash>: fe: 48 8d 15 00 00 00 00 lea 0x0(%rip),%rdx # 105 <pks_find_prev_hash+0x7> 105: e9 00 00 00 00 jmp 10a <pkgs_walk>
No shit! That's what I was talking about earlier - modularity and parameter organization makes code disappear. This function is called with the initial package and hash to look for, and all it does is add one more parameter to pass to the generic walker - the function that is called on each symbol...
Oh, it looks like I am cheating by not including that function - but that is because it is generic - it is used in several places. Oh, why not, here it is:
//------------------------------------------------------------------ sSym* pk_find_prev_hash_proc(sSym* s, U32 hash, sSym* prev){ return (s->hash == hash) ? prev : 0; } //------------------------------------------------------------------
So this function is called by the iterator on every symbol. Note that the generic walker does all the heavy lifting (and it is radically reusable) -- it actually provides a previous symbol, so all the function does is compare the hash of the visited symbol with the provided one, and when found, returns the previous symbol - or 0, which continues the iteration.
Want to see the code gcc came up with?
<pk_find_prev_hash_proc>: 0: 89 d0 mov %edx,%eax 2: 31 d2 xor %edx,%edx 4: 39 37 cmp %esi,(%rdi) 6: 0f 45 c2 cmovne %edx,%eax 9: c3 ret
Nicely branch free, too! And generic and reusable.
OK, that is enough for now, don't you think?
I will work on this:
00000000000002b5 <pk_rebind>: 2b5: 41 56 push %r14 2b7: 41 55 push %r13 2b9: 4c 8d 2d 00 00 00 00 lea 0x0(%rip),%r13 # 2c0 <pk_rebind+0xb> 2c0: 41 54 push %r12 2c2: 55 push %rbp 2c3: 53 push %rbx 2c4: 48 89 fb mov %rdi,%rbx 2c7: e8 00 00 00 00 call 2cc <pk_rebind+0x17> 2cc: be 02 00 00 00 mov $0x2,%esi 2d1: 48 89 c7 mov %rax,%rdi 2d4: 49 89 c4 mov %rax,%r12 2d7: e8 00 00 00 00 call 2dc <pk_rebind+0x27> 2dc: 48 89 c5 mov %rax,%rbp 2df: 48 85 c0 test %rax,%rax 2e2: 75 56 jne 33a <pk_rebind+0x85> 2e4: 48 8b 3d 00 00 00 00 mov 0x0(%rip),%rdi # 2eb <pk_rebind+0x36> 2eb: 4c 89 e2 mov %r12,%rdx 2ee: 48 8d 35 00 00 00 00 lea 0x0(%rip),%rsi # 2f5 <pk_rebind+0x40> 2f5: 31 c0 xor %eax,%eax 2f7: e8 00 00 00 00 call 2fc <pk_rebind+0x47> 2fc: bf 01 00 00 00 mov $0x1,%edi 301: e8 00 00 00 00 call 306 <pk_rebind+0x51> 306: 4c 8d 73 16 lea 0x16(%rbx),%r14 30a: 48 89 ef mov %rbp,%rdi 30d: 4c 89 f6 mov %r14,%rsi 310: e8 00 00 00 00 call 315 <pk_rebind+0x60> 315: 49 89 c4 mov %rax,%r12 318: 48 85 c0 test %rax,%rax 31b: 75 0d jne 32a <pk_rebind+0x75> 31d: 4c 89 f6 mov %r14,%rsi 320: 4c 89 ef mov %r13,%rdi 323: 31 c0 xor %eax,%eax 325: e8 00 00 00 00 call 32a <pk_rebind+0x75> 32a: 8b 43 08 mov 0x8(%rbx),%eax 32d: 48 89 ef mov %rbp,%rdi 330: 67 4c 89 60 02 mov %r12,0x2(%eax) 335: e8 00 00 00 00 call 33a <pk_rebind+0x85> 33a: 8b 5b 04 mov 0x4(%rbx),%ebx 33d: 85 db test %ebx,%ebx 33f: 75 c5 jne 306 <pk_rebind+0x51> 341: 5b pop %rbx 342: 5d pop %rbp 343: 41 5c pop %r12 345: 41 5d pop %r13 347: 41 5e pop %r14 349: c3 ret