My take on it is that I like most of the design choices they've made (I've been tinkering on a similar dynamic runtime before).
With a dynamic language like Erlang (as with JS, Lua, etc.) the main benefits won't come from the better register allocation, DCE, etc. that are the primary reason one would pick LLVM, but rather from getting rid of the interpreter overhead (and, later, the opportunity to do some higher-level type-dependent optimizations that will have a higher impact than register allocation and instruction selection).
That type-dependent stuff is why LLVM is ill-suited IMHO, and why you only see LLVM mentioned as the highest optimization tier by, for example, JavaScriptCore (Safari), when the runtime hasn't deoptimized the code and is pretty certain of the types that will be used.
Also, they mentioned the complexity of maintaining interpreter/JIT jumps, and I'm not surprised, since I remember reading a paper about one of their earlier attempts where they maintained dual stacks with quite complex cross-boundary jumps.
Some might mention that V8 moved away from the always-JIT paradigm to save on memory and startup time, but since Erlang is mostly used on servers I think they might see this as a good compromise.
> that will have a higher impact than register allocation and instruction selection
The most powerful JIT compiler that I have worked with - Graal - generates code that often, on the face of it, seems puzzlingly inefficient in terms of registers allocated and instructions selected. Turns out maybe it's another thing that's not as important as all the textbooks say?
The important bit is removing object allocation, removing boxing, deep inlining, and specialisation... when you've done all that work the exact registers and instructions don't seem to really make a huge difference.
Even with inlining and escape analysis to remove unnecessary boxing, unboxing, and heap allocations, most Java programs still do a lot of pointer chasing. All of that (often poor locality) memory traffic tends to hide the overhead of unnecessary register moves and even some unnecessary register spills.
How fast is the code generated by Graal?
In languages like Java, without value types and with virtual methods everywhere, interprocedural analysis is very important. Even otherwise, I believe interprocedural optimizations are more important than small register fiddling or cute integer arithmetic tricks. But when you reach a dead end in optimized code, all these things matter.
Sure, it may not matter for Erlang / Elixir on the server.
Could that be due to the cleverness of modern heavyweight CPUs, with techniques like register renaming? Would things change if you used less sophisticated processors?
This is the explanation I get when I dig into these things, yes.
For example, I'll see what seem to my less-experienced eyes to be completely redundant 'mov' instructions shifting things around. I'll ask 'should we fix this?' and the response is usually 'doesn't really matter, it's basically free'.
Interesting, so (to a rough approximation) the CPU is smart enough to boil off many 'shallow' inefficiencies. Doesn't cache behaviour still matter though? I'd have thought code-compactness would still be worth carefully optimising for.
> Doesn't cache behaviour still matter though? I'd have thought code-compactness would still be worth carefully optimising for.
Well that's partly why I'm still surprised, but I think for a lot of dynamic languages there's such a huge volume of code anyway... that these little 'mov's are pretty insignificant.
Removing them would have a downside: extra compilation time.
Yes and no.
Most instructions have a 3-20 cycle latency. A lot of what CPUs do in many programs is wait for an instruction to complete so they can use its result to chase some pointer - and in the meantime, find an instruction that doesn't depend on it to execute. So many "simple" moves and register renames are practically free, as they happen in these waiting-for-result slots.
HPC code is most definitely not like that - but it's only found in specialized parts of software (game rendering paths, maths stuff, some NumPy code, etc.). But most code spends most of its time with the CPU waiting.
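A toy experiment makes this concrete (my own sketch; the exact numbers will vary with the CPU and compiler flags): one loop only chases a randomly permuted index chain, so every load depends on the previous one; a second loop does the same chase but also performs extra, independent arithmetic each iteration. On an out-of-order core the extra work mostly hides in the gaps while the next load is in flight, so the second loop usually isn't much slower than the first.

```cpp
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

int main() {
  // Build one big cycle over N slots (Sattolo's algorithm), so that
  // i = next[i] is a long chain of loads where each load depends on the last.
  const size_t N = 1 << 23;
  std::vector<size_t> next(N);
  std::iota(next.begin(), next.end(), size_t{0});
  std::mt19937_64 rng{42};
  for (size_t i = N - 1; i > 0; --i) {
    std::uniform_int_distribution<size_t> d(0, i - 1);
    std::swap(next[i], next[d(rng)]);
  }

  auto chase = [&](bool extraWork) {
    size_t i = 0, junk = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t step = 0; step < N; ++step) {
      i = next[i];                          // dependent load: the real bottleneck
      if (extraWork) junk += step * 3 + 1;  // independent ALU work per iteration
    }
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - t0).count();
    std::printf("extraWork=%d: %lld ms (i=%zu, junk=%zu)\n",
                (int)extraWork, (long long)ms, i, junk);
  };

  chase(false);  // pure pointer chase
  chase(true);   // pointer chase plus the "free" extra instructions
}
```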
Are there any papers or articles about the mentioned findings?
About what findings, sorry?
Linear Scan is an example I reach for when I talk about what parts of the compiler are really important, if that's what you mean.
http://web.cs.ucla.edu/~palsberg/course/cs132/linearscan.pdf
> The linear scan algorithm is considerably faster than algorithms based on graph coloring, is simple to implement, and results in code that is almost as efficient as that obtained using more complex and time-consuming register allocators based on graph coloring.
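To give a flavour of just how simple it is, here's a compact sketch of the Poletto-Sarkar algorithm (a toy version with made-up data structures, not any particular compiler's code):

```cpp
#include <algorithm>
#include <cstdio>
#include <iterator>
#include <set>
#include <vector>

// One live interval per virtual register: [start, end] in instruction numbers.
struct Interval { int start, end, vreg; int reg = -1; bool spilled = false; };

void linearScan(std::vector<Interval>& ivs, int numRegs) {
  std::sort(ivs.begin(), ivs.end(),
            [](const Interval& a, const Interval& b) { return a.start < b.start; });

  auto byEnd = [](Interval* a, Interval* b) { return a->end < b->end; };
  std::multiset<Interval*, decltype(byEnd)> active(byEnd);  // currently live, ordered by end point
  std::vector<int> freeRegs;
  for (int r = numRegs - 1; r >= 0; --r) freeRegs.push_back(r);

  for (Interval& cur : ivs) {
    // Expire intervals that ended before 'cur' starts; their registers free up.
    while (!active.empty() && (*active.begin())->end < cur.start) {
      freeRegs.push_back((*active.begin())->reg);
      active.erase(active.begin());
    }
    if (!freeRegs.empty()) {
      cur.reg = freeRegs.back();
      freeRegs.pop_back();
      active.insert(&cur);
    } else {
      // No register free: spill whichever of the live intervals (or 'cur' itself)
      // ends furthest in the future -- the heuristic from the paper.
      Interval* victim = *std::prev(active.end());
      if (victim->end > cur.end) {
        cur.reg = victim->reg;
        victim->reg = -1;
        victim->spilled = true;  // victim lives in a stack slot from here on
        active.erase(std::prev(active.end()));
        active.insert(&cur);
      } else {
        cur.spilled = true;
      }
    }
  }
}

int main() {
  std::vector<Interval> ivs = {{0, 9, 0}, {1, 4, 1}, {2, 7, 2}, {3, 12, 3}, {5, 6, 4}};
  linearScan(ivs, 2);  // pretend the machine has two registers
  for (const Interval& iv : ivs) {
    if (iv.spilled) std::printf("v%d [%d,%d] -> spilled\n", iv.vreg, iv.start, iv.end);
    else            std::printf("v%d [%d,%d] -> r%d\n", iv.vreg, iv.start, iv.end, iv.reg);
  }
}
```

A production allocator also has to deal with fixed registers, lifetime holes, and actually inserting spill code, but the core loop really is about this small.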
Also, things like escape analysis and inlining are often called 'the mothers of optimisation' because they fundamentally enable so many other optimisations. Not sure there's really a citation for it, but I doubt anyone would dispute it.
I wrote about the impact on optimising Ruby.
https://chrisseaton.com/truffleruby/pushing-pixels/
> About what findings, sorry?
These findings: _The important bit is removing object allocation, removing boxing, deep inlining, and specialisation... when you've done all that work the exact registers and instructions don't seem to really make a huge difference._
> Linear Scan is an example I reach for when I talk about what parts of the compiler are really important
Sure, but you just said that register allocation is not a relevant factor compared to other achievements when using Graal VM. Or did I get that wrong?
> These findings:
Right - didn't I cover those? I gave the example that Graal is extremely powerful, with a huge amount of money and effort behind it, but doesn't really bother at all with serious register allocation or clever instruction selection as the textbooks suggest you should. It uses the most basic algorithm and doesn't see any need to do more, even though they're still keen on tiny improvements to the output. It just doesn't seem to be worth it.
But it does put a lot of effort into object allocation, boxing, inlining, specialisation, etc. So those seem in practice to be worth it.
> But it does put a lot of effort into object allocation, boxing, inlining, specialisation, etc. So those seem in practice to be worth it.
Well, as I understand it, this is an assumption, not the result of a dedicated study. I made similar observations when using LuaJIT as a SOM backend (see
https://github.com/rochus-keller/Som
), but it's not clear why the Graal-based SOM implementations are that much faster.
> Well, as I understand it, this is an assumption, not the result of a dedicated study.
I'm not sure what you're angling for?
Some kind of falsifiable proof about where it makes sense to put engineering effort and complexity? You're not going to find that, sorry - nobody's ever been able to seriously quantify those kinds of things.
> I made similar observations
Well then why are we arguing about it if it's apparent to both of us?
> I'm not sure what you're angling for?
You assume that _The important bit is removing object allocation, removing boxing, deep inlining, and specialisation... when you've done all that work the exact registers and instructions don't seem to really make a huge difference._
But it would be very interesting to have something like a scientific study about why Graal is indeed faster than other approaches.
> Well then why are we arguing about it if it's apparent to both of us?
Because I would like to understand the _true_ reason to be able to improve my implementation (if feasible).
EDIT: you claim that textbooks about compiler design are wrong or out of date, so my desire to have someone change that seems understandable, doesn't it?
I don't think it's correct to say I'm just assuming.
Linear Scan produces a program with apparently less efficient register allocation. In practice, it does not matter for the wider performance of the code. Is this not evidence to support the assumption that sophisticated register allocation does not matter as much as we thought?
When you enable Graal's more sophisticated escape analysis algorithms you get real-world speed-ups in workloads such as Twitter's, worth hundreds of thousands of dollars in saved costs. Is this not evidence to support the assumption that sophisticated escape analysis algorithms do matter?
The first is a formal scientific study. The second is not, but it's still a very large-scale empirical result measured by an expert. They aren't falsifiable, but as I said I don't think that's a realistic expectation, and I think these are enough for it to be more than an assumption.
> Is this not evidence to support the assumption that sophisticated register allocation does not matter as much as we thought?
It's an indication but it doesn't sufficiently support the conclusion. There are so many other things to consider.
> Is this not evidence to support the assumption that sophisticated escape analysis algorithms do matter?
Would you, as a scientist, seriously accept this as sufficient evidence for your claims?
But let's leave it at that for the moment. As far as I know there are ongoing research projects which could deliver more insight into why specifically a Smalltalk VM runs faster on Graal than anywhere else.
> Would you, as a scientist, seriously accept this as sufficient evidence for your claims?
It was a comment on a web page, dude... I didn't claim it in a research paper for publication!
If we discourage others from casually sharing their observations, as you're doing, we'll miss opportunities to find things to form hypotheses from in the first place! Casual discussion in the community is part of science, something to be encouraged, and you're sadly missing something by dismissing it like this.
Ok, that sounds like a response to my initial question _Are there any papers or articles about the mentioned findings?_
Casually sharing observations is a good thing, and it's even better when there is some detailed information available that makes it possible to understand the propositions sufficiently well and to assess how certain the conclusions are.
> Some might mention that V8 moved away from the always-JIT
They did, but their always-JIT was a bit heavyweight. At first they generated full method-JIT code on first invocation. This was a massive improvement over the existing state of the art, but it was also relatively early in the whole development of the "fast JS" ecosystem we live in now.
The _size_ of JS grew, the size of functions grew, and the complexity of the logic grew. The amount of cold and run-once code grew.
That issue was brought into focus by the growth of the webapp space and the size of the payloads.
In this case (the Erlang folks), they're going about the whole thing in a very good way. Their "jit-everything" is actually "jit one instruction at a time", which is _amazingly_ perceptive of the challenges people on other JIT teams (e.g. the JS JIT teams) have faced. What they're doing is tightly scoped, easy to bootstrap and test, with fallbacks to calls into the VM for slow paths or complicated stuff.
That's a solid base you can slowly layer higher tiers onto later if it matters. They're keeping their abstraction layers strong by having a well-specified bytecode system, and hopefully they will strive to keep it relatively independent of the runtime and avoid leaking runtime semantics into instructions.
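To illustrate the shape of that approach (this is my own rough sketch with made-up instruction and helper names, not the actual OTP code): keep the VM's register file in memory, emit a short native sequence per bytecode instruction with asmjit, and call back into an ordinary C/C++ helper for anything complicated. It assumes the x86-64 SysV calling convention and a recent asmjit API.

```cpp
#include <asmjit/asmjit.h>
#include <cstddef>
#include <cstdint>
#include <cstdio>

using namespace asmjit;

// Imaginary VM state: a small register file kept in memory, much as BEAM does.
struct VmState { int64_t regs[16]; };

// Slow-path helper that the generated code falls back to for the complicated bits.
extern "C" void vm_print_reg(VmState* vm, int64_t idx) {
  std::printf("x%lld = %lld\n", (long long)idx, (long long)vm->regs[idx]);
}

int main() {
  JitRuntime rt;
  CodeHolder code;
  code.init(rt.environment());
  x86::Assembler a(&code);

  // Native body for two imaginary instructions, emitted one at a time.
  // The VmState* argument arrives in rdi (first SysV integer argument).
  a.sub(x86::rsp, 8);                                          // keep the stack 16-byte aligned for the call
  a.add(x86::qword_ptr(x86::rdi, offsetof(VmState, regs)), 1); // "increment x0": regs[0] += 1
  a.mov(x86::rsi, 0);                                          // second argument: register index 0
  a.mov(x86::rax, (uint64_t)(uintptr_t)&vm_print_reg);
  a.call(x86::rax);                                            // "print x0": fall back into the VM helper
  a.add(x86::rsp, 8);
  a.ret();

  void (*fn)(VmState*) = nullptr;
  if (rt.add(&fn, &code) != kErrorOk) return 1;

  VmState vm{};
  fn(&vm);   // should print "x0 = 1"
  rt.release(fn);
}
```

The real implementation is of course far more involved, but this is the basic "template per instruction, call into the VM for the hard parts" shape.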
I was personally very impressed by their description of their approach and motivation behind each decision.
> Erlang is mostly used on servers I think they might see this as a good compromise.
I'm not sure what the differences will be, but start-up time is still a very big deal for web servers.
For example I can start up a non-Dockerized Python / gunicorn app with 8,000 lines of code and 50+ package dependencies in about 400ms on a $5 a month DigitalOcean server without doing any type of optimizations.
For someone who just wants to deploy a project to 1 server that means there's only 400ms of down time when I restart my server.
If a runtime takes, let's say, 10 seconds to start up, that could be enough of a deterrent not to use it if you know you're going down the route of a single-server deploy.
> For someone who just wants to deploy a project to 1 server that means there's only 400ms of down time when I restart my server.
Erlang already gives you robust primitives to do things like blue-green deploys, and even graceful transfer of server state across version changes. If downtime between releases is something you care about, you should use those, and it's likely that your downtime can be in the microseconds range regardless of the VM startup latency.
I don't think hot reloads in Elixir are used that often for web apps, given how much complexity / how many state changes are involved. I'm basing that on replies I've seen from a lot of folks on the forums who say it's not really worth the trouble because of how much can go wrong and how involved it would be to do such a thing.
Which blue-green deploy primitives are you referring to, btw?
Not referring to hot reloads.
To do a simple blue-green deploy you just have to use the built-in node system to make a cluster and use :global to detect whether there are any down-version nodes. If the current node is up-version, it slays a down-version node, and your external restart mechanism (Erlang's heart, systemd, Kubernetes, an Amazon ELB, or whatever) brings the node back, but with the newer code.
Do you happen to have a working example of that posted somewhere? A blog post with code perhaps? I'm unable to find a single example of that because every search around this topic turns up things like hot reloading and a bunch of people saying not to use it.
I don't think we're talking "start time" so much as "warm up time"; if I understand these things correctly it's likely the VM would start almost immediately, but would take a little bit of time for hot code paths to JIT and become highly performant. I don't think that would be much of a concern in your example.
From my conversation with them on Elixir Mix I don't believe warm-up is the issue. In this case the JIT is simpler than that. But the BEAM VM isn't the quickest starter in the world.
Interesting. Here is also a podcast about the topic:
https://thinkingelixir.com/podcast-episodes/017-jit-compiler...
Had a brief look at asmjit; it seems it only supports x86 and x86_64 and is not really an abstraction (i.e. not a platform-independent IR). I will try to find out why they didn't use e.g. LLVM or sljit.
EDIT: according to this article (
https://www.erlang-solutions.com/blog/performance-testing-th...
) the speed-up factor from the JIT is about 1.3 to 2.3 (for comparison, the speed-up between the PUC Lua 5.1 interpreter and LuaJIT 2.0 is about a factor of 15 in geometric mean over all benchmarks, see
http://luajit.org/performance_x86.html
).
LLVM is much too heavyweight for a JIT. It's slow at generating code, and makes it difficult to implement common JIT optimizations such as inline caches, which rely on code-patching.
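For anyone unfamiliar with the term: an inline cache remembers, at a particular call site, the receiver type it saw last and the target that type resolved to, so the next call can skip the generic lookup; in a JIT the cache lives in, and is patched into, the emitted machine code. A toy illustration of the idea (simulated here with a patchable struct rather than real machine-code patching; all names are made up):

```cpp
#include <cstdint>
#include <cstdio>

struct Object { uint32_t type_tag; };
using Method = void (*)(Object*);

// Two "classes" that respond differently to the same message.
void greet_dog(Object*) { std::printf("woof\n"); }
void greet_cat(Object*) { std::printf("meow\n"); }

// The slow, generic dispatch (method table lookup, hashing, whatever).
Method lookup(uint32_t tag) { return tag == 1 ? greet_dog : greet_cat; }

// One inline cache per call site; in a JIT this state is embedded in the code.
struct InlineCache {
  uint32_t cached_tag = 0;          // 0 = empty
  Method   cached_target = nullptr;
};

void send_greet(InlineCache& ic, Object* obj) {
  if (obj->type_tag == ic.cached_tag) {  // fast path: monomorphic hit
    ic.cached_target(obj);
    return;
  }
  Method m = lookup(obj->type_tag);      // miss: do the full lookup...
  ic.cached_tag = obj->type_tag;         // ...then "patch" the call site
  ic.cached_target = m;
  m(obj);
}

int main() {
  InlineCache site;
  Object dog{1}, cat{2};
  send_greet(site, &dog);  // miss: patches the cache to greet_dog
  send_greet(site, &dog);  // hit: calls greet_dog directly
  send_greet(site, &cat);  // miss: re-patches to greet_cat
}
```

In a real JIT the cached tag and target are baked into the instruction stream at the call site and overwritten in place on a miss, which is exactly the kind of patching that's awkward to do through LLVM.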
Yes, LLVM is huge, but there are a couple of interesting alternatives such as sljit, see e.g.
https://github.com/wdv4758h/awesome-jit
Even LuaJIT can be used as a backend for languages other than Lua, see e.g.
https://github.com/rochus-keller/oberon/
.
Makes me wonder if Python implemented in Lua wouldn't just win over CPython.
If using LuaJIT it would be faster than CPython for sure, maybe even as fast as PyPy. Here are some measurement results comparing the C++ SOM interpreter with a LuaJIT bytecode compiler:
https://github.com/rochus-keller/Som#a-som-to-luajit-bytecod...
. The bytecode version of the SOM benchmark running on LuaJIT is about a factor of 24 faster than the C++-based interpreter. But the Graal implementation of SOM is even faster.
(Which is, of course, ironic given that the entire original goal of LLVM was only to build a JIT; it was too slow, but made an OK enough compiler backend ;P.)
> the entire original goal of LLVM was only to build a JIT
That's not exactly true from what I recall of speaking with Vikram and Chris. The idea was to build a compiler framework that could support profile-guided optimization (and other then-advanced JIT techniques) for idle-time or install-time optimization of C/C++ codes. Chris's original thesis doesn't even mention JIT compilers, except as comparison point for techniques, and instead focuses on things like link-time optimization. (Indeed, Chris offered LLVM to replace gcc's then-broken LTO infrastructure, but this was turned down.)
Clearly, Chris's thought process has to be correct to some real extent and mostly must be deferred to in the end; but I was a grad student interested in the Low-Level _Virtual Machine_ at the time, and I remember it was absolutely considered by everyone to be a runtime execution system with a JIT ;P, and what made it unique vs. "high-level" virtual machines was that it was going to allow unsafe code, trusted by default, but otherwise work like any other virtual machine, allowing for JIT-like benefits (profile-guided being one, as you mention)... I am totally willing to believe that everyone misunderstood the project at the time and that the well over a decade since has corrupted what memory I have of the discussions.
There is an aarch64 branch in AsmJit project that provides an experimental AArch64 backend. It's pretty complete.
Having a platform independent IR in AsmJit is not planned. AsmJit is more about control and ISA completeness. IR could of course be implemented on top of AsmJit, but it seems nobody really needed that so far - users that need complex IR can use LLVM and users that want something really simple can use SLJIT. AsmJit is just in a different category and it has a lot of useful features considering its size.
Also keep in mind that the Lua speed-up is interpreted vs compiled, and Erlang is already a compiled language that had a fairly performant intermediate bytecode.
But the bytecode is interpreted, isn't it? The PUC Lua VM is a bytecode interpreter as well.
Re: “it seems it only supports x86 and x86_64”
Probably because it uses DynASM, and I believe only those platforms are supported.
https://luajit.org/dynasm.html
> Probably because it uses DynASM
I had a look at
https://github.com/asmjit/asmjit/tree/master/src/asmjit
and didn't find any indication that DynASM is used. Anyway, DynASM and LuaJIT are available for many different architectures, not only x86 and x86_64.
This link has more detail
https://github.com/erlang/otp/pull/2745#issuecomment-6914821...
“
LLVM is much slower at generating code when compared to asmjit. LLVM can do a lot more, but it's main purpose is not to be a JIT compiler.
With asmjit we get full control over all the register allocation and can do a lot of simplifications when generating code.
On the downside we don't get any of LLVMs built-in optimizations.
We also considered using dynasm, but found the tooling that comes with asmjit to be better.
”
This quote from your reference gives the answer: _We also considered using dynasm, but found the tooling that comes with asmjit to be better._
EDIT: But it still gives no answer as to why they preferred asmjit over solutions other than LLVM or DynASM (where the latter is just an assembler which helps integrate with C code, not a platform abstraction like LLVM or others).
With this approach, it shouldn't be too far off to experiment with ahead of time compilation?
I think there have been Erlang-to-C compilers in the past, and there are new ones popping up every now and then. In general though, AOT is in many ways considered a dead end by many, unless you have a specific application for it (even though that was my uni thesis subject).
You can look back all the way to a 1995 paper (1) (by people who then worked on early Java JITs) that did a comparison of JIT vs AOT compilation for Self; while good AOT is feasible in many ways, there are always situations, in languages not initially designed for AOT, where a JIT won't be forced into the conservative estimates that hamper performance.
(1)
https://www.cs.ucsb.edu/research/tech-reports/1995-04
The 1995 paper compares Self 1 and 2, which were JITs based on type inference, with Self 3 and 4, which were adaptive compilers (where Java's HotSpot technology came from) that were able to use type feedback thanks to the second compiler having access to the results of running the code produced by the first compiler.
So AOT compilation was not tested in that paper.
It was a while ago that I read the paper, but at the time Agesen's CPA algorithm was among the most powerful for type inference (and while we've had a bunch of incremental improvements in the field, there hasn't really been an order-of-magnitude better algorithm that can handle the edge cases many dynamic languages produce), so regardless of whether the Self runtime was JIT or AOT, the algorithm choice showed what was feasible/usable with AOT.
The Lumen project (
https://github.com/lumen/lumen
) is building a BEAM-bytecode-compatible AOT compiler. Its primary compilation target is WASM/WASI. It's still under very heavy development (and some things are waiting on updates to the WASM spec to land).
Data may only be kept (passed) in BEAM registers between instructions.
This may seem silly, aren’t machine registers faster?
Yes, but in practice not by much and it would make things more complicated.
My understanding is that the latency of an L1 cache load is 5 cycles on recent Intel processors.
In tight code that is a really substantial overhead. I get that register allocation is complicated, and I totally see the benefit of simplicity, but it's hard not to think that this will significantly limit performance.
It does limit performance, but not in a relevant way for the type of applications BEAM is designed for.
Don't lose sight of two things:
Erlang is designed for massive, distributed, highly reliable network systems. Therefore, it is likely that a lot of Erlang code does a lot of waiting around for the network rather than needing every available CPU cycle.
This is the first time they're releasing a JIT. They'll deliver their performance gains, and if they want to and it seems feasible they can come back and do further work in the future.
But you have to put work in where it's valuable, and somehow I doubt RabbitMQ (for example) is going to care about this being missed this time around.
Skylake Client is "4 cycles for fastest load-to-use (simple pointer accesses) 5 cycles for complex addresses"
https://en.wikichip.org/wiki/intel/microarchitectures/skylak...
If the address of the BEAM register file (rbx, apparently) stays constant, and the last write to that register was >5 cycles ago, the load can be speculated 5 cycles in advance; if the last write was <5 cycles ago, my understanding is that it gets handled by store-to-load forwarding with a latency of ~0.
Often, if your function is that small, inlining it is a better idea anyway?
Recommend changing the title to '...Erlang's JIT' or similar (though perhaps the 'erlang.org' domain provides enough context?)
I sort of wonder if this approach to JITing is worth it over just writing a faster interpreter. This is basically what V8's baseline JIT used to be, and they switched to an interpreter without that much of a performance hit (and there are still a lot of potential optimizations for their interpreter). LuaJIT 1's compiler was similar, although somewhat more elaborate, and yet it was still routinely beaten by LuaJIT 2's interpreter (to be fair, LuaJIT 2's interpreter is an insane feat of engineering).
They detail why they chose the current path rather than continuing improvements of the interpreter in the previous post [1]. See the last few paragraphs for a summary.
[1]
http://blog.erlang.org/a-closer-look-at-the-interpreter/
Really enjoying this series. Excellent job explaining what would otherwise be a complicated topic for newcomers.
How is this a JIT? Everything is compiled to machine code. There's no advanced type specialization or lazy deoptimization. It might as well be an in-memory, lazy, AOT compiler.
> It might as well be an in-memory, lazy
Yeah, that is a JIT. Perhaps not a very fancy one, in the same way that ARC is a very unfancy GC, but it is still a JIT.
> Everything is compiled to machine code.
Yes... but it does that just-in-time, which is what JIT stands for.
> There's no advanced type specialization or lazy deoptimization.
I don't think those things are required for it to be a JIT are they? I'd say it's a static JIT, rather than a dynamic, profiling, and specialising JIT, but both are JITs.
> It might as well be an in-memory, lazy, AOT compiler.
Well yeah... that describes a JIT.
A simple method JIT is also a good JIT, even without escape analysis and advanced optimizations. It avoids the opcode-dispatch branching and enables code caching.