A whole page of benchmark results without even listing which x86 he was testing??? This varies a lot from one generation to the next and between Intel & AMD.
There are a ton of tradeoffs that have to be made in the microarchitecture, and in general the focus is real code and normal expected sizes rather than benchmarks. Page-multiple copies are common, and so are variable-sized calls that can be totally unaligned with random sizes.
A proper study of this would test a pile of different mixes of sizes and alignments and test on a bunch of different processor generations.
And then you get results where a hand-coded memcpy loop is faster in benchmarks, but in practice in a big program you might find that 'rep movs' is better because of the smaller code footprint, if you only optimize the few cases that matter.
The general rule is that the compiler and system libraries do a really good job optimizing for processors that shipped 10 years ago. So architects get to make decisions about how to help these operations. You can add a nifty feature that makes it faster (like CLZERO) but while it is good for benchmarks it will take years after you ship before normal programs benefit. You might optimize 'rep movs' but then you find that since it was slow in the past most real system libraries don't call it.
No. The post provides an artifact so that you can test it yourself on your configuration. Knowing how it performed on this particular stepping of that particular microarchitecture won't help you on yours. Yes, your mileage may vary because your mileage _always_ varies, even when you drive from point A to point B twice.
Rant continued. I would prefer if the performance sections of papers were restricted to single paragraphs and that artifacts were required. When I'm reading papers I'm more interested in their ideas than in the third decimal place of the result on that given day on some machine I don't have. _The Unix Timesharing System_ has no performance results. _The Case For A Reduced Instruction Set Computer_ has no performance results. The notion that not only having results is somehow important but that providing the testing methodology and hardware configuration is also necessary is nonsense. It just lards up papers and a couple of years later the cited configuration is irrelevant.
_The CRAY-1 Computer System_ has no performance results but its section on vector processing still rings true today. It's what the RISC-V vector extension uses.
It seems that if you're going to put in data, it should be done rigorously, otherwise, as you say, you're just, at best, distracting from the main idea.
Half the benchmarks I see posted here are done on AWS instances or the author's laptop, with power stepping/C-states likely enabled, cores not isolated, etc., let alone basic rigour around hardware specifics.
We are very much entering the age where mechanical sympathy is waning
CPU makers add so many complicated features to let users optimize their programs that it now takes too much knowledge for mere mortals to do so. That you have to rely on heuristics and measurements instead, as is usually and wisely advised, is a bit unsatisfying. That's sort of ironic, in a way.
I used to do assembly and count cycles, but now I wouldn't dare; it's hardcore compiler and library makers stuff. It's like "don't do your own crypto (optimization)".
Everyone knows why it is so, though - we cannot solve the problem by throwing more Gigahertz at it.
I have hand-written asm for use in production code, and I would do it again. I also knew exactly what cpu my code would be running on. If I were publishing the code to run on a variety of hardware, I would be very cautious.
Writing your own crypto is very different; the stakes are higher if you get it wrong.
Including the microcode version currently installed on the CPU, having gone through its release notes, alongside a profile from something like VTune?
This is definitely not correct. For many SIMDable workloads it's possible to achieve order of magnitude improvements over the compiler.
A good heuristic is simply minimize the number of instructions.
I doubt it'd be an issue for memcpy, but minimizing the size of a loop body can lead to very counterintuitive speedups now that processors have LSDs (loop stream detectors).
If you make them too small it can lead to counterintuitive slowdowns.
I guess that you are referring to other projects because the benchmarks in this repo use a stable seed, turbo disabled, physical machine, both random and stable sizes, etc.
What’s the story with the spikes for folly and your code in memcpy plots? They get chopped off and it’s not clear what number they hit. (Nice work, btw)
Haven't read the article, but C-states and P-states are different from turbo being disabled.
Also the benchmarks do not seem to use the same set of sizes for all implementations (nor a fixed random seed), so repeatability and comparability seem questionable.
I guess if you run it often enough it could still give useful numbers, but I understand the author is picking the best run.
Also there doesn't seem to be any attempt at avoiding compiler optimisations.
Benchmarking is hard
Grep the code for rand_reset. The code uses a fixed seed, fixed sizes for all programs, a stable nop baseline, etc. Also, the pointer indirection blocks compiler optimizations.
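To illustrate the trick, here's a hypothetical sketch (not the repo's actual harness; memset_under_test is my own name): calling the routine through a volatile function pointer hides the callee from the compiler, so it can't inline the call or fold it away.

```
#include <cstring>

// The volatile function pointer is an opaque call target: the compiler
// must emit a real indirect call on every iteration instead of
// recognizing memset and optimizing it out.
void *(*volatile memset_under_test)(void *, int, std::size_t) = std::memset;

void bench_iteration(char *buf, std::size_t n) {
    memset_under_test(buf, 0, n);
}
```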
Indeed. I looked at the benchmark itself, but not the RNG.
IMO there's too many things to get right and no easy standard way of doing it.
Is there at least a checklist? It should ideally be automated. IIRC there's some library, maybe a BLAS implementation (ATLAS?), that at compile time computes some machine-specific constants and refuses to proceed if a few things are not right.
People will tend to focus on the micro-structure of the assembly, but the takeaway I get from this and recent related work in LLVM is that avoiding the PLT is good for an easy 20% win at small sizes, which tend to be the common case. I like the new LLVM memcpy, which is just plain C++ and easy to read and understand.
https://github.com/llvm/llvm-project/blob/main/libc/src/stri...
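For a feel of that style, here's a compressed sketch of the small-size strategy (my own illustration, not LLVM libc's actual code; assumes GCC/Clang's __builtin_memcpy, which lowers fixed-size copies to plain loads and stores):

```
#include <cstddef>
#include <cstring>

// Two overlapping fixed-size copies cover every length in a range,
// so a handful of branches handle all small sizes without loops.
void *small_memcpy(void *d, const void *s, std::size_t n) {
    char *dst = static_cast<char *>(d);
    const char *src = static_cast<const char *>(s);
    if (n == 0) return d;
    if (n == 1) { *dst = *src; return d; }
    if (n <= 4) {         // overlapping 2-byte copies cover 2..4
        __builtin_memcpy(dst, src, 2);
        __builtin_memcpy(dst + n - 2, src + n - 2, 2);
    } else if (n <= 8) {  // overlapping 4-byte copies cover 5..8
        __builtin_memcpy(dst, src, 4);
        __builtin_memcpy(dst + n - 4, src + n - 4, 4);
    } else if (n <= 16) { // overlapping 8-byte copies cover 9..16
        __builtin_memcpy(dst, src, 8);
        __builtin_memcpy(dst + n - 8, src + n - 8, 8);
    } else {
        std::memcpy(dst, src, n); // larger sizes: defer to the library
    }
    return d;
}
```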
I was a little confused by the assembler in the fourth tweet. Here is what I think is going on. First the code:
```
1. vxorps   %xmm0, %xmm0, %xmm0
2. vmovups  %ymm0, 11(%rdi)
3. vmovups  %ymm0, (%rdi)
4. vzeroupper
```
So line 1 sets %xmm0, a 128-bit/16-byte register, to 0 (by xoring it with itself). I suppose the convention is for the caller to save the register if it wants it. The thing that confused me is that the way 256-bit/32-byte registers work is that ymmX represents the 256-bit register of which xmmX is the lower half, similar to the way %eax is the lower 32 bits of %rax. It turns out the upper half of %ymm0 is zeroed too, not by ABI but because VEX-encoded instructions that write an xmm register zero the upper bits of the corresponding ymm register.
So line 2 then writes out this 32-byte register (of zeros) starting at buffer[11] and ending at buffer[42]. The actual instruction name stands for “vector move unaligned packed single precision floats” and I don’t really understand why the precision or float type matters.
Line 3 writes it from buffer[0] to buffer[31]. I guess the overlap isn’t expensive.
Line 4 zeros the upper halves of all the ymm registers, which looks like a no-op here but is useful: it tells the CPU that no stale upper-half values remain, avoiding the penalty that later legacy-SSE code would otherwise pay after AVX instructions (but isn't it useful to break the chain on the lower half too?).
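For those more comfortable with intrinsics, here's a minimal sketch of the same overlapping-stores trick (my own illustration, assuming GCC/Clang with -mavx; zero43 is a hypothetical name matching the 43-byte case in the tweet):

```
#include <immintrin.h>

// Two overlapping unaligned 32-byte stores cover any length from 33 to
// 64 bytes. For n = 43: one store covers bytes [0, 31], the other
// covers [11, 42]; the 21-byte overlap is harmless.
void zero43(char *p) {
    __m256i z = _mm256_setzero_si256();                           // the vector xor
    _mm256_storeu_si256(reinterpret_cast<__m256i *>(p), z);       // bytes [0, 31]
    _mm256_storeu_si256(reinterpret_cast<__m256i *>(p + 11), z);  // bytes [11, 42]
    // The compiler emits vzeroupper on return as needed.
}
```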
Hi Nadav!
Seems like a lot of the benefit also comes from "Intel processors since < X > no longer care about aligned vs unaligned loads and stores" (so older code that spends effort lining this up doesn't benefit in the small cases up to 512 bytes).
Does that rationale still hold for larger blocks and/or hitting the cache lines awkwardly? Like make something where you'll have plenty of 64-byte copies which are offset enough that they are off alignment and pollute the neighboring cache line (probably most visible for memset).
On most new processors, Intel just suggests using REP STOSB and REP MOVSB for memset and memcpy respectively. That's not in the benchmark as far as I can see.
To define "new" more precisely, there is an "Enhanced REP MOVSB" flag in cpuid that tells you you can use those. It is set from Ivy Bridge and Zen 3 up.
It's still not _always_ faster:
https://stackoverflow.com/questions/43343231/enhanced-rep-mo...
As mentioned in that Stack Overflow post, though, things change again with FSRM (Fast Short Rep Mov).
There are still startup costs, but the overhead of calling a function (especially via the PLT) and of incurring instruction cache misses is hard to demonstrate in a microbenchmark, while rep movsb encodes more compactly than many flavors of call. In an actual application the "slower" but smaller implementation can often win (see https://research.google/pubs/pub50338.pdf and https://research.google/pubs/pub48320.pdf).
Link to GitHub for easier reading:
https://github.com/nadavrot/memset_benchmark
@dang could we make this the submission URL? It seems to be the same info, by the same individual.
Maybe Twitter isn't the best place to publish this kind of article.
Living higher up in the stack, there’s a bit of this I don’t really understand, but I’d like to. Can someone explain what’s going on here with a bit more information/explanation of these techniques?
All this talk about rep [insert instruction here] makes me wonder what the world would be like if we all had vector machines.