
The Fastest FizzBuzz Implementation

Author: Croftengea

Score: 259

Comments: 157

Date: 2021-12-02 08:04:24

Web Link

________________________________________________________________________________

atorodius wrote at 2021-12-02 08:25:38:

Relevant:

https://news.ycombinator.com/item?id=29031488

dang wrote at 2021-12-02 18:30:19:

Thanks! Macroexpanded:

_55 GiB/s FizzBuzz_ -

https://news.ycombinator.com/item?id=29031488

- Oct 2021 (255 comments)

throw10920 wrote at 2021-12-02 19:04:07:

If I may ask, what's the default keybinding for _dang-macroexpand-1_?

dang wrote at 2021-12-02 20:34:00:

ctrl+alt+b

lifthrasiir wrote at 2021-12-02 08:52:00:

> The developer behind the Assembler version goes by the handle "ais523".

Alex Smith [1] is also known for winning the Wolfram 2,3 Turing Machine Research Prize [2].

[1]

https://www.cs.bham.ac.uk/~ais523/

[2]

https://writings.stephenwolfram.com/2007/10/the-prize-is-won...

buro9 wrote at 2021-12-02 09:53:17:

The simplest change for performance, even in Python, JavaScript, etc., is to avoid the modulo operation.

All of the top contenders on

https://codegolf.stackexchange.com/questions/215216/high-thr...

do exactly this.

Yes, there is a lot of optimization specific to languages, OS, CPU... but the takeaway for me isn't that you have to go to that extent to build an order-of-magnitude improvement. You can still achieve many orders of magnitude of improvement, with readable code, by understanding what is happening when you use something like the modulo operation, and by recognizing that if your requirements let you avoid it, then that's the win.

For FizzBuzz you are told it "iterates from 1 to <some number>"... and modulo would be perfect for "take a single number and produce Fizz, Buzz or FizzBuzz if it's divisible by 3, 5, or both 3 and 5"... but as we're iterating, counters are better suited to the task.

I love seeing epic code golf (and the solution outlined in the article is exactly that), but the takeaway for the majority of engineers is that significant improvements can be made with a readable and maintainable solution just by understanding the requirements and comp sci fundamentals.
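
For illustration, a minimal counter-based version in C (a hypothetical sketch, not one of the contest entries; the counter names are mine):

    #include <stdio.h>

    int main(void) {
        int three = 0, five = 0;              /* counters replacing i % 3 and i % 5 */
        for (int i = 1; i <= 1000000; i++) {
            if (++three == 3) { fputs("Fizz", stdout); three = 0; }
            if (++five == 5)  { fputs("Buzz", stdout); five = 0; }
            if (three && five) printf("%d", i);   /* neither counter wrapped */
            putchar('\n');
        }
        return 0;
    }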

jiggawatts wrote at 2021-12-02 10:39:35:

The _really_ heavyweight use of div/mod is to produce the base-10 printed decimal numbers, not in the main control loop.

The record-beating versions all do something clever to optimise the printing of the numbers between the "Fizz" and "Buzz" strings.
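
One trick in this family (a hedged sketch, not the exact code of any entry; inc_decimal is a made-up helper) is to keep the number as ASCII digits and increment it in place, so no division or modulo is needed per line:

    #include <stdio.h>
    #include <string.h>

    /* Increment a fixed-width ASCII decimal number in place. */
    static void inc_decimal(char *digits, int len) {
        for (int i = len - 1; i >= 0; i--) {
            if (digits[i] != '9') { digits[i]++; return; }
            digits[i] = '0';                  /* carry into the next column */
        }
    }

    int main(void) {
        char num[8];
        memset(num, '0', sizeof num);         /* "00000000" */
        for (int i = 0; i < 20; i++) {
            inc_decimal(num, (int)sizeof num);
            printf("%.8s\n", num);            /* zero-padded for simplicity */
        }
        return 0;
    }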

richardwhiuk wrote at 2021-12-02 13:48:36:

> Before running the code, make sure your CPU does support AVX2. ... You should see "sse2" at a minimum.

AVX2 is not SSE2 - SSE2 is much, much older.

You want to check for the avx2 flag itself (on Linux, it's listed in /proc/cpuinfo).
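
With GCC or Clang, a runtime check is also straightforward (a small sketch using the compilers' x86 builtin):

    #include <stdio.h>

    int main(void) {
        __builtin_cpu_init();   /* initialize the CPU feature data (GCC/Clang) */
        puts(__builtin_cpu_supports("avx2") ? "AVX2 supported" : "no AVX2");
        return 0;
    }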

parhamn wrote at 2021-12-02 13:03:31:

When I saw this a few weeks ago it set off a lot of thinking about compilers and ML.

Even the unoptimized C version is 10x slower than the one that is painstakingly optimized.

Some of this performance gain will certainly be achievable by ML/AI-assisted compilers and translators. In that case, even a 2x speedup would be game-changing. Is anyone using ML in compilers yet? The part where you verify the code still does what you wanted it to do seems trickiest.

m3at wrote at 2021-12-02 13:35:20:

> Is anyone using ML in compilers yet?

Yes :)

Check out TVM for something mainstream:

http://tvm.apache.org/

And LibNC for something esoteric (but elegant):

https://bellard.org/libnc/

lovasoa wrote at 2021-12-02 23:29:07:

The question was about the use of ML in compilers, not the use of compilers in ML.

It was more about things like

http://compilerai.apps.iitd.ac.in/

m3at wrote at 2021-12-03 00:11:38:

Oh indeed I misread, thanks for the correction. Note though that TVM is _also_ using ML in the compiler:

https://arxiv.org/abs/2006.06762

solmag wrote at 2021-12-02 09:13:30:

I think the original author said that it was harder to make this than his master's thesis.

DeathArrow wrote at 2021-12-02 09:29:32:

It seems there's a huge gap between the machine code compilers can generate and hand-written, optimized assembler. There's a lot of room left for compilers to be improved.

andrewaylett wrote at 2021-12-02 10:43:20:

Back when I was employed to work on a compiler, our approach to such things was to implement our run-time libraries in C and fix the compiler so it was able to produce the desired output.

In this case, though, the biggest differences aren't coming from C vs assembler but from walking away from platform conventions -- ABI, memory layout, CPU compatibility. It might be easier to write assembly than to teach the compiler how to generate code that does this, but that's largely because you don't need these optimisations in most cases. If the methods used provided a general speedup, it would be worth implementing them in compilers.

btilly wrote at 2021-12-02 12:40:02:

I suspect that a compiler which produced code that will crash when you attempt to write to a file would be declared buggy and unusable.

And yet, that is one of the optimizations that helps get this one the speed crown.

bawolff wrote at 2021-12-02 09:45:58:

Most programs aren't FizzBuzz. This is basically hand-optimizing a microbenchmark. While I'm sure compilers can always be better, I'm not sure this particular example generalizes.

londons_explore wrote at 2021-12-02 10:15:03:

But hand optimizing a few microbenchmarks like this tells us how far we might be able to improve compilers.

And from the looks of it, at least 100x is possible. It's a shame therefore that even a 1% performance increase in a compiler release is considered pretty noteworthy.

anthony_r wrote at 2021-12-02 10:37:03:

This is a very specific problem that lends itself to some fantastically CPU-friendly optimizations. Doing a single hash-map lookup here on the hot path would kill performance, and let's not even get started on any serious I/O (disk or network), as that would be multiple orders of magnitude slower. Not many problems are just "cat something extremely predictable into /dev/null".

samhw wrote at 2021-12-02 13:12:40:

And so what? That's what this code does, and so a compiler should be able to generate the optimal assembly. Why would a compiler need to consider something that the code in question is _not doing_?

bawolff wrote at 2021-12-02 16:00:26:

How would the compiler know that? This code, for example, is only optimal if you are piping the output somewhere instead of displaying it on the terminal. There is no way, even in principle, for the compiler to know that this program is always run with stdout connected to a pipe.

Moreover, if you mean optimal in the absolute sense, I'm pretty sure that is equivalent to solving the halting problem.

If what you're asking is: why can't computers take all surrounding fuzzy context into account and do precisely the right thing, the answer is: because we haven't invented strong AI yet.

samhw wrote at 2021-12-02 17:40:44:

Ah, sorry, I misunderstood you. I thought you were placing the emphasis on what the code was doing, rather than where it was being piped. But why do you think that most of the possible optimisations would only be relevant if the output were being piped to somewhere other than a terminal?

Also, why would making it absolutely optimal be "equivalent to solving the halting problem"? (Though either way, no, I'm not talking about producing 100.0000% optimal code - clearly there's plenty of daylight between the current degree of optimality and that one.)

bawolff wrote at 2021-12-02 20:29:59:

> Ah, sorry, I misunderstood you. I thought you were placing the emphasis on what the code was doing, rather than where it was being piped. But why do you think that most of the possible optimisations would only be relevant if the output were being piped to somewhere other than a terminal?

One of the primary optimizations is using vm_splice to avoid memory copies. vm_splice only works if the file descriptor is a pipe.
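
The shape of that optimization, heavily simplified (a hypothetical sketch; the real entry also has to avoid rewriting pages the kernel may still be reading):

    #define _GNU_SOURCE
    #include <string.h>
    #include <sys/uio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void) {
        /* vmsplice hands these pages to the pipe without copying them. */
        static char buf[1 << 16] __attribute__((aligned(4096)));
        memset(buf, 'x', sizeof buf);
        struct iovec iov = { .iov_base = buf, .iov_len = sizeof buf };
        for (int i = 0; i < 1024; i++)
            if (vmsplice(STDOUT_FILENO, &iov, 1, 0) < 0)
                return 1;       /* fails with EINVAL when stdout is not a pipe */
        return 0;
    }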

> Also, why would making it absolutely optimal be "equivalent to solving the halting problem"?

https://en.m.wikipedia.org/wiki/Rice%27s_theorem

Or more specifically: if a program always halts (without outputting anything), then the best optimization would be to have it halt immediately, but to know that you have to have solved the halting problem.

You're right of course, absolutes are a bit of a distraction; we just need something good for 99% of programs. However, that's still really hard, and an optimization applied in the wrong context will slow down a program. There's still lots of room for compilers to grow, but the types of optimizations used in this example are probably at best NP-complete to determine whether they apply (I don't have any evidence for that, though).

anthony_r wrote at 2021-12-02 16:24:59:

Yeah you could optimize GCC to work really well for those problems, there's a great ROI for the "write something trivial really quickly into /dev/null" class of problems.

Let us know how this goes if you embark on this adventure.

samhw wrote at 2021-12-02 17:42:21:

If you genuinely understood me as saying that, then I should clarify, to be absolutely foolproof:

That is not what I'm saying. I'm talking about having a compiler which could produce more optimal code in any given case, or most given cases. I obviously am not suggesting that compilers optimise this and only this case.

anthony_r wrote at 2021-12-02 19:50:30:

> I'm talking about having a compiler which could produce more optimal code in any given case, or most given cases

Funny coincidence, that's exactly what the vast majority of compiler code is for, with tens (or hundreds) of thousands of man-years of work spent on that.

londons_explore wrote at 2021-12-02 13:17:39:

But nearly all problems boil down to nearly zero actual work...

Take booting up a computer, for example. All that _actually_ needs to be done is to display the login screen. And we know from 60 fps video playback that the hardware is capable of displaying a frame inside 16 milliseconds.

anthony_r wrote at 2021-12-02 16:23:26:

You can create an operating system that just displays a static image within milliseconds yourself (shouldn't take you more than a few weeks), it would work even when compiled at -O0.

pjc50 wrote at 2021-12-02 11:30:20:

Producing this solution requires not merely a clever compiler but an extraordinary amount of creative intelligence.

bonzini wrote at 2021-12-02 09:54:38:

The top 3 entries use completely different algorithms. The use of assembly might help a bit, but not as much as the raw numbers suggest.

DeathArrow wrote at 2021-12-02 09:26:14:

The most interesting thing is the C version is almost as fast as assembler version.

bonzini wrote at 2021-12-02 09:55:57:

The most interesting thing is that it took one day to write the C version and several months for the assembly version, because the C version uses a simpler algorithm. Talk about diminishing returns. :)

pokepim wrote at 2021-12-02 10:18:23:

And you could copy-paste the Python version from SO and have it running in less than a minute, so here’s the winner :)

bonzini wrote at 2021-12-02 10:39:28:

If the Python version is 2000 times slower, the breakeven would be after a few terabytes of FizzBuzz.

Qem wrote at 2021-12-02 12:51:10:

The Python version posted is about 85 MiB/sec, so only 600 times slower than the fastest one.

bonzini wrote at 2021-12-02 15:56:51:

Ah, forgot about pypy!

DeathArrow wrote at 2021-12-03 04:37:06:

Yes, but if you start the Python version, the C version will overtake it while there are still 5 minutes left in the day.

6510 wrote at 2021-12-04 22:27:42:

I would like to propose (or remind you) that for this type of problem, the fastest implementation is paper.

javier10e6 wrote at 2021-12-02 16:22:00:

Asm 56 GB/s (you are reading Sanskrit)

C 41 GB/s (you are reading US Tax Code)

Rust 30 GB/s (you are reading English)

Python (why bother) (you are speed reading)

So many tools, so many languages, life is great.

pxeger1 wrote at 2021-12-02 17:32:54:

This looks like a pretty rubbish blogspam regurgitation of the original SE answer.

w0mbat wrote at 2021-12-02 18:02:17:

I played with FizzBuzz when this was mentioned here a month ago, focusing on the algorithm not optimizations.

I stopped once I could do it in simple code with no math except addition, 15 numbers at a time.

    for (int x = 0; x < 1000000; x += 15) {
        printf("%d\n%d\nfizz\n%d\nbuzz\nfizz\n%d\n%d\nfizz\nbuzz\n%d\nfizz\n%d\n%d\nfizzbuzz\n",
               1 + x, 2 + x, 4 + x, 7 + x, 8 + x, 11 + x, 13 + x, 14 + x);
    }

That is simple enough and fast enough for me.

WithinReason wrote at 2021-12-02 18:33:18:

Did you check what the throughput of this is?

lovasoa wrote at 2021-12-02 23:59:13:

That's around 300MiB/s on my computer.

Terretta wrote at 2021-12-02 14:19:26:

YAWFB: Yet another wrong fizzbuzz.

“fizz” % 3, “buzz” % 5, what is the % 15 for?

You can blow an interviewee's mind by saying: great, now "bang" % 7... and (hopefully!) watch the light dawn.
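
A hypothetical table-driven version makes the point: adding "bang" % 7 is one more table row, with no % 15 / % 21 / % 35 / % 105 cases at all:

    #include <stdio.h>

    int main(void) {
        static const struct { int div; const char *word; } rules[] = {
            { 3, "fizz" }, { 5, "buzz" }, { 7, "bang" },
        };
        for (int i = 1; i <= 105; i++) {
            int matched = 0;
            for (unsigned r = 0; r < sizeof rules / sizeof rules[0]; r++) {
                if (i % rules[r].div == 0) {
                    fputs(rules[r].word, stdout);
                    matched = 1;
                }
            }
            if (!matched) printf("%d", i);
            putchar('\n');
        }
        return 0;
    }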

// why so serious? :-)

LeifCarrotson wrote at 2021-12-02 14:44:04:

The % 15 is because they're using 'elif' or 'else if'; they only process each entry once and they exit the function when a successful comparison is made.

It's true that you could do this with something like:

    result = ''
    if x % 3 == 0: result += 'fizz'
    if x % 5 == 0: result += 'buzz'
    if result == '': result += str(x)
    return result

to append fizz for multiples of 15 and then continue processing to append buzz as well, but this requires an extra comparison to check that you did neither of those before printing the number.

Terretta wrote at 2021-12-02 15:35:06:

We all know what the % 15 is for.

But note that if you do 3 comparisons, or 2 comparisons and an "extra" comparison, you've done the same number of comparisons; it's not really 'extra'.

Then go ahead and keep adding factors and see which approach results in more comparisons, as the genuinely extra comparisons rack up for the various combinations of factors.

BTW and FTW, your modulo-and-string-test example, more correct than the usual implementation, is among the most efficient:

http://www.zoharbabin.com/which-fizzbuzz-solution-is-the-mos...

// still not so serious...

andi999 wrote at 2021-12-02 12:02:08:

I don't understand the FizzBuzz thingy. I mean, isn't it just about who has the fastest print method? Correct me please, but this problem has period 15, so if you just write:

    for (i = 1; i < n; i += 15) {
        print i;
        print i+1;
        print 'fizz';
        print i+3;
        ...
    }

You have the fastest, don't you? (If the ending n is important, one can just write 15 such endgames and pick one depending on the modulus.) Since there is no 'if', the branch predictor should be on your side.

Tepix wrote at 2021-12-02 12:44:00:

You're printing the number too, so no, it does not have a period of 15.

zkldi wrote at 2021-12-02 12:28:47:

Did you even read the OP?

andi999 wrote at 2021-12-02 12:36:46:

The OP is great. And basically it is about how to print faster. I was more commenting on the fast fizzbuzz challenge itself.

trevyn wrote at 2021-12-02 09:10:54:

I wonder how an M1 Max assembly version would fare.

SekstiNi wrote at 2021-12-02 13:12:13:

The Firestorm cores are capable of issuing two 128-bit stores per cycle [1], which at their ~3.2 GHz clock gives a bandwidth of 2 × 16 B × 3.2 GHz = 102.4 GB/s. This matches the experiments by Andrei with mixed reads/writes [2], and I have personally verified that it is attainable for pure write workloads to L1 as well as L2.

This does not mean 102.4 GB/s is achievable on a single core, but if L2 bandwidth is the limiting factor, it very well could be.

[1]

https://dougallj.github.io/applecpu/firestorm.html

[2]

https://www.anandtech.com/show/17024/apple-m1-max-performanc...

Croftengea wrote at 2021-12-02 11:02:45:

ARM64 SIMD instructions perform in the same ballpark, see this:

https://news.ycombinator.com/item?id=25408853

faeyanpiraat wrote at 2021-12-02 10:09:49:

> Before running the code, make sure your CPU does support AVX2. Most 64-bit Intel and AMD CPUs should. ARM CPUs, like those found in newer Apple computers, Smartphones or Raspberry Pis, won't support AVX2.

Tepix wrote at 2021-12-02 12:42:44:

That sentence seems silly. ARM assembly and x86-64 assembly are two different beasts. Running the x86-64 code using Rosetta 2 emulation seems counter-intuitive as well if you want to get the fastest possible result on that machine.

comonoid wrote at 2021-12-02 10:23:49:

ARM CPUs have NEON. I cannot say whether NEON has equivalents for all the instructions used; however, the thread starter is talking about a complete rewrite, not just running it as-is.

innocenat wrote at 2021-12-02 11:29:07:

NEON only has 128-bit registers. AVX has 256-bit registers.

comonoid wrote at 2021-12-02 14:18:58:

Yep, it will most likely require reorganizing the program, but by itself that doesn't necessarily mean the result will be slower.

opentokix wrote at 2021-12-02 14:37:43:

Imagine sending this asm implementation to the 22 y/o Google recruiter and failing the interview. :D

jprupp wrote at 2021-12-02 18:18:28:

Hold my beer. Now where is that OpenCL programming book when I need it?

sshb wrote at 2021-12-02 14:19:22:

I wonder how FPGA and io_uring implementations would perform.

hackater wrote at 2021-12-02 15:31:29:

How do you learn about this? Does anyone have a good recommendation for books where you learn how to do these kinds of very low-level optimizations?

bXVsbGVy wrote at 2021-12-02 11:07:56:

How did they test it? Is the data actually being delivered, and read, in order?

Maursault wrote at 2021-12-02 15:10:29:

No machine code implementation? I wonder how much faster than assembler it could be.

pxeger1 wrote at 2021-12-02 17:30:14:

Assembler has basically no abstractions on top of machine code; just a human-readable syntax. So there would be no difference.

GnarfGnarf wrote at 2021-12-02 12:27:19:

_581 lines of Assembler_

Interesting, but not very practical.

chii wrote at 2021-12-02 12:33:33:

Is practicality really relevant to a competition to write the fastest FizzBuzz implementation?

sillysaurusx wrote at 2021-12-02 08:46:24:

As of this writing, the 3rd fastest submission is written in Rust and produces output at a rate of 3 GB/s, the 2nd is written in C and produces output at a rate of 41 GB/s, and the fastest is written in Assembler and produces output at a rate of 56 GB/s.

This makes me happy. The order is exactly as it should be, much to the disappointment of the Rustaceans.

It's a nice reminder that for the ultimate performance, there is no substitute for knowing exactly what the assembler is doing.

It's also a nice reminder that most of us will never want to do that.

EDIT: Whoa. It's late, and I thought Rust was at 30 GB/s instead of 3, i.e. roughly neck and neck with C. I meant to make a good-natured joke. There must be something wrong with the Rust implementation to be 10x slower; I wonder what it is.

As I mentioned further down in the comments, I'll donate $500 to Watsi if someone manages to displace the C implementation in Rust or any other language besides C/C++/ASM. Be sure to DM me on Twitter (

https://twitter.com/theshawwn

) if you do, both so I'll see it and so that I can congratulate you. I'm sure it's possible, but it's sort of interesting that the activation energy is so high.

pcwalton wrote at 2021-12-02 10:08:47:

You can basically take any LLVM IR that Clang would produce and produce Rust that compiles down to it (modulo irreducible control flow, I think, but the C++ doesn't have that). Because of that, C++ vs. Rust _language_ (as opposed to library) performance comparisons are uninteresting.

> It's a nice reminder that for the ultimate performance, there is no substitute for knowing exactly what the assembler is doing.

C and C++ do not let you know exactly what the assembler is doing. I work on compilers for a living and despite that--perhaps _because_ of that--I have no confidence that I can predict what instructions the compiler will generate ahead of time. If I need to know the actual instructions, I run my code through Godbolt and check. C, C++, and Rust are all millions of lines of LLVM optimization passes away from the metal.

Based on some other comments, I suspect that all the time is being spent in locking stdout, which you can amortize by using Stdout::lock [1]. As is so often the case when microbenchmarks are discussed online, this has nothing to do with the language.

Also, it shouldn't need to be said, but FizzBuzz is so totally unrepresentative of real workloads that you shouldn't draw any conclusions about language performance from it.

[1]:

https://doc.rust-lang.org/stable/std/io/struct.Stdout.html#m...

parhamn wrote at 2021-12-02 12:59:09:

> LLVM IR that Clang would produce

You most certainly can't, unless your code is unsafe, and in that case the comparison isn't so interesting anymore.

Aissen wrote at 2021-12-02 13:38:51:

It is very interesting if you have properly verified invariants and you can guarantee (with manual checking) that your code is correct. It's an order of magnitude harder than just letting the borrow checker do its job, but it's entirely possible (just like in any other language).

bonzini wrote at 2021-12-02 09:51:37:

The second one (written by me) is actually JIT-compiled assembly. So little of the runtime is spent in the C code that -O0 vs -O3 makes no difference at all; and even if you wrote the glue code in Python it would not be much slower than the C code.

Also the third entry is not the one in Rust. It's just that the Rust entry came later and thus was not included in the plot. The third entry is

https://codegolf.stackexchange.com/a/215236/56694

it is also C.

> displace the C implementation in Rust or any other language besides C/C++/ASM

This is not a language contest, it's an algorithmic contest: that is, it's the best algorithm that wins, not the best compiler. The top entry, converted into C or Rust by mostly menial work, would probably be only a few % slower and would displace the C implementation.

stavros wrote at 2021-12-02 10:11:00:

Now I'm kind of hoping that someone will rewrite the glue in Python so that Python is the second fastest implementation of them all, just for the laughs.

animal531 wrote at 2021-12-02 10:27:11:

So what we're seeing is that a specialized, hand-coded assembler version is 10x faster than a generic C version?

If so that makes me a bit sad, because in the modern day we'd want to assume that compilers have gotten a little better at optimization.

jiggawatts wrote at 2021-12-02 10:34:11:

Read the assembly.

Seriously. It's eye-popping. It's eldritch madness. It's black wizardry. The blackest of the black arts.

The "fastest FizzBuzz" implementation isn't merely "fast". It's _faster than memcpy()!!!_

Expecting compilers to outperform this tour-de-force is asking a tad too much from compiler writers...

btilly wrote at 2021-12-02 12:33:55:

From the article, _"The application produced 64 bytes of FizzBuzz for every 4 CPU clock cycles. The author states the ultimate bottleneck of performance is based on the throughput of the CPU's L2 cache."_

For comparison, that time is on par with integer multiplication. (64 bytes per 4 cycles is 16 bytes per cycle; at a clock in the ~3.5 GHz range, that works out to the quoted ~56 GB/s.)

noduerme wrote at 2021-12-02 11:17:34:

It's actually really nice the way the author commented it. It's a cool insight into their mental process. I've never coded in Assembly and I can't normally follow it. But reading these comments I can understand how the author is surgically managing registers and operating off a complete mental map of what's there, the same way a coder would with, say, a very large database they have a complete mental model of.

All I mean is that I appreciate the author making their thought process accessible. It certainly looks like virtuosity, but I'm not competent enough to judge.

nojs wrote at 2021-12-02 11:27:11:

The assembly is ridiculous. Random section:

    // The bytecode generation loop is unrolled to depth 30, allowing the
    // units digits to be hardcoded. The tens digit is stored in %al, and
    // incremented every ten lines of output. The parity of the hundreds
    // digit is stored in %ymm2: one half predicts the hundreds digit to
    // be even, the other to be odd, and the halves are swapped every time
    // the tens digit carries (ensuring the predictions are correct).

entropicdrifter wrote at 2021-12-02 15:05:58:

Controlled CPU predictions, wow. That's a level of performance I've never even contemplated

bonzini wrote at 2021-12-02 15:46:43:

It's not CPU predictions, it means that it computes in parallel one result that is correct and one that is wrong, and picks the right one by swapping the two halves every now and then.

bonzini wrote at 2021-12-02 10:37:11:

> assume that compilers have gotten a little better at optimization

Again: you're comparing the performance of different algorithms. A compiler won't turn bubblesort into quicksort.

tialaramex wrote at 2021-12-02 11:15:02:

I mean, this is importantly true as a generalisation, but it's worth knowing that modern compilers recognize _idioms_.

If you show a modern compiler the typical idiomatic blob of code for counting the set bits in a word in a language like C, the compiler goes "Oh, _popcount_ I know this" and when targeting modern CPUs it emits a single instruction, the popcount instruction for that CPU.
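
For example, this textbook bit-clearing loop (a standard illustration, not code from the thread) is recognized by GCC and Clang and becomes a single popcnt when the target allows it (e.g. -mpopcnt):

    /* Kernighan's bit-counting loop; compilers recognize the idiom. */
    unsigned count_bits(unsigned long long x) {
        unsigned n = 0;
        while (x) {
            x &= x - 1;         /* clear the lowest set bit */
            n++;
        }
        return n;
    }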

In APL, idiom recognition was essential to making even a halfway decent compiler. It's easy, and idiomatic, to write APL which would be tremendously wasteful if implemented naively, while recognising common idioms could remove lots of that waste without needing the level of sophistication modern compilers have during optimisation.

Because the compiler knows about idioms you should _not_ try to optimise your programs by writing non-idiomatic code _until_ you've determined you have a performance problem and the non-idiomatic code is identified as a way to fix it. Otherwise it may be that the compiler improves and now your code is not only harder to maintain it also has worse performance than the idiomatic solution.

rowanG077 wrote at 2021-12-02 11:26:45:

Not yet. But I have no doubt in my mind that once we have smarter-than-human AI, there will be compilers that absolutely can turn bubblesort into quicksort.

marginalia_nu wrote at 2021-12-02 13:08:01:

Seems fairly unlikely. The best sorting algorithm is a function of the data. The compiler can't know this. There are pathological data sets where quicksort is O(n^2). For small n, quicksort is also a bad choice.

The programmer can understand these domain characteristics but the compiler certainly can't. If code isn't running qsort(), this could be a mistake, but may also be an optimization.

rowanG077 wrote at 2021-12-02 15:36:01:

You almost always know something about the data that a sorting function will operate on. Unless of course you are implementing a generic sorting algorithm. In practice you could train your compiler on real world data to allow it to optimize better for your environment. This is already happening with profile guided optimizations.

gpderetta wrote at 2021-12-02 14:19:26:

PGO is a thing for this reason.

(But no, I'm not expecting AI driven compilers anytime soon)

foob wrote at 2021-12-03 03:05:22:

I think it's unfair that people downvoted this to gray. This person didn't claim that this was about to happen, and it seems likely to me that when/if AI surpasses human intelligence that something like this would be possible. It might require some additional way to describe expectations about the input data, and it might be many years away, but I would be disappointed if compilers couldn't do this in the far distant future. Look at the progress we've made in the last seventy years and try to extrapolate that out a thousand years forward. I obviously can't guarantee what will happen, but is it really that unreasonable to suggest progress like this as a strong possibility?

sillysaurusx wrote at 2021-12-02 10:12:45:

I would _love_ to see a Python implementation that approaches the C code. It would be educational on so many levels. Would anyone be interested in trying?

Also, why is the ASM version so much faster if the glue code doesn't make much difference? I suppose it matters at the margins...

bonzini wrote at 2021-12-02 10:20:53:

> why is the ASM version so much faster if the glue code doesn't make much difference

The remark on the glue code only applies to my entry (the second). The assembly entry uses a completely different (and much more complicated) algorithm than the C entry.

hectormalot wrote at 2021-12-02 10:24:30:

If you read the description you’ll see that it’s optimizing around certain syscalls like memcpy (which is otherwise limiting and reduces speed by 5x). I think it’s less language choice, and more low-level optimization, that is making a difference here.

(To check, I wrote a Go program that does nothing other than fmt.Print($largebody) in a for loop, and it tops out at around 10 GB/s.)

tux3 wrote at 2021-12-02 10:50:19:

Note that memcpy is not a syscall. Common syscalls do copies. The fast implementation uses the vmsplice syscall as a clever way to pump a pipe with data, because it doesn't copy.

tialaramex wrote at 2021-12-02 11:26:02:

Indeed. Today for most systems memcpy is in fact a compiler intrinsic. C's standard library offers a memcpy but the reality is that your C standard library was likely supplied by your C compiler vendor, and when you ask for memcmp() in C you get their preferred intrinsic inlined into your program unless you've been very firm in declining that optimisation.

In Rust this memcpy intrinsic is required (for the core language, not just the standard library) because Rust clearly has arbitrarily large data structures you can literally copy, so, somebody has to take responsibility for doing that work, Rust says it's the compiler back-end's job. On a toy CPU the memcpy might really just be a trivial read->register->write loop because that's adequate.

tialaramex wrote at 2021-12-02 16:07:50:

Oops, I notice hours too late to edit it, that this says memcmp() in one place where it mostly says memcpy(). In fact your C compiler likely has intrinsics for both, but I did mean memcpy() everywhere in the comment.

Qem wrote at 2021-12-02 12:54:26:

The Python implementation posted among the answers is about 85 MiB/sec, under PyPy.

bartvk wrote at 2021-12-02 08:59:49:

It's not just knowing exactly what the assembler is doing, but also what the kernel is doing (but maybe you also meant this). "This implementation only supports Linux", and then continues about the use of a bunch of system calls:

* "asks the Kernel to defragment the physical memory it's using via the madvise system call"
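
That call looks roughly like this (a hedged sketch; the real entry's memory management is considerably more involved):

    #include <stdlib.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 2 * 1024 * 1024;             /* one 2 MiB huge page */
        void *buf = aligned_alloc(len, len);      /* 2 MiB-aligned buffer */
        if (buf != NULL)
            madvise(buf, len, MADV_HUGEPAGE);     /* best-effort hint to the kernel */
        free(buf);
        return 0;
    }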

And then the most fascinating bit: "The CPU's L2 cache is split into two halves, one for calculating the output of FizzBuzz, the other storing the result of the last calculation batch. This means there is no need to use any slower L3 memory operations."

"Slower L3 memory operations", I've not read that sentence before :D

MayeulC wrote at 2021-12-02 15:24:49:

> the most fascinating bit

It is really fascinating, but I disagree a bit with your pick :)

To me, the most fascinating part is using a bytecode interpreter. It makes perfect sense: it's basically optimizing the code size to fit in the cache. It's also a bit like defining a whole new instruction set dedicated to, and optimized for, FizzBuzz!

I think the approach was a bit similar to a VLIW

https://en.wikipedia.org/wiki/Very_long_instruction_word

where a single 32-bit word is loaded and interpreted at a time.

Most of the state of the Fizzbuzz program is encoded in this bytecode.
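
In miniature, the shape of such an interpreter might look like this (entirely hypothetical; the real bytecode and its operations are far more specialized):

    #include <stdint.h>
    #include <stdio.h>

    enum { OP_LIT, OP_NUM, OP_HALT };             /* toy opcodes */

    /* Each 32-bit word packs an opcode (high byte) and an operand. */
    static void run(const uint32_t *pc, const char *lits[], int *counter) {
        for (;;) {
            uint32_t word = *pc++;
            uint32_t op = word >> 24, arg = word & 0xffffff;
            switch (op) {
            case OP_LIT:  fputs(lits[arg], stdout); break;
            case OP_NUM:  printf("%d\n", *counter += arg); break;
            case OP_HALT: return;
            }
        }
    }

    int main(void) {
        const char *lits[] = { "Fizz\n", "Buzz\n", "FizzBuzz\n" };
        const uint32_t prog[] = {
            (OP_NUM << 24) | 1, (OP_NUM << 24) | 1, (OP_LIT << 24) | 0,
            (OP_NUM << 24) | 2, (OP_LIT << 24) | 1, (OP_HALT << 24),
        };
        int counter = 0;
        run(prog, lits, &counter);                /* prints: 1 2 Fizz 4 Buzz */
        return 0;
    }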

> "Slower L3 memory operations", I've not read that sentence before

You hear it all the time when trying to optimize for speed in a hot loop. Keep your data structures in a way that fits in the cache. Align your data with cache boundaries. Code needs to be small too, hence the bytecode approach. This is also the reason why you can't unroll loops too much.

If your implementation does not need to fetch (or write) data from(/to) higher levels of the memory hierarchy, you'll see tremendous gains. There's even this bit in the original source: "_while waiting for writes to go through to the L2 cache_" :)

There are a few more very interesting hacks, such as the "high-decimal" format that skips the "_need to do binary/decimal conversions._"

GrayShade wrote at 2021-12-02 08:53:57:

You do seem to have an axe to grind with the people using Rust.

I don't see any reason why the same thing [1] could not be implemented in Rust.

EDIT: I got caught by an off-by-one there, the second fastest solution (doing JIT compilation) is actually [2].

[1]:

https://codegolf.stackexchange.com/a/215231

[2]:

https://codegolf.stackexchange.com/a/236737

sillysaurusx wrote at 2021-12-02 08:54:46:

Much like speedruns, it's up to you to prove it :)

I just enjoy dosing the Rusties with reality every so often.

If you write a Rust version that can beat the C version, I'll donate $500 to Watsi. Good luck!

KingOfCoders wrote at 2021-12-02 10:19:52:

"I just enjoy dosing the Rusties with reality every so often."

Aka troll.

sillysaurusx wrote at 2021-12-02 11:04:36:

Oh? Why is it trolling to say that Rust isn't as fast as C in this case?

If it were, you'd claim the prize. If you really want to hurt a troll, wouldn't you try to separate them from their $500?

KingOfCoders wrote at 2021-12-02 13:44:20:

You still assume I care. As stated above I don't care. It's in your head only that every Rust developer writes Rust because she thinks it's faster than C.

detaro wrote at 2021-12-02 11:13:18:

> _to say that Rust isn't as fast as C in this case_

That's not what the quoted statement is doing.

sillysaurusx wrote at 2021-12-02 11:26:02:

I think we'll have to agree to disagree. It's illustrative that a seemingly simple C program can't be easily ported to Rust. If it could be, someone would have already done it -- it's been a few hours.

It's also illustrative to examine exactly what the hangups are. You might even get more Rust converts if you acknowledge that there are serious engineering burdens to overcome when deciding to use Rust.

samhw wrote at 2021-12-02 11:50:24:

For the record, it's not the anti-Rust sentiment which people find annoying, it's the treatment of programming languages as though they were football teams.

If you want to make a case against Rust, then let's get concrete: _why_ is one Turing-complete compiled language that can make arbitrary syscalls not capable of compiling down to the same assembly as another?

_That_ would be an interesting discussion; this is not.

Retric wrote at 2021-12-02 12:53:23:

You can dig into it, but the details are mostly only relevant to people designing computer languages and tooling.

Aka Compilers, as long as the machine code is equivalent to the source code their output can be wildly different. However, someone needs to actually write them, which means both linguistic differences and the sheer effort that was actually put into the compiler are significant. You shouldn’t assume that every language is operating on an even playing field, because they're simply not. The obvious solution of compiling to a second language is never quite as capable as using the language directly.

samhw wrote at 2021-12-02 13:07:25:

> Compilers, as long as the machine code is equivalent to the source code their output can be wildly different

I don't understand what this means: the output of a compiler _is_ the machine code from the source code, no?

Also, Rust uses LLVM as a backend, which has benefited from lots of work given its use for C, C++, ObjC, Swift, &c. Compiling the C with Clang/LLVM might give a more interesting insight into the differences between Rust and C.

> The obvious solution of compiling to a second language is never quite as capable as using the language directly.

I'm not sure what you mean by this, or rather how it's related to what we're talking about?

Retric wrote at 2021-12-02 13:42:06:

> The output is the machine code.

You get different machine code using different compilers on identical pieces of source code. In theory, the results of running that source code should be equivalent, even if one is more efficient than another. If, for example, one compiler uses an ASM instruction that another never uses, then nothing the programmer does can get the second compiler to be as efficient as the first. The same is true of various optimizations etc.; sometimes a compiler is just better.

Bring a different language into the picture and things get even more divergent.

> how it’s related

It’s specifically in reference to Rust's use of LLVM not being sufficient to guarantee equivalence. Many languages target the JVM; that does not make them equivalent.

samhw wrote at 2021-12-02 17:46:03:

> You get different machine code using different compilers on identical pieces of source code.

Ah, right - well I'm certainly not denying that. (I'm familiar with the topic, to be clear: I've written compilers. I just wasn't clear what exactly you were saying.)

> It’s specifically in reference to Rust's use of LLVM not being sufficient to guarantee equivalence.

Formally, no. In practice, given that fact, given the experience of the maintainers of both languages, and given the 'compilability' of both languages, it would be surprising if there were a 10x gulf between them.

taneq wrote at 2021-12-02 10:40:41:

That kind of behaviour (edit: theirs, not yours) is usually ultimately motivated by fear. ;)

AnIdiotOnTheNet wrote at 2021-12-02 12:50:10:

Not necessarily. Language evangelists can be incredibly annoying... actually, evangelists in general can be really annoying. It is cathartic to prove annoying people wrong every once in a while.

GrayShade wrote at 2021-12-02 13:30:57:

I don't know. I use Rust myself, but I've seen more people complaining about Rust evangelism than people actually doing that.

The RESF is a bad meme.

colejohnson66 wrote at 2021-12-02 14:40:57:

When there's a memory related bug (or CVE) in a program written in C or C++, there's almost always some comment about how the Rust compiler would've caught it

tialaramex wrote at 2021-12-02 16:38:12:

On HN there is often: A comment about Rust would catch this (sometimes sarcastic), an explanation about how the bug isn't "really" memory safety and so Rust wouldn't catch it, followed by a comment explaining actually Rust does catch it.

The recent Linux mutex security bug is an example. That's not a memory safety bug, so surely Rust doesn't help? But Rust's mutual exclusion design relies on the Rust type system to give us safer locks: in Rust you protect things by having "lock the mutex" be the way you get the reference to the protected object. Thus when the faulty code locks the wrong mutex in Rust, it would get the wrong object, and either _not compile_ (you lock variable A and then you're trying to change B, but you don't even have B) or _not work_ (you always lock variable A, then change variable A, but your code should change B; it does not work, a trivially reproducible bug in your code).

KingOfCoders wrote at 2021-12-02 16:13:10:

From Rust developers or from app users? As an app user I would prefer apps to be written in Rust, and I'm reminded of that with every CVE for a C program.

Ygg2 wrote at 2021-12-02 12:56:55:

OTOH, being a dick to the majority of people because of the acts of a minority is worse.

trevyn wrote at 2021-12-02 09:03:50:

No no, C is great for writing fast FizzBuzz implementations; any larger program is hard to get correct. :-)

gerbilly wrote at 2021-12-02 15:42:47:

Ultimately, your argument is a troll because you are trying to tell people what they should value in a programming language.

bawolff wrote at 2021-12-02 16:09:29:

Pretty sure the fact that it's literally a speed contest is telling people what to value in a programming language.

KingOfCoders wrote at 2021-12-02 10:18:46:

"disappointment of the Rustaceans."

I'm using Rust and I'm not disappointed. Why should I be?

But I do appreciate how you create imaginary people in your head, troll them and mentally pat yourself on the back for doing so.

isoprophlex wrote at 2021-12-02 11:35:25:

All the rust users rising to the bait in this thread sure aren't imaginary

KingOfCoders wrote at 2021-12-02 13:46:49:

What comment comes to your mind of a Rust developer who thinks Rust is faster than C and needs a dose of reality?

colejohnson66 wrote at 2021-12-02 14:39:09:

Rust is a memory safe _systems_ programming language. As a systems programming language, it can be used in the same areas that C can (see Redox OS). I've seen quite a few Rust users online that have claimed that, because of that, it can be just as fast as C.

tialaramex wrote at 2021-12-02 16:28:48:

But it can be _just as fast_ as C. The C program that's so much faster here isn't "really" a C program; it's just a fancy way to shovel machine code, which you could of course also do (unsafely) in Rust if you chose to, but why?

However what you were asked about is claims it's _faster_ and there micro-benchmarks like this are not revealing. At scale you're obliged to limit yourself to what you can write _that works_ and in Rust the safety lets you reach higher - which is how it achieves better performance. In fact that's what Mozilla used Rust for. They'd failed more than once to ship a concurrent CSS speed-up, finding it too hard to successfully get the bugs out and so returning to a single threaded CSS implementation - but the attempt in Rust (Quantum) worked just fine.

This is the same reason C++ is faster: a C++ iterator isn't _faster_ than the naive C for loop (in fact they emit identical code), but the iterator is easier to think about, which means you can confidently write more complicated software using this abstraction. Rust just provides the safety to go higher still.

KingOfCoders wrote at 2021-12-02 16:14:18:

What is the opinion of Golang users on this?

politician wrote at 2021-12-02 17:00:43:

This is really cool. It should be obvious that a few thousand hand-rolled assembly opcodes targeting a fixed architecture would win the race. If the size of the solution grew a couple of orders of magnitude larger though, I'd expect that the micro optimization challenge would overwhelm the author and compilers would be more competitive.

Is the Table of Contents of your upcoming book public?

KingOfCoders wrote at 2021-12-02 17:57:50:

No not yet, incorporating first draft feedback.

You made my evening for asking <3

gopiandcode wrote at 2021-12-02 12:38:25:

You're totally right. I think a lot of rust developers are going to be absolutely distraught that rust isn't the fastest language for writing fizzbuzz programs.

I guess they're just going to have to settle with rust being the best language to write safety-critical high performance concurrent low-level code.

tester34 wrote at 2021-12-02 11:05:06:

>This makes me happy. The order is exactly as it should be, much to the disappointment of the Rustaceans.

Why does it make you happy? It's a reminder of how much performance, energy and so on we're losing when we write in safe/expressive/easy-to-use languages like Java, C#, Python, yada yada.

yodelshady wrote at 2021-12-02 10:03:08:

FYI, on my machine a naive Rust implementation (modulo plus println!) runs at the same speed as a naive _Python_ implementation (16-17 MB/s), even compiled with --release. Naive C manages 160-170 MB/s. (And I fully admit the naive implementations are pretty much as far as I'm going with this.)

I would guess it's the println! macro eating the time.

beltsazar wrote at 2021-12-02 16:23:07:

There are two reasons: 1) the println! macro will lock and unlock stdout every time it's called; 2) Rust's stdout is line-buffered by default.

On my machine the naive C implementation achieves 81.7 MiB/s, whereas a naive Rust implementation with println! achieves 7.58 MiB/s.

When I acquired stdout's lock once and used writeln!(&mut stdout, ...) instead of println!, I got 8.15 MiB/s.

When I wrapped stdout with BufWriter, I got 299 MiB/s.

    use std::io::{BufWriter, Write};

    fn main() {
        let stdout = std::io::stdout();
        let stdout = stdout.lock();
        let mut stdout = BufWriter::with_capacity(32 * 1024, stdout);

        for i in 1..1_000_000_000 {
            let _ = match (i % 3 == 0, i % 5 == 0) {
                (true, true) => writeln!(&mut stdout, "FizzBuzz"),
                (true, false) => writeln!(&mut stdout, "Fizz"),
                (false, true) => writeln!(&mut stdout, "Buzz"),
                (false, false) => writeln!(&mut stdout, "{}", i),
            };
        }
    }

yodelshady wrote at 2021-12-03 14:04:52:

Well TIL that printf won't lock stdout.

Calling stdout.lock() by itself makes little difference for me. Using BufWriter as you have I get ~80 MiB/s (slightly ahead of PyPy).

My implementation also has/had a String allocation in the loop, kicking that out got me to 120 MiB/s. I tend to write it as though the numbers/strings could be changed by a product manager at any time.

Final edit/errata: I get up to 140 MiB/s by appending .to_string() in the integer case instead of format!(), which makes some sense.

I didn't bother saving my original C implementation, but in retrospect I think I didn't use string concatenation, as I have with other languages. Redoing it with string concatenation, C is slightly _slower_ than Rust, at 110 MiB/s. That also makes sense, as concatenation in C is slow.

pcwalton wrote at 2021-12-02 10:15:23:

It's probably stdout locking. Call Stdout::lock.

yodelshady wrote at 2021-12-02 13:03:17:

edit: weirder, naive Julia (1.7 on Linux) is _terrible_. 45 KB/s.

Ruby is slightly slower than Python or Rust, 11 MB/s. I was expecting Python to be slowest among the interpreted languages, apparently not. Pypy yields a speedup to 70 MB/s.

Summing up the naive implementations (for each i: start with an empty string, append fizz/buzz based on the modulus of each, then conditionally append i, and print):

Julia 45 KB/s

Ruby 11 MB/s

CPython 15 MB/s

Rust 16 MB/s

Pypy 70 MB/s

C 160 MB/s

sc11 wrote at 2021-12-02 13:21:25:

Do you mind sharing the Julia code? Seems odd that it's that far away from even Python.

yodelshady wrote at 2021-12-03 13:44:02:

for i = 1:typemax(UInt64)
    s = ""
    if i % 3 == 0
        s = s * "fizz"
    end
    if i % 5 == 0
        s = s * "buzz"
    end
    if s == ""
        s = string(i)
    end
    println(s)
end

Measured over 10 s after a 10 s burn-in period, Julia 1.7 on Ubuntu 21.10.

adgjlsfhk1 wrote at 2021-12-02 14:16:58:

by default Julia uses locking for writes to guarantee thread safety when multiple writes are happening. That said, it's totally possible to go way faster using buffered IO.

Qem wrote at 2021-12-02 14:10:09:

I think recent versions of Ruby include an optional JIT compiler. Did you try it?

lawn wrote at 2021-12-02 09:33:03:

So you think that it's expected that C is 10 times faster than Rust?

sillysaurusx wrote at 2021-12-02 09:46:50:

Wow. Perhaps I should be embarrassed, but I mentally filled in "30 GB/s" instead of 3 GB/s. I thought Rust was mostly neck and neck with C. I see now why a lot of fellows responded in a defensive way.

What's it doing to be so much slower? That's shocking. I may not _enjoy_ Rust, but I do respect its ability to approach C++ performance.

Arnavion wrote at 2021-12-02 09:57:45:

At the very least, the itoap crate being used there does divisions by and comparisons to powers of 10 as part of stringifying integers, which the C and ASM versions don't seem to be doing. A crate with a function for stringifying arbitrary integers cannot take advantage of the fact that it's going to be used for stringifying consecutive integers, as an inlined solution can.

misja111 wrote at 2021-12-02 13:20:09:

As far as I understand, that #2 C program contains hardly any C code but mostly inlined assembly.

bonzini wrote at 2021-12-02 15:49:18:

The program itself is 100% C, but it spends 99.9999% of the time in assembly code that is generated at runtime.

IshKebab wrote at 2021-12-02 12:48:39:

> That's shocking.

Not really. This is a hyper optimised microbenchmark. The actual work being done is so trivial that any "overhead" is really significant and you can get order of magnitude speedups by doing arcane things with caches, eliminating tiny inefficiencies and so on.

xxpor wrote at 2021-12-02 15:43:15:

Complete speculation, but I think you'd find that's true for most problems. Very little CPU time is spent on the business logic, and the rest is overhead.

friedman23 wrote at 2021-12-02 15:33:22:

I'm curious, how did Rust affect you so badly that you hate it so much? If it's just silly people suggesting it everywhere, I think you have some personal issues you need to resolve; language choice isn't something to get emotional about.

ask_b123 wrote at 2021-12-02 17:08:08:

I've not seen any indication that sillysaurusx _hates_ Rust.

jedimastert wrote at 2021-12-02 15:08:15:

> Rustaceans

Completely off topic, but suddenly all of the crab branding makes a _ton_ more sense

Arnavion wrote at 2021-12-02 09:38:55:

>It's a nice reminder that for the ultimate performance, there is no substitute for knowing exactly what the assembler is doing.

I didn't realize viewing the assembly of a compiled binary was something reserved for C. Someone should tell rust.godbolt.org that it's not supposed to exist.

dahfizz wrote at 2021-12-02 13:37:25:

Go read the second-place C entry [1]. It's a JIT for assembly code.

[1]

https://codegolf.stackexchange.com/questions/215216/high-thr...

samhw wrote at 2021-12-02 12:10:34:

I think their point is that C is sufficiently close to assembler that you know exactly what assembly any given line of C will produce. This is an old canard which is not really true (

https://zenhack.net/2015/08/02/c-is-not-portable-assembly.ht...

), but it's common among people who derive a sense of identity from writing C.

recentdarkness wrote at 2021-12-02 08:51:47:

Interesting for me was the fact that the author talks about themselves in the third person. Very unusual for content like this :D

lifthrasiir wrote at 2021-12-02 08:56:22:

The author of this post (Mark Litwintschik) is not the same as the author of the assembly solution (Alex Smith, also known as ais523) :-)

nickdothutton wrote at 2021-12-02 10:10:36:

This chap is the developer. Kudos to him.

https://www.wolframscience.com/prizes/tm23/alex_smith_bio.ht...

Croftengea wrote at 2021-12-02 08:57:00:

> The developer behind the Assembler version goes by the handle "ais523".

> I've been unable to uncover this person's real-life identity

isodev wrote at 2021-12-02 12:35:53:

Come on now, are we really using FizzBuzz as a performance benchmark? Seriously, don't base real-world decisions on this.

PS. The Rust implementation is also not very optimal; for example, stdout is not locked, so... what are we even trying to show? :-)

tourist2d wrote at 2021-12-02 12:48:15:

Can you really not tell the difference between a fun challenge and a real world scenario?

gumby wrote at 2021-12-02 15:49:35:

Pfft. No machine learning was used?