
Do we really need undefined behavior?

Author: ibraheemdev

Score: 49

Comments: 27

Date: 2021-12-03 20:51:48

________________________________________________________________________________

Ericson2314 wrote at 2021-12-04 17:43:59:

One thing I don't see remarked upon enough is that UB fundamentally is about the _logical_ side of programming languages, in particular

https://en.wikipedia.org/wiki/Principle_of_explosion

. This quote:

Here I refer to the typical modern interpretation of UB: assumptions the compiler may trust, without bounds on what happens if they are violated.

Makes that abundantly clear.

This basic realization opens up many avenues for interdisciplinary research. For example, maybe all the ink spilled on "paraconsistent logics" will finally be useful for designing an unsafe language like C. (

https://plato.stanford.edu/entries/logic-paraconsistent/

https://ncatlab.org/nlab/show/paraconsistent+logic

.) Maybe "unsafe" in Rust is best understood as a logical "mode" of some sort. A lattice of allowed behaviors leads to a nice fine-grained semantics for semi-defined behavior.

I honestly don't know whether no one talks about this because it is very obvious to researchers in the field, or because no one actually dove deep into this interpretation.

Ericson2314 wrote at 2021-12-04 22:47:31:

My comment above got upvotes but no replies, and so the mystery of whether this is obvious and I am preaching to the choir, or a genuinely enlightening point, remains unresolved :)

Someone wrote at 2021-12-04 22:17:13:

I think the first question to ask is whether we need a language that does away with array bounds checking.

If, like C, you think you do, you accept (on currently popular CPUs) that code may read from and write to arbitrary memory, and that means you have to define that as “anything can happen”.

If you don’t, I think you can get rid of the most common causes of UB.

So, with modern compiler technology, are languages that check array bounds necessarily slower than those that don't? If so, is the difference large enough to matter?
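
As a rough, hedged illustration of what that question is weighing (example and names mine, not from the thread): the branch in the sketch below is what a bounds-checking language inserts on every indexed access, and how often an optimizer can hoist or eliminate it is exactly what such a comparison would measure.

    #include <stdlib.h>

    /* Hypothetical helper: the check a bounds-checking language inserts
     * on every indexed access; an optimizer can often hoist or remove it
     * when the index is provably in range. */
    int get_checked(const int *a, size_t len, size_t i) {
        if (i >= len) {
            abort();   /* trap instead of touching arbitrary memory */
        }
        return a[i];
    }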

spacechild1 wrote at 2021-12-04 22:43:05:

> So, with modern compiler technology, are languages that check array bounds necessarily slower than those that don't? If so, is the difference large enough to matter?

For performance-sensitive tasks - definitely! I certainly don't want array bounds checking in my (soft) realtime audio processing code.

As a side note: when I write an audio plugin, it's the responsibility of the host application to tell me the correct buffer sizes, so I basically deal with trusted input. It's still possible to mess up, of course, but personally I found writing DSP code to be pretty safe.

stncls wrote at 2021-12-04 18:44:03:

The arguments I've read for or against undefined behavior are always:

1. fully binary: "[all] UB is nonsense" vs. "[all] UB is necessary for performance"

2. qualitative / anecdotal: "look at this compiler doing something unexpected" vs. "look at this optimizing compiler's transformation that would be impossible without UB".

Instead I would love to see a more fine-grained and quantitative study. There are _many_ different types of UB. I'm sure every one of them is exploited by optimizing compilers. However, what would be the quantitative impact of disabling each of them on, say, SPECINT benchmarks? Of course some types of UB-related assumptions are deeply embedded in compilers, so it's clearly a question easier asked than answered. But I think simple things like signed-integer overflow handling can be changed (in clang at least?). Has anyone done benchmarking with this? Are you aware of any papers on the topic?
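
As one tiny, hedged example of something such a study could toggle (example mine, not from the comment): with signed overflow left undefined, compilers typically fold the comparison below to a constant true, while under GCC/Clang's -fwrapv (which makes signed overflow wrap) they have to keep the real comparison.

    /* With signed-overflow UB, a compiler may assume x + 1 never wraps
     * and fold this to 'return 1'. Compiled with -fwrapv, x == INT_MAX
     * makes it false, so the comparison has to stay. */
    int plus_one_is_greater(int x) {
        return x + 1 > x;
    }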

vlovich123 wrote at 2021-12-04 19:42:00:

People focus too much on SPECINT. What I'd like to see instead is the companies that are already running lots of real-world benchmarks report behavior changes. Then you can get more meaningful real-world results. Now obviously this isn't something you could develop locally, so these companies would have to build tooling to gather these kinds of reports. It would be extremely beneficial for compiler authors (e.g. "run experiment X"). Rust has the best setup I've seen, in that it has both microbenchmarks and non-microbenchmarks and looks for statistical significance across all workloads rather than hyper-focusing on one small thing.

solmag wrote at 2021-12-04 19:54:26:

I think one of Jung's points in that blog post was that we don't really know the performance benefits of UB.

OJFord wrote at 2021-12-04 18:37:19:

My reaction to the title was 'well sure, do we really need tax or legal loopholes and grey areas either?', as if the question doesn't really make sense. But for the author's actual position, 'Do we really _not_ need undefined behaviour?' seems like a better fit. I.e. they're arguing that it's useful, in response to a claim that it should be eliminated.

nayuki wrote at 2021-12-04 21:45:50:

A short note about the content - the step of creating an out-of-bounds pointer is UB, and the step of dereferencing the pointer is also UB:

    int *x = j + 32/4;
    *x = 40;

Thus, you can argue that if UB were to be banned, it would be better to crash at the point of creating the out-of-bounds pointer, instead of accommodating the out-of-bounds write by always reloading every read from memory in case of aliasing.

Jweb_Guru wrote at 2021-12-04 22:24:49:

How do you know it's out of bounds, in general? C pointers aren't required to store their size at runtime, and even if they did, and you were inside the extent, the pointer might have already been freed due to aliasing (which can be impossible to track if your language allows integer-to-pointer casts). Of course in some obvious cases it would be good to detect this (and many compilers try), but in cases where it's obvious, isn't it better to error at compile time rather than at runtime?

clktmr wrote at 2021-12-04 19:37:02:

"Do we really need UB?" is the wrong question asked. "How much UB do we need to achieve the design goals of our language?".

I think introducing UB while planning tight integration with the language's tooling from the start has lots of upside. Even GCC's UBSan does quite a good job, although it was created after the fact.

saurik wrote at 2021-12-04 19:56:37:

FWIW, some of the biggest cases of undefined behavior are ones that, as far as I am concerned, are workarounds for legacy code: things like how older code uses "int" as the counter for every single for loop, despite this being an incredibly stupid type for this purpose... compilers are then competing on how fast they make the existing ecosystem of code and so add awkward detection cases like "if it sort of looks like you only increment it I will decide you didn't want overflow and that way I can avoid having to simulate it now that int is 32-bits but the CPU is 64-bits".

But once compilers start adding these special cases then new code has to be written to understand "if I add a bounds check on some wrapping operation this compiler thinks I am insane and removes it as it assumes I am an idiot using the wrong type". And once compilers start doing that, particularly with incompatible heuristics, the concepts of what the language is even guaranteeing begin to erode and the standards adjust to define "undefined" behavior.

But the correct solution wasn't to fix the compiler... it was to use the correct data type! Whether that feels like it should be size_t or uintptr_t I will refrain from providing an opinion on in this comment (to avoid bike-shedding), but the premise that the compiler is a billion little attempts to figure out how to make my code faster--rules that change constantly and are under-specified--when it is the code that is wrong, is extremely frustrating.
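
Purely for illustration, and deliberately not taking a side in that size_t-vs-uintptr_t question (example and names mine): with an index type whose width already matches object sizes on the target, there is simply nothing for the compiler to widen or to wrap-model.

    #include <stddef.h>

    /* Illustrative only: with a size_t index there is no sign extension
     * or overflow modeling needed to address a[i] on a 64-bit target. */
    long sum_array(const int *a, size_t n) {
        long total = 0;
        for (size_t i = 0; i < n; i++) {
            total += a[i];
        }
        return total;
    }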

If you want to avoid some aliasing use case, or make math operations faster, or avoid issues with the calling convention on "this" pointers, or whatever it is... we should agree on what the code _should_ look like to make those semantics clear and then _forcibly deprecate as non-compliant_ any implementation that uses strange quirks to auto-detect and "improve" old code.

Alternatively, we should have a special mode the compiler should support to "just do what the code says even if it is slow" and we should then guide developers in the direction that it is a "code smell" to stay in quirks mode: that if there is a way to make the "strict" compiler mode understand your code correctly and still be as fast as the one with tons of undefined behavior, it is your code that is at fault.

But like, the market incentives will never support this: people evaluating between two compilers are going to run benchmarks against projects that they have been using for decades and choose the compiler that "works better"; and meanwhile, convincing developers to stop using "int" or explicitly call out aliasing or whatever will never happen, because people will benchmark the code against a compiler with all these bells and whistles and decide "the risk of making these modifications to every single place in our code where it might matter is too great, as we might break something in the process, and the compiler seems to compile our code just as fast without it being explicit (due to UB)" with a side of "the explicit syntax for that is caught up in committee and isn't supported on one of our key targets yet".

So, I personally consider this all a "lost cause". But I also really want to point out that the root cause of all of this undefined behavior is an underlying incentive structure that your pet language isn't immune to: you just got to start off clean without a ton of legacy baggage code. You don't have a ridiculously large amount of code written with silly assumptions for 16-bit systems that has been carefully dragged through the ages into 64-bit, and you have a ton of explicit syntax already to prevent a lot of the other brutal cases in all new code... and there is also (probably) a single implementation of the compiler.

But, one day, 20 years from now, when a billion lines of code have been written in your language, there are four big implementations of the compiler (only one of which still supports legacy ARM CPUs, as everyone has since moved on to KR7 or whatever is hip in 2040), and you've made a ton of sad choices in an attempt to hold your community together, someone is going to run a cost/benefit analysis and decide "keeping existing code working but making it faster is more important--and wins us more marketshare--than telling people how to port their code to the new era".

Now, I don't know what these cases are going to be--maybe something super complex involving how massively parallel execution environments handle order of execution for old code designed to run on a classic computer core, or how to map explicit integer representations into prime-order field elements for zero-knowledge execution, or what to do with information leakage during homomorphic encryption--but it is going to happen, and we are going to repeat all of these "quirks mode vs. strict mode" discussions each generation for every actually-important piece of technology, just as we have for standards even as far afield from "programming" as HTML and TCP.

sokoloff wrote at 2021-12-04 20:21:17:

> like how older code uses "int" as the counter for every single for loop, despite this being an incredibly stupid type for this purpose

Is it? I always internalized int as “the most natural sized number for the platform”. For a lot of uses, that’s a pretty reasonable choice, despite there being more specific/constrained types available now.

msbarnett wrote at 2021-12-04 22:14:33:

It's not about the size, it's about signedness being presumably unnatural for an index, and the very inconvenient consequences for performance on modern CPU architectures if you were not able to assume the index never goes negative due to a wrap.

Basically, to get decent perf out of C loops, arithmetic overflow _has to_ be left as UB for int. If the standard index variable type had historically been unsigned and defined to terminate on overflow or something, you could make wrap-on-overflow standard for int without killing the performance of hot inner loops.
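
A hedged sketch of the classic case behind that claim (example mine, not from the thread): treating overflow of the 32-bit int index as UB lets the compiler assume the index never wraps, so the trip count is simple and the index can be widened; with wrap-defined overflow, n == INT_MAX would make the loop below run forever, and the compiler has to account for that.

    /* Because overflowing the int index is UB, the compiler may assume
     * i never wraps, treat the trip count as exactly n + 1 (for n >= 0),
     * widen i to a pointer-sized induction variable, and often vectorize.
     * With wrap-defined overflow, n == INT_MAX would make this loop run
     * forever, so the trip count is no longer a simple expression. */
    void zero_through(float *a, int n) {
        for (int i = 0; i <= n; i++) {
            a[i] = 0.0f;
        }
    }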

cpeterso wrote at 2021-12-04 21:45:16:

int is signed and signed int overflow is UB. If you use an int as a loop counter, the compiler can optimize the loop assuming the loop counter will never overflow because, surely, a programmer would never invoke UB. :) If the loop counter does overflow, then who knows what bad things will happen. For example, a simple loop like “for (int i = 0; i >= 0; i++) { … }” might become an infinite loop, crash, execute random code, or never execute in the first place.
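
Spelling that inline example out as a tiny, assumed program (mine, not from the comment), just to make the surprise concrete:

    #include <stdio.h>

    int main(void) {
        /* Since overflowing i is UB, the compiler may treat i >= 0 as
         * always true and drop the exit check entirely; built with
         * -fwrapv instead, i would wrap to INT_MIN and the loop would
         * stop. Either way, relying on the wrap is a bug. */
        for (int i = 0; i >= 0; i++) {
            printf("%d\n", i);
        }
        return 0;
    }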

sokoloff wrote at 2021-12-04 22:29:41:

I think it's defined to execute at least until overflow, is it not? (So, even assuming a 1-bit int, it's going to execute at least once I think.)

seoaeu wrote at 2021-12-04 21:21:51:

The problem is that the "most natural sized number for the platform" was 32 bits for so long on major platforms that now everyone assumes changing it would break too much to be worth switching.

eximius wrote at 2021-12-04 18:26:44:

I don't follow. This demonstrates that we use UB, not that we need it?

Jweb_Guru wrote at 2021-12-04 19:25:55:

The point he is making is that trying to optimize C without any UB at all is basically impossible, even basic stuff like register allocation. If you're willing to forego that much performance just for the nebulous goal of producing "platform-specific UB" (which in practice can be just as damaging as the "regular" kind of UB in most cases) to simplify reasoning about your program, do you really need to be writing in C? Or would another, memory-safe language suffice? For that matter, if you really only care about what happens on the platform you're on, why not write in assembler and avoid having to worry about an optimizer at all? After all, "platform-specific" UB will otherwise vary across systems, leading to the same kind of unexpected behavior when you port your program (e.g. weak memory bugs in code designed for x86-64 machines that only surface on ARM).

(And to be clear--I think the article is being incredibly diplomatic in saying "there isn't enough research" into compilers that don't optimize based on UB. It would be absolutely shocking if a non-UB-based compiler were able to come anywhere close to modern C compilers on either real-world programs or toy benchmarks, because almost every avenue of optimization would be closed to them. When any out of bounds write from anywhere in your code, including another thread, can modify a variable on the stack, it's not even usually possible to coalesce or rearrange writes, or remove redundant reads! These sorts of things are the bread and butter of compiler optimizations. Hell, are we even allowed to inline function calls? It might break out of bounds writes to the saved stack pointer!).
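
A minimal sketch of that redundant-reads point (example and names mine, not from the article): if an out-of-bounds store through an unrelated pointer were defined to be able to hit a local, the compiler could not keep that local in a register across the store.

    /* Under C's actual rules the store through p cannot alias 'sum',
     * whose address never escapes, so 'sum' stays in a register and no
     * reload is needed after the store. If arbitrary out-of-bounds
     * writes were defined behavior, 'sum' would need a stack slot and
     * would have to be re-read after every store through an unknown
     * pointer. */
    int example(int *p, int a) {
        int sum = a + a;
        *p = 0;
        return sum;
    }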

throwuxiytayq wrote at 2021-12-04 21:28:26:

> It would be absolutely shocking if a non-UB-based compiler were able to come anywhere close to modern C compilers on either real-world programs or toy benchmarks, because almost every avenue of optimization would be closed to them.

Modern C# has next to no undefined behavior, you can disregard the managed features and write low level code with it, and it JIT-compiles nearly as efficiently as C.

Unity has an LLVM-based compiler for a subset of C# that emits _extremely clean_ native assembly.

These days there's no real reason for languages to be as ugly as C.

Jweb_Guru wrote at 2021-12-04 22:09:07:

> Modern C# has next to no undefined behavior, you can disregard the managed features and write low level code with it, and it JIT-compiles nearly as efficiently as C.

The calculus for memory-safe languages is completely different. They are not allowed to perform arbitrary, unchecked, out-of-bounds writes, so optimizations assuming those writes don't exist are sound without UB. If you mix them with an unsafe language or component that can use raw pointers, that part of the language has a lot of the same UB that C does, or else it would invalidate optimizations in the managed part of the language. If you've ever seen C++ code that needs to interact with a managed runtime, you're well aware of how careful the unmanaged part has to be to avoid treading on runtime invariants.

Anyway I'm not defending C here. I'm defending UB. It exists in the language for a reason, which is that it's not meaningfully possible to optimize programs without it. If people don't want to deal with it, they should use another language that isn't C. That doesn't mean that the _amount_ of UB in C couldn't be greatly reduced, but the article is responding to people who claim that a modern optimizing C compiler could reasonably avoid non-"platform specific" UB altogether.

eximius wrote at 2021-12-04 20:44:29:

UB that just says "this is an axiom the compiler makes" is fine. Ignoring it as a possibility is fine. Lighting your program on fire when it _knows_ you've done it instead of just telling you or doing something consistent is not.

Anyway, if this is basically just "C is bad", whatever, I'm not using C for anything. I'd use Zig or Rust.

Jweb_Guru wrote at 2021-12-04 22:13:54:

> UB that just says "this is an axiom the compiler makes" is fine.

That's literally what UB is.

If the compiler makes any optimizations based on those axioms, and the axioms are violated, you'll end up with unexpected behaviors ("lighting your program on fire") that are crazy hard to debug. So saying "the compiler can have UB, but it just shouldn't make weird optimizations based on that" is a little hard to justify, or else you get absurd results like what I outlined in the post above. A lot of beginning C programmers _do_ find it strange that they can't index past the end of a variable and expect to always hit the next variable on the stack, for example, but experienced C programmers don't. Being able to rearrange, reuse, and register allocate variable slots is one of the most fundamental optimizations out there.

Now, imagine if compilers were never allowed to reorder writes and loads because another thread might potentially update them... pretty crazy, right? Well, a lot of the "benign data races" people yell at compilers for screwing with, are caused by exactly that kind of reasoning, assuming that writes in one thread will always be seen in the same order by other threads without any explicit synchronization or ordering primitives. And that's not even necessarily a "messing with what the hardware does" thing, because there's hardware out there that will reorder ordinary loads and stores for you!
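
A hedged sketch of the "benign data race" pattern being described (names and structure mine): a plain flag written by one thread and read by another with no synchronization is a data race, i.e. UB in C11, and the compiler is allowed to hoist the load out of the loop, which is exactly the surprise people then report.

    #include <stdbool.h>

    bool done = false;   /* set by another thread, with no atomics or locks */

    void wait_for_done(void) {
        /* The racy read of 'done' is UB in C11, so the compiler may load
         * it once before the loop and spin forever even after the other
         * thread sets it; some hardware reorders plain loads/stores too. */
        while (!done) {
        }
    }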

masklinn wrote at 2021-12-04 18:47:25:

Indeed.

Though one of the confusing issues with UB is that the _purpose_ of the UB varies.

Some UBs are really close cousins of implementation-defined behaviors (IBs), like UB on signed overflow: it could have been IB, but was probably deemed not worth that status, so was left undefined.

Others are about the intersection of efficiency and lack of expressivity, like dereferencing null pointers: implicitly checking each and every access was (likely) considered too costly, but the language has no way of expressing nullability statically.

So given a language with C-type semantics, you don't _need_ the first UB (though it can certainly be leveraged by compilers), while the second one is much harder to get rid of.
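
For that second case, a rough sketch of what "implicitly checking each and every access" would mean if the language itself defined a null dereference to trap (hypothetical; in practice most platforms get an equivalent check for free from the MMU by leaving page zero unmapped):

    #include <stdlib.h>

    /* Hypothetical: the branch a compiler would have to emit before
     * every dereference if a null dereference were defined to trap. */
    int load_checked(const int *p) {
        if (p == NULL) {
            abort();
        }
        return *p;
    }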

marginalia_nu wrote at 2021-12-04 21:18:28:

The is-ought problem is tricky like that.

larsrc wrote at 2021-12-04 19:41:56:

Gödel says you do.

Jweb_Guru wrote at 2021-12-04 19:54:57:

Gödel's incompleteness theorem does not say that you cannot create a programming language (even a Turing-complete one) without undefined behavior. I'd love to hear how you came to that conclusion.