Why is everyone complaining about people finding floats hard? Sure, scientific notation is easy to grasp, but you can't honestly tell me that it's trivial AFTER you consider rounding modes, subnormals, etc. Maybe if hardware had infinite precision like the real numbers nobody would be complaining ;)
One thing I dislike in discussions about floats is this incessant focus on the binary representation. The representation is NOT the essence of the number. Sure it matters if you are a hardware guy, or need to work with serialized floats, or some NaN-boxing trickery, but you can get a perfectly good understanding of binary floats by playing around with the key parameters of a floating point number system:
- precision = how many bits you have available
- exponent range = lower/upper bounds for exponent
- radix = 2 for binary floats
Consider listing out all possible floats given precision=3, exponents from -1 to 1, radix=2. See what happens when you have a real number that needs more than 3 bits of precision. What is the maximum rounding error using different rounding strategies? Then move on to subnormals and see how they open a can of worms in underflow scenarios that you don't see in integer arithmetic. For anyone interested in a short book covering all this, I would recommend "Numerical Computing with IEEE Floating Point Arithmetic" by Overton [1].
[1]: https://cs.nyu.edu/~overton/book/index.html
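As a concrete version of that exercise, here is a rough Python sketch (mine, not from the book) that lists every positive normal value of such a toy system:

# Toy float system: 3 significand bits (1.bb in binary), exponents -1..1, radix 2.
values = []
for e in range(-1, 2):
    for m in range(4):               # the two fractional bits: 00, 01, 10, 11
        significand = 1 + m / 4      # 1.00, 1.01, 1.10, 1.11 in binary
        values.append(significand * 2 ** e)
print(sorted(values))
# [0.5, 0.625, 0.75, 0.875, 1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0, 3.5]

Note how the gap between adjacent values doubles every time you cross a power of two; any real number that falls in a gap has to be rounded to one of these twelve values.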
I think the binary representation is the essence of floating point numbers, and if you go beyond the "sometimes, the result is slightly wrong" stage, you have to understand it.
And so far the explanation in the article is the best I found, not least because subnormal numbers appear naturally.
There is a mathematical foundation behind it of course, but it is not easy for a programmer like me. I think it is better to think in terms of bits and the integers they make, because that's what the computer sees. And going this way, you get NaN-boxing and serialization as a bonus.
Now, I tend to be most comfortable with a "machine first", bottom-up, low level approach to problems. Mathematical and architectural concepts are fine and all, but unless I have some idea of what it looks like in memory and the kind of instructions being run, I tend to feel lost. Some people may be more comfortable with high level reasoning; we don't all have the same approach. That's what I call real diversity, and it is a good thing.
Sorry, I didn't mean to downplay the value of using concrete examples. I absolutely agree that everyone learns better from concrete settings, which is why my original comment fixed the parameters for people to play with. I was referring more to the discussions of how exponents are stored biased, the leading bit in the mantissa is implied = 1 (except for subnormals), and so on. All these are distracting features that can (and should) be covered once the reader has a strong intuition of the more fundamental aspects.
The binary form is important to understand the implementation details. You even mention underflow. It's difficult for most people to initially understand why a float can't accurately store every large number that an equally sized integer can represent.
The binary form handily demonstrates the limitations. Understanding the floating point instructions is kinda optional but still valuable.
Otherwise everyone should just use varint-encoded arbitrary precision numbers.
It also explains why 0.1+0.2 is _not_ 0.3. With binary IEEE-754 floats, none of those can be represented exactly[a]. With decimal IEEE-754 floats, it's possible, but the majority of hardware people interact with works on binary floats.
[a]: Sure, if you `console.log(0.1)`, you'll get 0.1, but it's not possible to express it in binary _exactly_; only after rounding. 0.5, however, _is_ exactly representable.
Python 3.9.5
>>> 0.1.hex()
'0x1.999999999999ap-4'
>>> 0.2.hex()
'0x1.999999999999ap-3'
>>> (0.1 + 0.2).hex()
'0x1.3333333333334p-2'
>>> 0.3.hex()
'0x1.3333333333333p-2'
But they are repeating. So, by definition, they are not exactly representable in a (binary) floating point system. Again, that's why 0.1 + 0.2 is not 0.3 in binary floating point.
These are not "repeating". This is showing the exact binary representation of the nearest double precision value in each case.
> It's difficult for most people to initially understand why a float can't accurately store every large number that an equally sized integer can represent.
Because you don't have all the digits available just for the mantissa? That seems quite intuitive to me, even if you don't know about the corner cases of FP. This isn't one of them.
I was going to respond with something like this. I think for getting a general flavor*, just talking about scientific notation with rounding to a particular number of digits at every step is fine.
I guess the one thing that explicitly looking at the bits does bring to the table is the understanding that (of course) the number of bits or digits in the mantissa must be less than the number of bits or digits in an equivalent length integer. Of course, this is pretty obvious if you think about the fact that they are the same size in memory, but if we're talking about sizes in memory, then we're implicitly talking about bits anyway, so we may as well make it explicit.
* actually, much more than just getting a general flavor, most of the time in numerical linear algebra stuff the actual bitwise representation is usually irrelevant, so you can get pretty far without thinking about bits.
>"Sure it matters if you are a hardware guy, or need to work with serialized floats, or some NaN-boxing trickery"
Might you or someone else elaborate on what "NaN-boxing trickery" is and why such "trickery" is needed?
A NaN only pins down the exponent bits and the quiet bit; the remaining ~51 payload bits (plus the sign bit) are unused, so you can put data in there without impeding your ability to represent every floating point value.
For example, if you are designing a scripting language interpreter, you can make every value a floating point number and use some of the NaNs to represent pointers, booleans, etc.
See also
http://wingolog.org/archives/2011/05/18/value-representation...
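A minimal Python sketch of the idea (purely illustrative; real interpreters do this in C at the value-representation layer, and exact NaN payload preservation can depend on the platform):

import struct

QNAN = 0x7FF8000000000000            # exponent all ones + quiet bit set

def box(payload):
    # Stash a value in the low 48 bits of a quiet NaN (enough room for a pointer).
    bits = QNAN | (payload & 0xFFFFFFFFFFFF)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def unbox(value):
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    return bits & 0xFFFFFFFFFFFF

v = box(0xDEADBEEF)
print(v)              # nan (still behaves as an ordinary NaN in arithmetic)
print(hex(unbox(v)))  # 0xdeadbeef (the payload survives the round trip)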
As one who understands floats, I really wish there was a better notation for literals; the best would be a floating-point literal in binary representation.
For integers you can write 0x0B, 11, 0b1011 and have a very precise representation
For floats you write 1e-1 or 0.1 and you get an ugly truncation. If it were possible to write something like 1eb-1 (for 0.5) and 1eb-2 (for 0.25)... people would be incentivized to use nice negative-power-of-2 floats, which are much less error prone than ugly base conversions.
This way you could overcome the fears around floats being inexact and start writing more accurate (bit-for-bit) tests in many cases.
> _better notation for literals [...] something like 1eb-1 (for 0.5) and 1eb-2 (for 0.25)_
There are floating point hex literals. These can be written as 0x1p-1 == 0.5 and 0x1p-2 == 0.25.
You can use them in C/C++, Java, Julia, Swift, ..., but they are not supported everywhere.
https://observablehq.com/@jrus/hexfloat
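Python exposes the same thing through float.fromhex and float.hex; a quick check (mine, not from the linked page):

>>> float.fromhex('0x1p-1')
0.5
>>> float.fromhex('0x1p-2')
0.25
>>> (0.1).hex()
'0x1.999999999999ap-4'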
C++ hex floats are an interesting combination of 3 numeric bases in one!
- the mantissa is written in base 16
- the exponent is written in base 10
- the exponent itself is a power of 2 (not of 16 or 10), so that's base 2
One can only wonder how that came to be. I think they chose base 10 for the exponent to allow using the 'f' suffix to denote float (as opposed to double)
Julia is missing 32 bit and 16 bit hex floats unfortunately.
You can just wrap the literal in a conversion function, eg Float32(0x1p52), which should get constant propagated at compile time.
I know, it just isn't as nice to read.
Are you asking for pow(2, n)? Or for `float mkfloat(int e, int m)` which literally implements the given formula? I doubt that the notation you're suggesting will be used in code, except for really rare bitwise cases.
The initial precision doesn't really matter, because if you plan to use this value in a computation, it will quickly accumulate an error, which you have to deal with anyway. There are three ways to deal with it: 1) ignore it, 2) account for it, 3) use numeric methods which retain it in a decent range. You may accidentally (1)==(3), but the problem doesn't go away in general.
I'm not sure this is what you are asking for, but would this be suitable?
> ISO C99 and ISO C++17 support floating-point numbers written not only in the usual decimal notation, such as 1.55e1, but also numbers such as 0x1.fp3 written in hexadecimal format. [...] The exponent is a decimal number that indicates the power of 2 by which the significant part is multiplied. Thus '0x1.f' is 1 15/16, 'p3' multiplies it by 8, and the value of 0x1.fp3 is the same as 1.55e1.
https://gcc.gnu.org/onlinedocs/gcc/Hex-Floats.html
I don't understand how M * 2^E (modulo small representation details) is difficult to grasp.
Then you have decimal types as M * 10^E.
It's certainly much clearer than the windowing and bucketing in this article.
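For what it's worth, Python's standard library ships such a decimal (M * 10^E) type; a quick illustrative check of mine:

>>> from decimal import Decimal
>>> Decimal('0.1') + Decimal('0.2')
Decimal('0.3')
>>> 0.1 + 0.2 == 0.3
False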
His approach works well for me. I don't retain arbitrary facts. At least part of this is a fear that if I don't properly understand its dynamics I will misapply it; better to discard it.
The windowing explanation shows me a path from the intent of the designer through to the implementation (the algorithm). Now I can retain that knowledge.
> I don't retain arbitrary facts.
I still fail to understand what is arbitrary in scientific notation. The designers almost certainly didn't think in terms of windows when creating the data types; they probably thought about the mantissa and exponent the way it's usually explained, and maybe about information density (entropy per bit) when comparing with other possibilities.
Anyway, if it helps, great. Different explanations are always welcome. But this one is post-fact.
> I still fail to understand what is arbitrary in scientific notation
I used the word arbitrary, but did not say that scientific notation was arbitrary.
I will try to explain another way. The equation is an assertion. For me, it does not plainly follow from the underlying problem. Once the author has introduced the windowing approach, the equation follows easily for me.
> almost certainly [..] they probably [..] Different explanations are always welcome. But this one is post-fact.
By your own words, it is not clear that it is. It is clear that you doubt that the author of the system was thinking in terms of windows and buckets.
Ahem, exponential forms go back to Archimedes, and no design (or intent thereof) assumed windows or offsets in FP. It's just a fixed-precision floating-point notation, no more no less. The problem persists with decimal tax calculations like $11111.11 x 12% == $1333.3332, where the trailing 0.0032 is lost once the result is stored as a two-decimal monetary amount. It also persists with on-paper calculations, because there may be no room for all the digits that your algorithm produces (think x/7) and you have to choose your final acceptable precision.
_At least part of this is a fear that if I don't properly understand its dynamics I will misapply it_
The explanation that you like can't save you from it either. It's not something you think through once and then just write the correct code. Quick exercise with the "new" knowledge you've got: you have an array of a million floats, add them up correctly:
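(For reference, one standard answer is compensated summation; a rough Python sketch of mine, with math.fsum as the library routine that returns an accurately rounded sum:)

import math

def kahan_sum(values):
    total = 0.0
    compensation = 0.0                    # low-order bits lost so far
    for x in values:
        y = x - compensation              # re-inject the previously lost bits
        t = total + y                     # low-order bits of y are lost here...
        compensation = (t - total) - y    # ...and recovered into the compensation
        total = t
    return total

data = [0.1] * 1_000_000
print(sum(data))          # naive left-to-right sum drifts by roughly 1e-6
print(kahan_sum(data))    # much closer to the true sum
print(math.fsum(data))    # accurate reference from the standard library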
I think windowing and bucketing are a useful conceptual model for understanding the precision of the encoding space. I understand how FP values work in practice, but this article really hits the nail on its head when explaining the less tangible aspects of FP encoding.
The "default" formula as presented in the article seems⊠strange. Is this really how it's normally taught?
(-1)^S * 1.M * 2^(E - 127)
This seems unnecessarily confusing. And 1.M isn't notation I've seen before. If we expand 1.M into 1 + M : 0 < M < 1, then we pretty quickly arrive at the author's construction of "windows" and "offsets".
(-1)^S * (1 + M) * 2^(E - 127)
(-1)^S * 2^log_2(1 + M) * 2^(E - 127)
let F := log_2(1 + M), 0 < F < 1
Note: [0 < M < 1] implies [1 < M + 1 < 2] implies [0 < log_2(1+M) < 1]
(-1)^S * 2^(E - 127 + F)
Since we know F is between 0 and 1, we can see that F controls where the number lands between 2^(E - 127) and 2^(E - 127 + 1) (ignoring the sign). It's the "offset".
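To make the window/offset reading concrete, here is a small Python sketch of mine (double precision, so the bias is 1023 rather than the 127 in the single-precision formula above):

import struct

def window_and_offset(x):
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    sign = bits >> 63
    E = (bits >> 52) & 0x7FF           # biased exponent
    M = bits & ((1 << 52) - 1)         # 52 mantissa bits
    window = 2.0 ** (E - 1023)         # a normal value lies in [window, 2*window)
    offset = M / 2**52                 # how far across that window it sits
    return sign, window, offset

print(window_and_offset(3.0))   # (0, 2.0, 0.5): 3.0 sits halfway through [2, 4)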
Some past related threads:
_Floating Point Visually Explained (2017)_ - https://news.ycombinator.com/item?id=23081924 - May 2020 (96 comments)
_Floating Point Visually Explained (2017)_ - https://news.ycombinator.com/item?id=19084773 - Feb 2019 (17 comments)
_Floating Point Visually Explained_ - https://news.ycombinator.com/item?id=15359574 - Sept 2017 (106 comments)
Several years ago I made a basic, interactive explanation of the same-
https://dennisforbes.ca/articles/understanding-floating-poin...
At the time it was to educate a group I was working with about their concern that "every number in JavaScript is a double": it doesn't mean that 1 is actually 1.000000001 (everyone knows that floating point numbers are potentially approximations, so there is a widespread belief that every integer stored as a float must be an approximation too).
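A quick check of that point (my example, not from the linked article): doubles represent every integer up to 2^53 exactly, so small integers like 1 are stored without any error.

>>> 1.0 == 1
True
>>> 2.0**53 == 2.0**53 + 1     # beyond 2**53, consecutive integers start to collide
True
>>> 2.0**53 - 1 == 2.0**53 - 2
False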
As someone who never learned the "dreadful notation... discouraging legions of programmers", this was really helpful! The "canonical" equation doesn't mean anything to me and the window/offset explanation does.
I'm sure the author intended to keep the article short, but I think it would benefit from more examples, including a number like 5e-5 or something. It isn't clear to me how the window/offset explains that.
Edit: to clarify, I do not understand how the floating point representation of 0.00005 is interpreted with windows/offsets.
5e-5 is not really related to FP, it's just scientific notation. The number itself is still stored as shown in the post, it's just being printed in a more compact form.
The number after _e_ is a power of 10:
5e2 = 5 * 10^2 = 5 * 100 = 500
5e-5 = 5 * 10^-5 = 5 / 100000 = 0.00005
Once you internalize this, you just read the exponent as "x zeroes to the left".
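To connect it back to the window/offset picture the article uses, math.frexp shows which power-of-two window 0.00005 falls into (a quick sketch of mine, not from the comment above):

>>> import math
>>> math.frexp(5e-5)
(0.8192, -14)
>>> 2**-15 <= 5e-5 < 2**-14   # the window; the mantissa says where inside it
True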
Scientific notation is easily seen as just floating point in base ten, though. It has the same rules about the leading digit and such. Add in significant digits, and you have most all of the same concerns.
Scientific notation IS floating point.
Seeing the Intel math coprocessor mentioned was a blast from the past. I sold a few of them and installed a few of them when I used to work in computer stores.
It probably differs a lot per person, and I usually do find graphical explanations of stuff easier to grasp. But that formula was way easier to understand for me than the other explanation. Maybe it is related to the fact that I am pretty OK at math, but I have kinda bad dyslexia.
Past comments:
https://news.ycombinator.com/item?id=15359574
https://news.ycombinator.com/item?id=19084773
I was kinda hoping for a visualization of which numbers exist in floating point.
While I always knew about
0.1 + 0.2 -> 0.30000000000000004
it was still kind of an epiphany realizing that floating point numbers don't so much have rounding errors as they are simply discrete numbers.
You can move from one float to the next, which is a meaningful operation on discrete numbers like integers, but not on continuous numbers like the rationals and irrationals.
I also feel like that is a much more important takeaway for a user of floating point than knowing what the mantissa is.
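Python 3.9+ exposes exactly that stepping operation as math.nextafter; a quick demonstration of mine:

>>> import math
>>> math.nextafter(1.0, math.inf)
1.0000000000000002
>>> math.nextafter(0.3, math.inf)
0.30000000000000004
>>> 0.1 + 0.2 == math.nextafter(0.3, math.inf)
True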
I highly recommend watching this video, which explains floats in the context of Mario 64:
https://www.youtube.com/watch?v=9hdFG2GcNuA
Games are a great way to explain floats because they're so visual. Specifically, check out 3:31, where the game has been hacked so that the possible float values in one of the movement directions is quite coarse.
(If you're curious what a PU is, and would like some completely useless knowledge, this video is also an absolute classic and very entertaining:
https://www.youtube.com/watch?v=kpk2tdsPh0A
)
> You can move from one float to the next, which is a meaningful operation on discrete numbers like integers, but not on continuous numbers like the rationals and irrationals.
What do you mean by continuous? Obviously if you take a number line and remove either the rational or irrational numbers, you will end up with infinitely many holes.
The thing that makes floating point numbers unique is that, for any given representation, there are actually only _finitely_ many. There's a largest possible value and a smallest possible value and each number will have gaps on either side of it. I think you meant that the rationals and irrationals are _dense_ (for any two distinct numbers, you can find another number between them), which is also false for floating-point numbers.
> What do you mean by continuous?
I meant continuous in the "mathematical sense". But maybe continuous is actually a statistical term, and dense is the correct mathematical term.
I dunno, I feel it's better to consider a float not as a single discrete number but as an interval.
Wild, just got done listening to Coding Blocks podcast episode on data structure primitives where they go in depth on floating point, fixed point, binary floating point, and more. Great listen! See around 54min mark -
https://www.codingblocks.net/podcast/data-structures-primiti...
I found a tool[0] that helps me debug potential floating point issues when they arise. This one has modes for half-, single- and double-precision IEEE-754 floats, so I can deal with the various issues when converting between 64-bit and 32-bit floats.
[0] https://news.ycombinator.com/item?id=29370883
This is _the_ article that finally helped me grasp and understand IEEE-754. An excellent read.
I also found this video from SimonDev educative (and entertaining):
https://www.youtube.com/watch?v=Oo89kOv9pVk
Having read this article I understand floating point even less now
Anyone got a good reference for practical implications of floating point vs fixed point in measurement calculations (especially timing measurements)? Gotchas, rules of thumb, best practices, etc.?
Why is there no float type with linear precision?
That's called fixed point. There isn't hardware for it because it is cheap to do in software using integer math.
But wouldn't it be sweet to have SIMD acceleration for fixed point?
Or is that something you can do with integers and SIMD today?
Say 4x4 matrix multiplication?
Having implemented a nested PID controller using fixed-point arithmetic in an FPGA, I can tell you that fixed-point is a pain in the ass, and you only use it when floating-point is too slow, too large (in terms of on-chip resources), or consumes too much power, or when you _really_ need to control the precision going into and out of an arithmetic operation.
Your add is just an integer add. Your multiply is a mul_hi, 2 shifts, 1 add, and 1 mul.
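A rough Python sketch of that multiply in Q16.16 fixed point (the shift plays the role of mul_hi here; purely illustrative, not tied to any particular hardware):

FRAC_BITS = 16                        # Q16.16: 16 integer bits, 16 fractional bits

def to_fixed(x):
    return round(x * (1 << FRAC_BITS))

def fixed_mul(a, b):
    return (a * b) >> FRAC_BITS       # drop the extra 16 fractional bits

def to_float(a):
    return a / (1 << FRAC_BITS)

a, b = to_fixed(1.5), to_fixed(2.25)
print(to_float(fixed_mul(a, b)))      # 3.375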
Are you asking if integer SIMD exists? (It does)
Some DSP chips had hardware for fixed point. I think it's a shame that C never added support for fixed point.
There was a draft (ISO/IEC TR 18037, "Embedded C"), and GCC supports it in stdfix.h. The downside is that the types have limited integer range, since they're tailored for DSP applications where values are kept scaled between +/-1.0.