Why is everyone complaining about people finding floats hard? Sure, scientific notation is easy to grasp, but you can't honestly tell me that it's trivial AFTER you consider rounding modes, subnormals, etc. Maybe if hardware had infinite precision like the real numbers nobody would be complaining ;)
One thing I dislike in discussions about floats is this incessant focus on the binary representation. The representation is NOT the essence of the number. Sure it matters if you are a hardware guy, or need to work with serialized floats, or some NaN-boxing trickery, but you can get a perfectly good understanding of binary floats by playing around with the key parameters of a floating point number system:
- precision = how many bits you have available
- exponent range = lower/upper bounds for exponent
- radix = 2 for binary floats
Consider listing out all possible floats given precision=3, exponents from -1 to 1, radix=2. See what happens when you have a real number that needs more than 3 bits of precision. What is the maximum rounding error using different rounding strategies? Then move on to subnormals and see how they open a can of worms in underflow scenarios that you don't see in integer arithmetic. For anyone interested in a short book covering all this, I would recommend "Numerical Computing with IEEE Floating Point Arithmetic" by Overton [1].
[1]: https://cs.nyu.edu/~overton/book/index.html
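As a concrete version of that exercise, here is a rough Python sketch (mine, not from the book) that lists every positive normal value of such a toy system:

# Toy float system: 3 significand bits (1.bb in binary), exponents -1..1, radix 2.
values = []
for e in range(-1, 2):
    for m in range(4):               # the two fractional bits: 00, 01, 10, 11
        significand = 1 + m / 4      # 1.00, 1.01, 1.10, 1.11 in binary
        values.append(significand * 2 ** e)
print(sorted(values))
# [0.5, 0.625, 0.75, 0.875, 1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0, 3.5]

Note how the gap between adjacent values doubles every time you cross a power of two; any real number that falls in a gap has to be rounded to one of these twelve values.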
I think the binary representation is the essence of floating point numbers, and if you go beyond the "sometimes, the result is slightly wrong" stage, you have to understand it.
And so far the explanation in the article is the best I found, not least because subnormal numbers appear naturally.
There is a mathematical foundation behind it of course, but it is not easy for a programmer like me. I think it is better to think in terms of bits and the integers they make, because that's what the computer sees. And going this way, you get NaN-boxing and serialization as a bonus.
Now, I tend to be most comfortable with a "machine first", bottom-up, low level approach to problems. Mathematical and architectural concepts are fine and all, but unless I have some idea of what it looks like in memory and the kind of instructions being run, I tend to feel lost. Some people may be more comfortable with high level reasoning; we don't all have the same approach. That's what I call real diversity, and it is a good thing.
Sorry, I didn't mean to downplay the value of using concrete examples. I absolutely agree that everyone learns better from concrete settings, which is why my original comment fixed the parameters for people to play with. I was referring more to the discussions of how exponents are stored biased, the leading bit in the mantissa is implied = 1 (except for subnormals), and so on. All these are distracting features that can (and should) be covered once the reader has a strong intuition of the more fundamental aspects.
The binary form is important to understand the implementation details. You even mention underflow. It's difficult for most people to initially understand why a float can't accurately store every large number that an equally sized integer can represent.
The binary form handily demonstrates the limitations. Understanding the floating point instructions is kinda optional but still valuable.
Otherwise everyone should just use varint-encoded arbitrary precision numbers.
It also explains why 0.1+0.2 is _not_ 0.3. With binary IEEE-754 floats, none of those can be represented exactly[a]. With decimal IEEE-754 floats, it's possible, but the majority of hardware people interact with works on binary floats.
[a]: Sure, if you `console.log(0.1)`, you'll get 0.1, but it's not possible to express it in binary _exactly_; only after rounding. 0.5, however, _is_ exactly representable.
Python 3.9.5
>>> 0.1.hex()
'0x1.999999999999ap-4'
>>> 0.2.hex()
'0x1.999999999999ap-3'
>>> (0.1 + 0.2).hex()
'0x1.3333333333334p-2'
>>> 0.3.hex()
'0x1.3333333333333p-2'
But they are repeating. So, by definition, they are not exactly representable in a (binary) floating point system. Again, that's why 0.1 + 0.2 is not 0.3 in binary floating point.
These are not "repeating". This is showing the exact binary representation of the nearest double precision value in each case.
> It's difficult for most people to initially understand why a float can't accurately store every large number that an equally sized integer can represent.
Because you don't have all the digits available just for the mantissa? That seems quite intuitive to me, even if you don't know about the corner cases of FP. This isn't one of them.
I was going to respond with something like this. I think for getting a general flavor*, just talking about scientific notation with rounding to a particular number of digits at every step is fine.
I guess the one thing that explicitly looking at the bits does bring to the table is the understanding that (of course) the number of bits or digits in the mantissa must be less than the number of bits or digits in an equivalent length integer. Of course, this is pretty obvious if you think about the fact that they are the same size in memory, but if we're talking about sizes in memory, then we're implicitly talking about bits anyway, so we may as well make it explicit.
* actually, much more than just getting a general flavor, most of the time in numerical linear algebra stuff the actual bitwise representation is usually irrelevant, so you can get pretty far without thinking about bits.
>"Sure it matters if you are a hardware guy, or need to work with serialized floats, or some NaN-boxing trickery"
Might you or someone else elaborate on what "NaN-boxing trickery" is and why such "trickery" is needed?
A NaN only pins down the exponent bits and the quiet bit; the remaining ~51 payload bits (plus the sign bit) are unused, so you can put data in there without impeding your ability to represent every floating point value.
For example, if you are designing a scripting language interpreter, you can make every value a floating point number and use some of the NaNs to represent pointers, booleans, etc.
See also
http://wingolog.org/archives/2011/05/18/value-representation...
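A minimal Python sketch of the idea (purely illustrative; real interpreters do this in C at the value-representation layer, and exact NaN payload preservation can depend on the platform):

import struct

QNAN = 0x7FF8000000000000            # exponent all ones + quiet bit set

def box(payload):
    # Stash a value in the low 48 bits of a quiet NaN (enough room for a pointer).
    bits = QNAN | (payload & 0xFFFFFFFFFFFF)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def unbox(value):
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    return bits & 0xFFFFFFFFFFFF

v = box(0xDEADBEEF)
print(v)              # nan (still behaves as an ordinary NaN in arithmetic)
print(hex(unbox(v)))  # 0xdeadbeef (the payload survives the round trip)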
As one who understands floats, I really wish there was a better notation for literals; the best would be a floating-point literal in binary representation.
For integers you can write 0x0B, 11, 0b1011 and have a very precise representation
For floats you write 1e-1 or 0.1 and you get an ugly truncation. If it were possible to write something like 1eb-1 (for 0.5) and 1eb-2 (for 0.25)... people would be incentivized to use nice negative-power-of-2 floats, which are much less error prone than ugly base conversions.
This way you could overcome the fears around floats being inexact and start writing more accurate (bit-for-bit) tests in many cases.
> _better notation for literals [...] something like 1eb-1 (for 0.5) and 1eb-2 (for 0.25)_
There are floating point hex literals. These can be written as 0x1p-1 == 0.5 and 0x1p-2 == 0.25.
You can use them in C/C++, Java, Julia, Swift, ..., but they are not supported everywhere.
https://observablehq.com/@jrus/hexfloat
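Python exposes the same thing through float.fromhex and float.hex; a quick check (mine, not from the linked page):

>>> float.fromhex('0x1p-1')
0.5
>>> float.fromhex('0x1p-2')
0.25
>>> (0.1).hex()
'0x1.999999999999ap-4'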
C++ hex floats are an interesting combination of 3 numeric bases in one!
- the mantissa is written in base 16
- the exponent is written in base 10
- the exponent itself is a power of 2 (not of 16 or 10), so that's base 2
One can only wonder how that came to be. I think they chose base 10 for the exponent to allow using the 'f' suffix to denote float (as opposed to double)
Julia is missing 32 bit and 16 bit hex floats unfortunately.
You can just wrap the literal in a conversion function, eg Float32(0x1p52), which should get constant propagated at compile time.
I know, it just isn't as nice to read.
Are you asking for pow(2, n)? Or for `float mkfloat(int e, int m)` which literally implements the given formula? I doubt that the notation you're suggesting will be used in code, except for really rare bitwise cases.
The initial precision doesn't really matter, because if you plan to use this value in a computation, it will quickly accumulate an error, which you have to deal with anyway. There are three ways to deal with it: 1) ignore it, 2) account for it, 3) use numeric methods which retain it in a decent range. You may accidentally (1)==(3), but the problem doesn't go away in general.
I'm not sure this is what you are asking for, but would this be suitable?
> ISO C99 and ISO C++17 support floating-point numbers written not only in the usual decimal notation, such as 1.55e1, but also numbers such as 0x1.fp3 written in hexadecimal format. [...] The exponent is a decimal number that indicates the power of 2 by which the significant part is multiplied. Thus '0x1.f' is 1 15/16, 'p3' multiplies it by 8, and the value of 0x1.fp3 is the same as 1.55e1.
https://gcc.gnu.org/onlinedocs/gcc/Hex-Floats.html
I don't understand how M * 2^E (modulo small representation details) is difficult to grasp.
Then you have decimal types as M * 10^E.
It's certainly much clearer than the windowing and bucketing in this article.
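For what it's worth, Python's standard library ships such a decimal (M * 10^E) type; a quick illustrative check of mine:

>>> from decimal import Decimal
>>> Decimal('0.1') + Decimal('0.2')
Decimal('0.3')
>>> 0.1 + 0.2 == 0.3
False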
His approach works well for me. I don't retain arbitrary facts. At least part of this is a fear that if I don't properly understand its dynamics I will misapply it; better to discard it.
The windowing explanation shows me a path from the intent of the designer through to the implementation (the algorithm). Now I can retain that knowledge.
> I don't retain arbitrary facts.
I still fail to understand what is arbitrary in scientific notation. The designers almost certainly didn't think in terms of windows when creating the data types; they probably thought about the mantissa and exponent the way it's usually explained, and maybe about information density (entropy per bit) when comparing with other possibilities.
Anyway, if it helps, great. Different explanations are always welcome. But this one is post-fact.
> I still fail to understand what is arbitrary in scientific notation
I used the word arbitrary, but did not say that scientific notation was arbitrary.
I will try to explain another way. The equation is an assertion. For me, it does not plainly follow from the underlying problem. Once the author has introduced the windowing approach, the equation follows easily for me.
> almost certainly [..] they probably [..] Different explanations are always welcome. But this one is post-fact.
By your own words, it is not clear that it is. It is clear that you doubt that the author of the system was thinking in terms of windows and buckets.
Ahem, exponential forms go back to Archimedes, and no design (or intent thereof) assumed windows or offsets in FP. It's just a fixed-precision floating-point notation, no more no less. The problem persists with decimal tax calculations like $11111.11 x 12% == $1333.3332, where the trailing 0.0032 is lost once the result is stored as a two-decimal monetary amount. It also persists with on-paper calculations, because there may be no room for all the digits that your algorithm produces (think x/7) and you have to choose your final acceptable precision.
_At least part of this is a fear that if I don't properly understand its dynamics I will misapply it_
The explanation that you like can't save you from it either. It's not something you think through once and then just write the correct code. Quick exercise with the "new" knowledge you've got: you have an array of a million floats, add them up correctly:
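(For reference, one standard answer is compensated summation; a rough Python sketch of mine, with math.fsum as the library routine that returns an accurately rounded sum:)

import math

def kahan_sum(values):
    total = 0.0
    compensation = 0.0                    # low-order bits lost so far
    for x in values:
        y = x - compensation              # re-inject the previously lost bits
        t = total + y                     # low-order bits of y are lost here...
        compensation = (t - total) - y    # ...and recovered into the compensation
        total = t
    return total

data = [0.1] * 1_000_000
print(sum(data))          # naive left-to-right sum drifts by roughly 1e-6
print(kahan_sum(data))    # much closer to the true sum
print(math.fsum(data))    # accurate reference from the standard library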
I think windowing and bucketing are a useful conceptual model for understanding the precision of the encoding space. I understand how FP values work in practice, but this article really hits the nail on its head when explaining the less tangible aspects of FP encoding.
The "default" formula as presented in the article seems⊠strange. Is this really how it's normally taught?
(-1)^S * 1.M * 2^(E - 127)
This seems unnecessarily confusing. And 1.M isn't notation I've seen before. If we expand 1.M into 1 + M : 0 < M < 1, then we pretty quickly arrive at the author's construction of "windows" and "offsets".
(-1)^S * (1 + M) * 2^(E - 127)
(-1)^S * 2^log_2(1 + M) * 2^(E - 127)
let F := log_2(1 + M), 0 < F < 1
Note: [0 < M < 1] implies [1 < M + 1 < 2] implies [0 < log_2(1+M) < 1]
(-1)^S * 2^(E - 127 + F)
Since we know F is between 0 and 1, we can see that F controls where the number lands between 2^(E - 127) and 2^(E - 127 + 1) (ignoring the sign). It's the "offset".
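To make the window/offset reading concrete, here is a small Python sketch of mine (double precision, so the bias is 1023 rather than the 127 in the single-precision formula above):

import struct

def window_and_offset(x):
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    sign = bits >> 63
    E = (bits >> 52) & 0x7FF           # biased exponent
    M = bits & ((1 << 52) - 1)         # 52 mantissa bits
    window = 2.0 ** (E - 1023)         # a normal value lies in [window, 2*window)
    offset = M / 2**52                 # how far across that window it sits
    return sign, window, offset

print(window_and_offset(3.0))   # (0, 2.0, 0.5): 3.0 sits halfway through [2, 4)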
Some past related threads:
_Floating Point Visually Explained (2017)_ - https://news.ycombinator.com/item?id=23081924 - May 2020 (96 comments)
_Floating Point Visually Explained (2017)_ - https://news.ycombinator.com/item?id=19084773 - Feb 2019 (17 comments)
_Floating Point Visually Explained_ - https://news.ycombinator.com/item?id=15359574 - Sept 2017 (106 comments)
Several years ago I made a basic, interactive explanation of the same-
https://dennisforbes.ca/articles/understanding-floating-poin...
At the time it was to educate a group I was working with about their concern that "every number in JavaScript is a double": it doesn't mean that 1 is actually 1.000000001 (everyone knows that floating point numbers are potentially approximations, so there is a widespread belief that every integer stored as a float must be an approximation too).
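A quick check of that point (my example, not from the linked article): doubles represent every integer up to 2^53 exactly, so small integers like 1 are stored without any error.

>>> 1.0 == 1
True
>>> 2.0**53 == 2.0**53 + 1     # beyond 2**53, consecutive integers start to collide
True
>>> 2.0**53 - 1 == 2.0**53 - 2
False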
As someone who never learned the "dreadful notation... discouraging legions of programmers", this was really helpful! The "canonical" equation doesn't mean anything to me and the window/offset explanation does.
I'm sure the author intended to keep the article short, but I think it would benefit from more examples, including a number like 5e-5 or something. It isn't clear to me how the window/offset explains that.
Edit: to clarify, I do not understand how the floating point representation of 0.00005 is interpreted with windows/offsets.
5e-5 is not really related to FP, it's just scientific notation. The number itself is still stored as shown in the post, it's just being printed in a more compact form.
The number after _e_ is a power of 10:
5e2 = 5 * 10^2 = 5 * 100 = 500
5e-5 = 5 * 10^-5 = 5 / 100000 = 0.00005
Once you internalize this, you just read the exponent as "x zeroes to the left".
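To connect it back to the window/offset picture the article uses, math.frexp shows which power-of-two window 0.00005 falls into (a quick sketch of mine, not from the comment above):

>>> import math
>>> math.frexp(5e-5)
(0.8192, -14)
>>> 2**-15 <= 5e-5 < 2**-14   # the window; the mantissa says where inside it
True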
Scientific notation is easily seen as just floating point in base ten, though. It has the same rules about the leading digit and such. Add in significant digits, and you have most all of the same concerns.
Scientific notation IS floating point.
Seeing the Intel math coprocessor mentioned was a blast from the past. I sold a few of them and installed a few of them when I used to work in computer stores.
It probably differs a lot per person, and I usually do find graphical explanations of stuff easier to grasp. But that formula was way easier to understand for me than the other explanation. Maybe it is related to the fact that I am pretty OK at math, but I have kinda bad dyslexia.
Past comments:
https://news.ycombinator.com/item?id=15359574
https://news.ycombinator.com/item?id=19084773
I was kinda hoping for a visualization of which numbers exist in floating point.
While I always knew about
0.1 + 0.2 -> 0.30000000000000004
it was still kind of an epiphany realizing that floating point numbers don't so much have rounding errors as they are simply discrete numbers.
You can move from one float to the next, which is a meaningful operation on discrete numbers like integers, but not on continuous numbers like the rationals and irrationals.
I also feel like that is a much more important takeaway for a user of floating point than knowing what the mantissa is.
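Python 3.9+ exposes exactly that stepping operation as math.nextafter; a quick demonstration of mine:

>>> import math
>>> math.nextafter(1.0, math.inf)
1.0000000000000002
>>> math.nextafter(0.3, math.inf)
0.30000000000000004
>>> 0.1 + 0.2 == math.nextafter(0.3, math.inf)
True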
I highly recommend watching this video, which explains floats in the context of Mario 64:
https://www.youtube.com/watch?v=9hdFG2GcNuA
Games are a great way to explain floats because they're so visual. Specifically, check out 3:31, where the game has been hacked so that the possible float values in one of the movement directions is quite coarse.
(If you're curious what a PU is, and would like some completely useless knowledge, this video is also an absolute classic and very entertaining:
https://www.youtube.com/watch?v=kpk2tdsPh0A
)
> You can move from one float to the next, which is a meaningful operation on discrete numbers like integers, but not on continuous numbers like the rationals and irrationals.
What do you mean by continuous? Obviously if you take a number line and remove either the rational or irrational numbers, you will end up with infinitely many holes.
The thing that makes floating point numbers unique is that, for any given representation, there are actually only _finitely_ many. There's a largest possible value and a smallest possible value and each number will have gaps on either side of it. I think you meant that the rationals and irrationals are _dense_ (for any two distinct numbers, you can find another number between them), which is also false for floating-point numbers.
> What do you mean by continuous?
I meant continuous in the "mathematical sense". But maybe continuous is actually a statistical term, and dense is the correct mathematical term.
I dunno, I feel it's better to consider a float not as a single discrete number but as an interval.
Wild, just got done listening to Coding Blocks podcast episode on data structure primitives where they go in depth on floating point, fixed point, binary floating point, and more. Great listen! See around 54min mark -
https://www.codingblocks.net/podcast/data-structures-primiti...
I found a tool[0] that helps me debug potential floating point issues when they arise. This one has modes for half-, single- and double-precision IEEE-754 floats, so I can deal with the various issues when converting between 64-bit and 32-bit floats.
[0] https://news.ycombinator.com/item?id=29370883
This is _the_ article that finally helped me grasp and understand IEEE-754. An excellent read.
I also found this video from SimonDev educative (and entertaining):
https://www.youtube.com/watch?v=Oo89kOv9pVk
Having read this article I understand floating point even less now
Anyone got a good reference for practical implications of floating point vs fixed point in measurement calculations (especially timing measurements)? Gotchas, rules of thumb, best practices, etc.?
Why is there no float type with linear precision?
That's called fixed point. There isn't hardware for it because it is cheap to do in software using integer math.
But wouldn't it be sweet to have SIMD acceleration for fixed point?
Or is that something you can do with integers and SIMD today?
Say 4x4 matrix multiplication?
Having implemented a nested PID controller using fixed-point arithmetic in an FPGA, I can tell you that fixed-point is a pain in the ass, and you only use it when floating-point is too slow, too large (in terms of on-chip resources), or consumes too much power, or when you _really_ need to control the precision going into and out of an arithmetic operation.
Your add is just an integer add. Your multiply is a mul_hi, 2 shifts, 1 add, and 1 mul.
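A rough Python sketch of that multiply in Q16.16 fixed point (the shift plays the role of mul_hi here; purely illustrative, not tied to any particular hardware):

FRAC_BITS = 16                        # Q16.16: 16 integer bits, 16 fractional bits

def to_fixed(x):
    return round(x * (1 << FRAC_BITS))

def fixed_mul(a, b):
    return (a * b) >> FRAC_BITS       # drop the extra 16 fractional bits

def to_float(a):
    return a / (1 << FRAC_BITS)

a, b = to_fixed(1.5), to_fixed(2.25)
print(to_float(fixed_mul(a, b)))      # 3.375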
Are you asking if integer SIMD exists? (It does)
Some DSP chips had hardware for fixed point. I think it's a shame that C never added support for fixed point.
There was a draft (ISO/IEC TR 18037, "Embedded C"), and GCC supports it in stdfix.h. The downside is that the types have limited integer range, since they're tailored for DSP applications where values are kept scaled between +/-1.0.