0.1+0.2==0.3 evaluates to false in most programming languages because the result is something like 0.30000000000000004. What’s the reason behind this?

In: Engineering

Because programming languages are giving you an approximation of decimal numbers. (There’s a piece of hardware called a [Floating Point Unit](https://en.wikipedia.org/wiki/Floating-point_unit) in most systems that *also* has these limitations, so in many ways the language is just lazily giving you access to that; but the real question is “What sparked the design of this unit anyway? It seems a bit of an odd way to design something!”)

Floating Point numbers span a huge expanse, at the expense of being accurate.

They take approximately the form:

x * 2^y

similar to “scientific notation”, something you may remember from your school days:

x * 10^y

The other limitation is you have a finite number of places for each number, let’s say 8ish (in real life, most of the floating point number is devoted to `x` [called a ‘mantissa’] rather than the exponent). Some numbers can’t be represented this way. For base 10, consider 1/3. You can’t actually represent it in scientific notation, you’re left with:

33333333 * 10^-8

but when you perform operations on that, it won’t come out as you expect: 1/3 + 1/3 + 1/3 = .99999999
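To see the base 10 version of the problem in action, here is a minimal sketch using Python’s `decimal` module, limited to 8 significant digits. The 8-digit limit is just the “8ish” assumption from above, not anything a real machine uses.

```python
from decimal import Decimal, localcontext

with localcontext() as ctx:
    ctx.prec = 8  # keep only 8 significant decimal digits
    third = Decimal(1) / Decimal(3)
    total = third + third + third

print(third)                # 0.33333333
print(total)                # 0.99999999
print(total == Decimal(1))  # False
```

The rounding happened once, when 1/3 was stored, and every later operation just inherits it.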

Now floating point is a little different than that, it’s:

(1.0 + M) x 2^(E-127)

(in addition there’s a sign bit that can be thought of as “is negative”)
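As a rough sketch of that formula, the snippet below pulls the sign, `E` and `M` fields out of a 32-bit float and puts them back together. The helper names `decode_float32` and `rebuild` are made up for illustration, and the formula as written only covers “normal” numbers (not zero, infinities, NaN or the very tiny subnormals).

```python
import struct

def decode_float32(value):
    """Round value to a 32-bit float and split it into sign, E and raw fraction bits."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                # 1 bit: "is negative"
    exponent = (bits >> 23) & 0xFF   # E, the 8-bit biased exponent
    fraction = bits & 0x7FFFFF       # 23 raw bits; M = fraction / 2**23
    return sign, exponent, fraction

def rebuild(sign, exponent, fraction):
    """Apply (1.0 + M) x 2^(E-127); valid for normal numbers only."""
    m = fraction / 2**23
    return (-1) ** sign * (1.0 + m) * 2.0 ** (exponent - 127)

sign, exponent, fraction = decode_float32(0.1)
print(sign, exponent, hex(fraction))      # 0 123 0x4ccccd
print(rebuild(sign, exponent, fraction))  # close to 0.1, but not exactly 0.1
```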

.1 (or 1/10) is like 1/3 in floating point. It’s one of these numbers that exists logically, but can’t be represented exactly. Like .3333… in decimal, 0.1 in binary is 0.0**0011…** (sorry for lack of reddit formatting, but the bold bit is repeating as the threes were). This error in representation flows through all your operations.
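If you want to check the repeating pattern yourself, here is a small sketch that does the binary long division by hand. `binary_digits` is a hypothetical helper, and it uses exact fractions so no rounding sneaks in.

```python
from fractions import Fraction

def binary_digits(value, count):
    """Return the first `count` binary digits after the point for 0 <= value < 1."""
    digits = []
    for _ in range(count):
        value *= 2
        bit = int(value)  # the whole part is the next digit, 0 or 1
        digits.append(str(bit))
        value -= bit
    return "".join(digits)

print("0." + binary_digits(Fraction(1, 10), 20))  # 0.00011001100110011001
print("0." + binary_digits(Fraction(1, 4), 8))    # 0.01000000, this one terminates
```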

However, you’re trading these small inaccuracies for a much larger range of numbers available. If you naively used your 32 bits to do 16-bit-Integer-before-decimal.16-bit-Integer-after-decimal, you’d get about **4.6 decimal digits** in each direction (and that’s *unsigned*). Going floating point, you get **37.3 digits** in some situations. (Where .3 digits means you can run that decimal place up to about 3, but not any higher.)

So, basically the idea is instead of being able to represent up to 65536, you can represent up to 340282350000000000000000000000000000000, just not very accurately.

Most languages have Fixed, Decimal, or Money types to accurately represent decimals with values where these small inaccuracies will ruin your day.
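In Python, for example, that type is `decimal.Decimal`. A minimal sketch (note the values are built from strings, since building them from the float `0.1` would copy the error in):

```python
from decimal import Decimal

float_sum = 0.1 + 0.2
decimal_sum = Decimal("0.1") + Decimal("0.2")

print(float_sum == 0.3)               # False
print(decimal_sum == Decimal("0.3"))  # True
print(decimal_sum)                    # 0.3
```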

Computers do not work with the decimal system; they work in the binary system. What you ask for looks perfectly fine in the decimal system. 0.1+0.2 is interpreted with only whole numbers as 1/10 + 2/10, which is easy to calculate. However, if you try to convert 0.1 into a fraction of whole numbers with a power of two as the denominator, it does not quite work. You may start with 1.6/16, but 1.6 is not a whole number. Even 3.2/32 does not work. You may go up to 102.4/1024, which still does not work. You may even spot the pattern and figure out that it does not work for any power of two. At some point the computer just gives up as it runs out of digits. The problem then is of course that it ends up having to round the number to a certain number of binary digits. So then you end up with rounding errors in your calculations.
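The “try bigger and bigger powers of two” search can be sketched like this. `smallest_power_of_two` is a made-up helper, and stopping at 2^64 is an arbitrary cutoff for the demo.

```python
from fractions import Fraction

def smallest_power_of_two(value, max_power=64):
    """Return the smallest k where value * 2**k is a whole number, or None if none is found."""
    for k in range(max_power + 1):
        if (value * 2**k).denominator == 1:
            return k
    return None

print(smallest_power_of_two(Fraction(3, 8)))   # 3, because 3/8 is exactly 3 / 2**3
print(smallest_power_of_two(Fraction(1, 10)))  # None, the factor of 5 never cancels

# What a 64-bit float stores instead: the nearest fraction with a power-of-two denominator.
print(Fraction(0.1) == Fraction(3602879701896397, 2**55))  # True on IEEE 754 doubles
```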

This is a problem introduced by what we call **floating-point numbers**, or simply “**floats**” for short. Another answer here pretty succinctly states what the problem is broadly, I’ll delve into a little more detail about why floats exist in this way (while keeping it relatively simple).

I’m going to assume you’re readily familiar with what an **integer** is in the context of programming and how it’s stored. They’re whole numbers that are allotted some fixed number of bits to represent the number. The amount of bits you give and the exact method they’re encoded in will determine how big or small of numbers you can store. Most computers these days will store a typical “int” as a 32-bit value. This can store any number between 0 and 4,294,967,295. You can optionally choose to sacrifice one of the bits to serve as a negative (creating a “signed” integer), which will halve the amount of positive values you have access to, but then allow you to store just as many negative numbers too.

If you want to store a *fractional* value (not a whole number), one way you could choose to do that would be to store a literal fraction, i.e., a numerator and a denominator. 1 and 10 to represent 1/10, for example. If you do it this way, the error you spot would never occur as long as neither piece overflowed in any way. One problem with something like this is that implementations of this approach tend to be very wasteful of bits… imagine using two 32-bit integers to store a single fraction. You now are taking double the bits to store a fractional value than you are an integer value. You could cut each in half to make the composite the same size, but that vastly reduces the values that your numerator and denominator can be. These implementations also tend to be very inefficient in computations, since the computer has to keep track of which one is the numerator and which one is the denominator, and probably has to do a lot of register shuffling to keep the two from getting mixed up when performing math operations.
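Python ships a numerator/denominator type like this as `fractions.Fraction`, which makes both the upside and the cost easy to see. This is only a sketch: Python’s integers grow as needed, so nothing overflows here, but the last line shows how quickly a fixed 32-bit denominator would run out of room.

```python
from fractions import Fraction

exact = Fraction(1, 10) + Fraction(2, 10)
print(exact)                    # 3/10
print(exact == Fraction(3, 10)) # True

# The cost: denominators grow quickly as you keep doing arithmetic.
total = sum(Fraction(1, n) for n in range(1, 31))
print(total.denominator > 2**32)  # True, already too big for a 32-bit integer
```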

To solve both of these, **floats** were created as a compromise. They have a fixed number of bits, usually 32 (64 bit floats exist and are often called “doubles”, meaning “double-precision”). The bits are essentially split into two pieces, the “base” and the “mantissa”. The “base” is, more or less, just a whole number that uses *only* the minimum number of bits necessary to describe its value. All other bits are assigned to the mantissa, which for the purposes of this ELI5 can simply be thought of as “the magic decimal garbage”. **NOTE** that the actual way the number is stored in memory isn’t actually an int followed by extra garbage that just denotes the fractional piece; that would still give us the computational speed problem. The real implementation is far more convoluted than that, but thinking of it this way for now makes it simpler to digest.

The *float* type gets its name because of this key fact that the division line between the two pieces “floats” around, not existing in any one specific place but instead varying based on what number it’s actually storing. If the base is a large number, it will eat more bits and push the boundary over; if it’s a very small number, it will eat fewer bits and the boundary will move the other way. The more bits the mantissa is given, the more precise fractions it can describe. This yields a pretty elegant system where numbers that are very close to 0 (and thus, have very small bases and big mantissas) can hold very precise fractions, while numbers that are extremely large (huge bases and tiny mantissas) start to lose out on precision. Since, as described before, floats aren’t actually int + decimal, this trading of bases for precision can even be completely inverted, where whole numbers themselves start being skipped over. This makes floats useless for counting at very high values, but allows them to store way, WAY larger values than integers can if you don’t care about being super duper accurate.
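The “whole numbers start being skipped” part is easy to demonstrate with 32-bit floats. Python’s own floats are 64-bit, so this sketch squeezes values through `struct` to round them to 32 bits; `to_float32` is just a helper name picked for the example.

```python
import struct

def to_float32(value):
    """Round a Python float to the nearest 32-bit float and return it."""
    return struct.unpack("f", struct.pack("f", value))[0]

print(to_float32(16777216.0))  # 16777216.0, which is 2**24 and fits exactly
print(to_float32(16777217.0))  # 16777216.0, the next whole number gets skipped
print(to_float32(16777218.0))  # 16777218.0, from here on only even numbers fit
print(to_float32(0.5))         # 0.5, small simple fractions are still exact
```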

The result of this compromise is that the float can only store a finite number of fractions between any two whole numbers. These fractions are more or less on equal intervals where the interval size is determined by the number of bits that the mantissa has access to. This creates granular “steps” to decimals between numbers. Anything that comes up in math operations that doesn’t land neatly on one of those steps will get rounded to the closest available step. **Your error happens because the fraction `1/10` does NOT land on one of these little steps.** It gets rounded to something very close to, but *not quite equal to* 1/10. If this was a calculation with very messy decimals and you only care about the answer precise to two or three decimal places this wouldn’t matter, but if you are trying to add two numbers like this together and you’re expecting a clean, precise output, you will quickly notice the tiny errors building up.
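You can look at the “steps” around 0.1 directly. This sketch assumes Python 3.9 or newer, where `math.nextafter` and `math.ulp` exist, and it uses 64-bit floats, so the steps are much finer than in a 32-bit float.

```python
import math
from decimal import Decimal
from fractions import Fraction

x = 0.1
below = math.nextafter(x, 0.0)  # the step just under the stored value
above = math.nextafter(x, 1.0)  # the step just over it

# Decimal(float) shows the exact value that is actually stored.
for step in (below, x, above):
    print(Decimal(step))

print(math.ulp(x))  # size of one step at this magnitude

# The true 1/10 falls between steps; the stored value is the closest one.
distance = abs(Fraction(x) - Fraction(1, 10))
print(distance > 0)                           # True
print(distance <= Fraction(math.ulp(x)) / 2)  # True
```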

Does the Posit number type get around the limitations of floating point values being rounded? I have briefly looked at it, but I’m still not familiar enough with Posit to know.

The problem is due to the fact that tenths cannot be accurately stored in a binary floating point representation. Thus, most numbers with a decimal part contain a bit of rounding error, which typically isn’t shown. Typing 0.1 doesn’t yield a value that is *precisely* 0.1, and similarly with 0.2 and 0.3, but the error doesn’t scale linearly with the size of the number. E.g. the error in 0.3 is not three times as large as for 0.1 just because 0.3 is three times the size of 0.1.
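Here is a small sketch that measures those hidden errors exactly, assuming 64-bit floats. `representation_error` is a hypothetical helper. One thing it shows: doubling is exact in binary, so 0.2 carries exactly twice the error of 0.1, while 0.3 ends up with an error of the opposite sign, and something like 0.5 has none at all.

```python
from fractions import Fraction

def representation_error(text):
    """Exact difference between the stored float and the decimal literal in `text`."""
    return Fraction(float(text)) - Fraction(text)

for text in ("0.1", "0.2", "0.3", "0.5"):
    # float() here is only used to print the exact error in a readable form
    print(text, float(representation_error(text)))

print(representation_error("0.2") == 2 * representation_error("0.1"))  # True
print(representation_error("0.3") < 0 < representation_error("0.1"))   # True
```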

If you want precision, you have to use integers, not floating point numbers, or a specific decimal format that gets around this problem. Switching to a programming language intended for computational work, like FORTRAN or MATLAB, won’t help by itself, since those use the same binary floating point by default.
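A minimal sketch of the integer approach, using money counted in whole cents as an assumed example. When you do have to stay with floats, comparing with a tolerance (here `math.isclose` from the standard library) is a common workaround rather than an exact fix.

```python
import math

# Integers: count in the smallest unit (cents) instead of fractions of a dollar.
total_cents = 10 + 20
print(total_cents == 30)                                  # True
print(f"{total_cents // 100}.{total_cents % 100:02d}")    # 0.30

# Floats: compare with a tolerance instead of ==.
print(0.1 + 0.2 == 0.3)              # False
print(math.isclose(0.1 + 0.2, 0.3))  # True
```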