Why can’t computers calculate decimals as floats properly?

Anonymous 0 Comments

They actually can, if they are programmed to do so. But they don’t do this “natively” – so to speak. There are certain libraries that handle what we call “arbitrary precision arithmetic” and they can do exactly what you describe, but at a very hefty performance cost. (Warning, this post will be very long. But I’ll try to explain exactly why).

The reason that arbitrary precision arithmetic isn’t really the default way we handle decimals is that it’s not the “language” that the CPU speaks. The computer only innately understands binary, so if we want to work with base-10 fractional numbers, we first have to decide how to represent them in binary. Luckily, this isn’t particularly complicated for whole numbers. We can easily calculate what any whole number’s binary equivalent will be, and convert it back to base-10 just as easily. But what about decimals?
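To make the whole-number case concrete, here is a minimal Python sketch of the conversion in both directions. The function names `to_binary` and `from_binary` are made up for illustration; Python's built-in `bin()` and `int(text, 2)` do the same job.

```python
def to_binary(n: int) -> str:
    """Convert a non-negative whole number to binary by repeatedly dividing by 2."""
    if n == 0:
        return "0"
    bits = []
    while n > 0:
        bits.append(str(n % 2))  # the remainder is the next bit, lowest first
        n //= 2
    return "".join(reversed(bits))


def from_binary(bits: str) -> int:
    """Convert a binary string back to a whole number."""
    value = 0
    for bit in bits:
        value = value * 2 + int(bit)  # shift left one place, then add the new bit
    return value


print(to_binary(15))            # 1111
print(from_binary("00001111"))  # 15
```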

This is where things get way more complicated. Let’s take the number 15.1. How do we represent this in binary? Do we make a number like 00001111.00000001? If we decide to represent it this way, how would we differentiate between 15.1 and, say, 15.01? Do we also make that number 00001111.00000001? How do we tell the difference between these two?

As you can tell, it creates a problem. Where do we put the decimal point? It’s totally ambiguous what this number could mean because we have no idea how many 0s are supposed to be after the decimal point. And in fact, our problems don’t just end there. Our base 10 number used 10 digits (0 to 9), but we had a separate symbol to represent the decimal point, so we really have 11 symbols. The computer only has two “symbols” for binary, and those are 1 and 0. They represent either a wire being on or off. We would need some kind of tri-state transistor that could take three symbols to even represent the decimal point at all, and CPUs don’t use these, they do everything in pure binary. So, as you can see, we have multiple problems to solve here.

**To solve this (and to represent fractional numbers properly in binary), we’re going to need to create some kind of convention/format that we can agree on.** One primitive way to start would be to basically do what we, as humans, do when we’re solving arithmetic on paper, and just split everything up one digit at a time. Let’s assume that we want to convert decimal to binary like this, and since we only have digits 0 through 9 to worry about, we only really NEED 4 bits (a total of 16 combinations) to represent this. This leaves us some extra ones, but luckily, this is a good thing since we need to pick one to signify our decimal point too. I’m going to pick 1111 to signify our totally arbitrarily decided “binary decimal point” – so anywhere you see 1111 below, it’s a decimal point.

15.1 becomes: 0001 0101 1111 0001

15.01 becomes: 0001 0101 1111 0000 0001

1000.16 becomes: 0001 0000 0000 0000 1111 0001 0110

23,598.776 becomes: 0010 0011 0101 1001 1000 1111 0111 0111 0110

*(Remember, we decided to make 1111 signify our decimal point. We decided this arbitrarily, we could have made it anything that wasn’t already taken, but 1111 is convenient because it’s easy to remember.)*
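The digit-by-digit convention above can be sketched in a few lines of Python. This is only an illustration of the made-up format from this answer: the names `encode`, `decode` and `POINT` are hypothetical, and the thousands comma is simply skipped.

```python
POINT = "1111"  # the arbitrarily chosen 4-bit marker for the decimal point


def encode(text: str) -> str:
    """Turn a base-10 string such as '15.01' into space-separated 4-bit groups."""
    groups = []
    for ch in text:
        if ch == ".":
            groups.append(POINT)
        elif ch in "0123456789":
            groups.append(format(int(ch), "04b"))  # one digit -> 4 bits
        elif ch == ",":
            continue  # thousands separators carry no value
        else:
            raise ValueError(f"unexpected character: {ch!r}")
    return " ".join(groups)


def decode(bits: str) -> str:
    """Turn the 4-bit groups back into a base-10 string."""
    out = []
    for group in bits.split():
        if group == POINT:
            out.append(".")
            continue
        digit = int(group, 2)
        if digit > 9:
            raise ValueError(f"unused bit pattern: {group}")
        out.append(str(digit))
    return "".join(out)


print(encode("15.1"))   # 0001 0101 1111 0001
print(encode("15.01"))  # 0001 0101 1111 0000 0001
```

Note that 15.1 and 15.01 now get different encodings, which was the whole point of the exercise.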

**This would work, but there is still a problem: This is hugely inefficient. We just created a massive number of additional calculations we have to do.**

Think about it. We’ve already used this many bits just for small numbers, and if we were to try to add them up, we would have to do tons of little calculations on each digit (just like we do as humans on a sheet of paper). And this would become monstrously expensive with larger numbers, because we’ve effectively “wasted” a lot of bits doing this. *As you can probably see, even though our convention makes intuitive sense to us, it’s not really the most efficient way for the CPU to handle things since we’re having to create a totally abstract number system, translate to it and process it, then translate back.*

So, to get around this, we came up with a different language/format for representing fractional numbers in the 1980s, and this was called IEEE 754. It involves a totally different way of representing the numbers that uses a binary version of scientific notation instead. Rather than inefficiently trying to force a base 10 representation of the digits and performing calculations on them individually, we can now perform calculations on the entire number in binary and do it all at once.

A 32-bit IEEE 754 number basically has three parts: a sign (the very first bit, which tells us whether it’s positive or negative), another 8 bits that signify an exponent, and 23 bits (called the “mantissa”) that signify the fraction following the binary point. We can perform calculations on this “in native binary” rather than breaking things up into arbitrary decimal-like representations, and it makes calculations FAR more efficient for the CPU and saves us a lot of hassle and time.
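If you want to see those three parts for yourself, here is a rough Python sketch that pulls them out of a 32-bit float using the standard `struct` module. The helper names `float32_fields` and `rebuild` are invented for this example. The rebuild step relies on two details the description above skips: the stored exponent is offset by 127, and ordinary ("normal") numbers have an implied leading 1 in front of the mantissa. It ignores special cases such as zero, infinities and NaN.

```python
import struct


def float32_fields(x: float) -> tuple[int, int, int]:
    """Split x, rounded to a 32-bit float, into (sign, exponent, mantissa)."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))  # reinterpret the 4 bytes as an integer
    sign = raw >> 31                 # 1 bit
    exponent = (raw >> 23) & 0xFF    # 8 bits, stored with an offset of 127
    mantissa = raw & 0x7FFFFF        # 23 bits of fraction
    return sign, exponent, mantissa


def rebuild(sign: int, exponent: int, mantissa: int) -> float:
    """Recompute the value of a normal 32-bit float from its three fields."""
    return (-1) ** sign * (1 + mantissa / 2**23) * 2.0 ** (exponent - 127)


sign, exponent, mantissa = float32_fields(15.1)
print(sign, exponent, mantissa)
print(rebuild(sign, exponent, mantissa))  # close to 15.1, but not exactly 15.1
```

The rebuilt value comes out slightly off from 15.1 because 15.1 has no exact finite representation in binary, so the mantissa holds the nearest value that fits in 23 bits.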

*You might be wondering why we represent every number using just 32 bits, but believe it or not, we can actually handle an incredibly wide range of numbers using just these 32 bits. We can represent small numbers with very decent precision like this, and if we’re willing to tolerate a larger margin of error, we can represent huge numbers too, up to roughly 3.4 × 10^38 (that’s a number with 39 digits). We also have 64 bit floats that are known as “double precision” floats, and these have more bits, can be more precise when we need them to be, and stretch the range to numbers with hundreds of digits (Powerball jackpot multiplied by the number of atoms in the observable universe? Coming your way, give or take a rounding error.).*

**IEEE 754 became standardized in the 80s, and nearly all CPUs (and even many microcontrollers) now have dedicated hardware and CPU instructions that can process these floats.** They are far more efficient, way easier to calculate in binary, and are usually more than precise enough for most of what we do (and even give us tons of extra flexibility to do things we couldn’t do in ordinary binary, such as represent numbers dozens of digits long with only 32 bits). Even though there is a long list of reasons why it was adopted, there are still cases where it isn’t quite practical (bank transactions are a perfect example). We have arbitrary precision arithmetic libraries to handle these kinds of situations, but they are inefficient enough that we don’t usually use them by default.
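As a quick illustration of why money is the classic exception, here is a minimal Python comparison between ordinary floats and the standard library's `decimal` module, which is one example of the kind of library described above. The amounts are made up.

```python
from decimal import Decimal

# Ordinary 64-bit floats: 0.1 and 0.2 have no exact binary form,
# so the tiny errors show up in the sum.
float_total = 0.1 + 0.2
print(float_total)         # 0.30000000000000004
print(float_total == 0.3)  # False

# Decimal works in base 10, so these amounts are stored exactly.
# Note the values are passed in as strings, not as floats.
decimal_total = Decimal("0.10") + Decimal("0.20")
print(decimal_total)                     # 0.30
print(decimal_total == Decimal("0.30"))  # True
```

The price is speed: `Decimal` arithmetic is done in software instead of by the CPU's floating point hardware.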
