
Floating point representations seem to do integer arithmetic correctly - why?

I've been playing around with floating point numbers a little bit, and based on what I've learned about them in the past, the fact that 0.1 + 0.2 ends up being something like 0.30000000000000004 doesn't surprise me.

What does surprise me, however, is that integer arithmetic always seems to work just fine and not have any of these artifacts.

I first noticed this in JavaScript (Chrome V8 in node.js):

0.1 + 0.2 == 0.3 // false, NOT surprising
123456789012 + 18 == 123456789030  // true
22334455667788 + 998877665544 == 23333333333332 // true
1048576 / 1024 == 1024  // true

C++ (gcc on Mac OS X) seems to have the same properties.

The net result seems to be that integer numbers just — for lack of a better word — work. It's only when I start using decimal numbers that things get wonky.

Is this a feature of the design, a mathematical artifact, or some optimisation done by compilers and runtime environments?


It's a feature of the real numbers. A theorem from modern algebra (modern algebra, not high school algebra; math majors take a class in modern algebra after their basic calculus and linear algebra classes) says that for any integer base b greater than 1, any positive real number r can be expressed as r = a * b^p, where a is in [1, b) and p is some integer. For example, 1024 = 1.024 * 10^3 in base 10. It is this theorem that justifies our use of scientific notation.
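As a quick illustration in node.js (just the b = 10 case of that decomposition; the specific values are arbitrary), Number.prototype.toExponential() splits a number into exactly this a and p:

(1024).toExponential()      // "1.024e+3"  -> a = 1.024, p = 3
(0.001234).toExponential()  // "1.234e-3"  -> a = 1.234, p = -3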

That number a can be classified as terminating (e.g. 1.0), repeating (1/3 = 0.333...), or non-repeating (the representation of pi). There's a minor issue here with terminating numbers: any terminating number can also be represented as a repeating number. For example, 0.999... and 1 are the same number. This ambiguity in representation can be resolved by specifying that numbers which can be represented as terminating are represented that way.

What you have discovered is a consequence of the fact that all integers have a terminating representation in any base.

There is an issue here with how the reals are represented in a computer. Just as int and long long int don't represent all of the integers, float and double don't represent all of the reals. The scheme used on most computers to represent a real number r is to write it in the form r = a * 2^p, but with the mantissa (or significand) a truncated to a certain number of bits and the exponent p limited to some finite range. What this means is that some integers cannot be represented exactly. For example, even though a googol (10^100) is an integer, its floating point representation is not exact. The base 2 representation of a googol is a 333 bit number, and that 333 bit mantissa is truncated to 52+1 bits.
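One way to see this from node.js (a small sketch; BigInt supplies the exact 101-digit googol to compare against the double 1e100):

const googolAsDouble = 1e100;           // nearest double to 10**100, mantissa truncated to 53 bits
const googolExact = 10n ** 100n;        // exact googol computed with BigInt

BigInt(googolAsDouble) === googolExact  // false: the double is not exactly a googol
googolAsDouble + 1 === googolAsDouble   // true:  adding 1 is far below the spacing between doubles this large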

One consequence of this is that double precision arithmetic is no longer exact, even for integers, once the integers in question are greater than 2^53. Try your experiment using the type unsigned long long int on values between 2^53 and 2^64. You'll find that double precision arithmetic is no longer exact for these large integers.
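The same boundary is visible without leaving JavaScript (a quick sketch; Number.MAX_SAFE_INTEGER is 2^53 - 1 for IEEE 754 doubles):

Number.MAX_SAFE_INTEGER === 2 ** 53 - 1  // true: below this, every integer has its own double
2 ** 53 === 2 ** 53 + 1                  // true: 2^53 + 1 has no exact double and rounds back down
2 ** 53 + 2 === 9007199254740994         // true: between 2^53 and 2^54 only every other integer is representable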

I'm writing this under the assumption that JavaScript uses double-precision floating-point representation for all numbers.

Some numbers have an exact representation in the floating-point format, in particular all integers x with |x| < 2^53. Some numbers don't, in particular fractions such as 0.1 or 0.2, which become infinitely repeating fractions in binary representation.

If all operands and the result of an operation have an exact representation, then it would be safe to compare the result using == .
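Checking this rule against the question's own examples in node.js (nothing beyond the rule above, just applied to those numbers):

0.1 + 0.2 === 0.3                    // false: none of 0.1, 0.2, 0.3 has an exact binary representation
123456789012 + 18 === 123456789030   // true:  every operand and the result are exact doubles below 2**53
1048576 / 1024 === 1024              // true:  powers of two divide exactly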

Related questions:

What number in binary can only be represented as an approximation?

Why can't decimal numbers be represented exactly in binary?

Integers within the representable range are exactly representable by the machine; floats are not (well, most of them aren't).

If by "basic integer math" you understand "feature", then yes, you can assume correctly implementing arithmetic is a feature.

The reason is that you can represent every whole number (1, 2, 3, ...) exactly in binary format (0001, 0010, 0011, ...).

That is why integer arithmetic is always correct: 0011 - 0001 is always 0010. The problem with floating point numbers is that the part after the point cannot, in general, be converted to binary exactly.
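You can inspect those binary forms directly in node.js with toString(2) (a quick sketch; the string for 0.1 is abbreviated, and what it prints is the stored double, not an exact one tenth):

(3).toString(2)       // "11"
(3 - 1).toString(2)   // "10"    -- integer subtraction stays exact
(0.25).toString(2)    // "0.01"  -- 1/4 is a finite binary fraction
(0.1).toString(2)     // "0.000110011001100110011..." (abbreviated) -- 1/10 has no finite binary expansion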

All of the cases that you say "work" are ones where the numbers you have given can be represented exactly in the floating point format. You'll find that adding 0.25 and 0.5 and 0.125 works exactly too because they can also be represented exactly in a binary floating point number.

It's only values that can't be represented exactly, such as 0.1, where you'll get what appear to be inexact results.
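For example, in node.js (the specific sums are just illustrations of the point above):

0.25 + 0.5 + 0.125 === 0.875   // true:  1/4 + 1/2 + 1/8, all finite binary fractions
0.1 + 0.1 + 0.1 === 0.3        // false: the rounding error in each stored 0.1 accumulates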

Integers are exact because the imprecision results mainly from the way we write decimal fractions, and secondarily from the fact that many rational numbers simply don't have non-repeating representations in any given base.

See: https://stackoverflow.com/a/9650037/140740 for the full explanation.

That only works when you add a sufficiently small integer to a very large integer - and even in that case, you can't represent both integers in "floating point" format.

Not every number can be represented exactly in floating point; this is due to the way they are encoded. The wiki page explains it better than I can: http://en.wikipedia.org/wiki/IEEE_754-1985 . So when you are comparing floating point numbers, you should use a delta:

Math.abs(myFloat - expectedFloat) < delta

You can use the machine epsilon (the smallest representable difference between 1.0 and the next floating point number) as delta.
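In node.js that might look like this (a minimal sketch; myFloat and expectedFloat are just the placeholder names from the line above, and Number.EPSILON is the machine epsilon, 2^-52):

const expectedFloat = 0.3;
const myFloat = 0.1 + 0.2;                           // 0.30000000000000004

myFloat === expectedFloat                            // false: exact comparison fails
Math.abs(myFloat - expectedFloat) < Number.EPSILON   // true:  tolerance comparison passes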
