
How does a computer do floating point arithmetic?

I have seen long articles explaining how floating point numbers are stored and how arithmetic on them is done, but please briefly explain why, when I write

cout << 1.0 / 3.0 <<endl;

I see 0.333333, but when I write

cout << 1.0 / 3.0 + 1.0 / 3.0 + 1.0 / 3.0 << endl;

I see 1.

How does the computer do this? Please explain just this simple example. It is enough for me.

Let's do the math. For brevity, we assume that you only have four significant (base-2) digits.

Of course, since gcd(2,3)=1, 1/3 is periodic when represented in base 2. In particular, it cannot be represented exactly, so we have to content ourselves with the approximation

A := 1×1/4 + 0×1/8 + 1×1/16 + 1×1/32

which is closer to the real value of 1/3 than

A' := 1×1/4 + 0×1/8 + 1×1/16 + 0×1/32

So, printing A in decimal gives 0.34375 (the fact that you see 0.333333 in your example is just a testament to the larger number of significant digits in a double).

When adding these up three times, we get

A + A + A
= ( A + A ) + A
= ( (1/4 + 1/16 + 1/32) + (1/4 + 1/16 + 1/32) ) + (1/4 + 1/16 + 1/32)
= (   1/4 + 1/4 + 1/16 + 1/16 + 1/32 + 1/32   ) + (1/4 + 1/16 + 1/32)
= (      1/2    +     1/8         + 1/16      ) + (1/4 + 1/16 + 1/32)
=        1/2 + 1/4 +  1/8 + 1/16  + 1/16 + O(1/32)

The O(1/32) term cannot be represented in the result, so it's discarded and we get

A + A + A = 1/2 + 1/4 + 1/8 + 1/16 + 1/16 = 1

QED :)
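
If you want to replay the toy calculation above on a real machine, here is a minimal sketch. It assumes a made-up format with 4 significant binary digits and round-to-nearest; round_to_bits, add4 and third_times_three_4bit are hypothetical helper names invented for this illustration, not anything the hardware or the standard library provides.

```cpp
#include <cmath>

// Round x to `bits` significant binary digits (round to nearest).
// Hypothetical helper: it imitates the toy format, not real FPU internals.
double round_to_bits(double x, int bits) {
    if (x == 0.0) return 0.0;
    int exp = 0;
    double mant = std::frexp(x, &exp);                     // x == mant * 2^exp, 0.5 <= |mant| < 1
    double kept = std::nearbyint(std::ldexp(mant, bits));  // keep `bits` digits, round off the rest
    return std::ldexp(kept, exp - bits);
}

// Add two values, then round the sum the way a 4-digit machine would.
double add4(double a, double b) {
    return round_to_bits(a + b, 4);
}

// (A + A) + A in the toy format, where A is 1/3 rounded to 4 digits.
double third_times_three_4bit() {
    double A = round_to_bits(1.0 / 3.0, 4);  // 0.34375, the A from the text
    return add4(add4(A, A), A);
}
```

Calling third_times_three_4bit() reproduces the chain in the text: A is 0.34375, A + A is 0.6875, and the last addition gives 1.03125, which rounds to 1 once only four significant digits are kept.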

The problem is that the floating point format represents fractions in base 2.

The first fraction bit is ½, the second ¼, and it goes on as 1/2^n.

And the problem with that is that not every rational number (a number that can be expressed as the ratio of two integers) actually has a finite representation in this base 2 format.
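
To see which fractions terminate in base 2 and which repeat forever, you can do the long division by hand, or with a small sketch like this one. The function name binary_fraction_digits is made up for the illustration, and it assumes 0 <= num < den with values small enough that doubling the remainder does not overflow.

```cpp
#include <string>

// First `count` binary digits after the point of num/den (assumes 0 <= num < den).
// Plain long division in base 2: double the remainder, emit 1 if it reaches den.
std::string binary_fraction_digits(unsigned num, unsigned den, int count) {
    std::string digits;
    unsigned rem = num;
    for (int i = 0; i < count; ++i) {
        rem *= 2;
        if (rem >= den) {
            digits += '1';
            rem -= den;
        } else {
            digits += '0';
        }
    }
    return digits;
}
```

For 1/3 the remainder cycles between 1 and 2, so the digits come out as 0101... without end, while something like 3/4 hits remainder 0 after two digits and stops.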

(This makes the floating point format difficult to use for monetary values. Although these values are always rational numbers (n/100), only .00, .25, .50, and .75 have exact representations in any finite number of digits of a base-2 fraction.)
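
The claim about cents can be checked mechanically. A fraction has a finite base-2 expansion exactly when its reduced denominator is a power of two, so the sketch below reduces n/100 and tests that. is_finite_in_binary and exact_cents are hypothetical names chosen for this example.

```cpp
#include <vector>

// True if num/den (den > 0) has a finite base-2 expansion,
// i.e. the denominator of the reduced fraction is a power of two.
bool is_finite_in_binary(unsigned num, unsigned den) {
    unsigned a = num, b = den;
    while (b != 0) {              // Euclid's algorithm; a ends up as gcd(num, den)
        unsigned t = a % b;
        a = b;
        b = t;
    }
    unsigned reduced = den / a;
    return (reduced & (reduced - 1)) == 0;  // power-of-two test
}

// All cent amounts n in [0, 100) for which n/100 is exactly representable.
std::vector<int> exact_cents() {
    std::vector<int> result;
    for (int n = 0; n < 100; ++n) {
        if (is_finite_in_binary(static_cast<unsigned>(n), 100u)) {
            result.push_back(n);
        }
    }
    return result;
}
```

exact_cents() comes back with just 0, 25, 50 and 75, which is why amounts of money are usually kept as an integer count of cents rather than as a float.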

Anyway, when you add them back, the system eventually gets a chance to round the result to a number that it can represent exactly.

At some point, it finds itself adding the .666... number to the .333... one, like so:

  00111110 1  .o10101010 10101010 10101011
+ 00111111 0  .10101010 10101010 10101011o
------------------------------------------
  00111111 1 (1).0000000 00000000 0000000x  # the x isn't in the final result

The leftmost bit is the sign, the next eight are the exponent, and the remaining bits are the fraction. In between the exponent and the fraction is an assumed "1" that is always present, and therefore not actually stored, as the normalized leftmost fraction bit. I've written zeroes that aren't actually present as individual bits as o.
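
If you would like to look at these fields yourself, here is a small sketch that pulls them out of a float. It assumes float is the 32-bit IEEE 754 single format (true on mainstream platforms, but not guaranteed by the C++ standard); FloatFields and decompose are names made up for this example.

```cpp
#include <cstdint>
#include <cstring>

// The three stored fields of a 32-bit float (the leading "1" is implied, not stored).
struct FloatFields {
    std::uint32_t sign;      // 1 bit
    std::uint32_t exponent;  // 8 bits, biased by 127
    std::uint32_t fraction;  // 23 bits
};

// Assumption: float is IEEE 754 single precision, 32 bits wide.
FloatFields decompose(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes a 32-bit float");
    std::uint32_t bits = 0;
    std::memcpy(&bits, &f, sizeof bits);  // copy the raw bit pattern without aliasing problems
    return { bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu };
}
```

For 1.0f / 3.0f this gives sign 0, exponent 125 and fraction 0x2AAAAB; for 2.0f / 3.0f the fraction is the same and the exponent is 126; and 1.0f has exponent 127 with an all-zero fraction, matching the three rows of the diagram.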

A lot has happened here. At each step, the FPU has taken rather heroic measures to round the result. Two extra digits of precision (beyond what will fit in the result) have been kept, and the FPU knows in many cases whether at least one of the remaining rightmost bits was a one. If so, then that part of the fraction is more than 0.5 (scaled) and so it rounds up. The intermediate rounded values allow the FPU to carry the rightmost bit all the way over to the integer part and finally round to the correct answer.

This didn't happen because anyone added 0.5; the FPU just did the best it could within the limitations of the format. Floating point is not, actually, inaccurate. It's perfectly accurate, but most of the numbers we expect to see in our base-10, rational-number world-view are not representable by the base-2 fraction of the format. In fact, very few are.

As for this specific example: I think the compilers are too clever nowadays, and automatically make sure a const result of primitive types will be exact if possible. I haven't managed to fool g++ into doing an easy calculation like this wrong.

However, it's easy to bypass such things by using non-const variables. Still,

int d = 3;
float a = 1./d;
std::cout << d*a;

will exactly yield 1, although this shouldn't really be expected. The reason, as was already said, is that the operator<< rounds the error away.

As to why it can do this: when you add numbers of similar size or multiply a float by an int, you get pretty much all the precision the float type can maximally offer you. That means the ratio error/result is very small (in other words, the errors occur in a late decimal place, assuming you have a positive error).

So 3*(1./3), even though as a float it is not exactly ==1, has a big correct bias which keeps operator<< from caring about the small errors. However, if you then remove this bias by just subtracting 1, the floating point will slip right down to the error, and suddenly it's not negligible at all any more. As I said, this doesn't happen if you just type 3*(1./3)-1 because the compiler is too clever, but try

int d = 3;
float a = 1./d;
std::cout << d*a << " - 1 = " <<  d*a - 1 << " ???\n";

What I get (g++, 32 bit Linux) is

1 - 1 = 2.98023e-08 ???
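
That number is exactly 2^-25, and whether you see it appears to depend on how much precision the intermediate d*a is carried in. Here is a hedged sketch that makes the precision explicit instead of leaving it to the compiler and FPU. The function names are made up, and the comments on the float version assume IEEE 754 single precision with no extra intermediate precision, which I would expect on a typical 64-bit build but which is not what the 32-bit output above shows.

```cpp
// Product carried in double: the float value of 1/3 times 3 is 1 + 2^-25,
// which a double holds exactly, so subtracting 1 exposes the error.
double residual_in_double(int d) {
    float a = static_cast<float>(1.0 / d);
    return static_cast<double>(a) * d - 1.0;
}

// Product rounded back to float before subtracting. Assumption: the
// multiplication really is done in single precision; then for d == 3 the
// product rounds to exactly 1 and nothing is left over. With extended
// intermediate precision the result may differ.
float residual_in_float(int d) {
    float a = static_cast<float>(1.0 / d);
    float product = a * static_cast<float>(d);
    return product - 1.0f;
}
```

residual_in_double(3) gives 2^-25, about 2.98023e-08, the same value as in the output above. That suggests the 32-bit build kept d*a in a wider register rather than rounding it to float first.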

This works because the default precision is 6 digits, and rounding the result to 6 digits gives 1. See 27.5.4.1 basic_ios constructors in the C++ draft standard (n3092).
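
You can watch the effect of the stream precision directly by printing the same value with different settings. A minimal sketch, where format_with_precision is a hypothetical helper that writes to a string stream instead of cout so the result is easy to inspect:

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Format `value` the way operator<< would, using `digits` significant digits.
// A fresh stream starts with precision 6; setprecision overrides it here.
std::string format_with_precision(double value, int digits) {
    std::ostringstream out;
    out << std::setprecision(digits) << value;
    return out.str();
}
```

With 6 digits, 1.0 / 3.0 comes out as 0.333333, and a value slightly above 1 such as 1 + 2^-25 comes out as plain 1. Asking for 17 digits shows 0.33333333333333331, which is the double that is really stored.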
