How is floating point arithmetic done on a computer?

Question

I have seen long articles explaining how floating point numbers are stored and how arithmetic on them is done, but please briefly explain why, when I write

cout << 1.0 / 3.0 <<endl;

I see 0.333333, but when I write

cout << 1.0 / 3.0 + 1.0 / 3.0 + 1.0 / 3.0 << endl;

I see 1.

How does the computer do this? Please explain just this simple example. It is enough for me.

Answer 1

See the article "What Every Computer Scientist Should Know About Floating-Point Arithmetic".

Answer 2

Let's do the math. For brevity, we assume that you only have four significant (base-2) digits.

Of course, since gcd(2,3)=1, 1/3 is periodic when represented in base 2. In particular, it cannot be represented exactly, so we need to content ourselves with the approximation

A := 1×1/4 + 0×1/8 + 1×1/16 + 1×1/32

which is closer to the real value of 1/3 than

A' := 1×1/4 + 0×1/8 + 1×1/16 + 0×1/32

So, printing A in decimal gives 0.34375 (the fact that you see 0.33333 in your example is just testament to the larger number of significant digits in a double).

When adding these up three times, we get

A + A + A
= ( A + A ) + A
= ( (1/4 + 1/16 + 1/32) + (1/4 + 1/16 + 1/32) ) + (1/4 + 1/16 + 1/32)
= (   1/4 + 1/4 + 1/16 + 1/16 + 1/32 + 1/32   ) + (1/4 + 1/16 + 1/32)
= (      1/2    +     1/8         + 1/16      ) + (1/4 + 1/16 + 1/32)
=        1/2 + 1/4 +  1/8 + 1/16  + 1/16 + O(1/32)

With only four significant digits, the O(1/32) term cannot be represented in the result, so it is rounded away and we get

A + A + A = 1/2 + 1/4 + 1/8 + 1/16 + 1/16 = 1

QED :)

Answer 3

The problem is that the floating point format represents fractions in base 2.

The first fraction bit is ½, the second ¼, and it goes on as 1/2ⁿ.

And the problem with that is that not every rational number (a number that can be expressed as the ratio of two integers) actually has a finite representation in this base 2 format.

(This makes the floating point format difficult to use for monetary values. Although these values are always rational numbers (n/100), only .00, .25, .50, and .75 actually have exact representations in any number of digits of a base two fraction.)

Anyway, when you add them back, the system eventually gets a chance to round the result to a number that it can represent exactly.

At some point, it finds itself adding the .666... number to the .333... one, like so:

  00111110 1  .o10101010 10101010 10101011
+ 00111111 0  .10101010 10101010 10101011o
------------------------------------------
  00111111 1 (1).0000000 00000000 0000000x  # the x isn't in the final result

The leftmost bit is the sign, the next eight are the exponent, and the remaining bits are the fraction. In between the exponent and the fraction is an assumed "1" that is always present, and therefore not actually stored, as the normalized leftmost fraction bit. I've written zeroes that aren't actually present as individual bits as o.

A lot has happened here: at each step, the FPU has taken rather heroic measures to round the result. Two extra digits of precision (beyond what will fit in the result) have been kept, and in many cases the FPU also knows whether any of the remaining rightmost bits were one. If so, then that part of the fraction is more than 0.5 (scaled), and so it rounds up. The intermediate rounded values allow the FPU to carry the rightmost bit all the way over to the integer part and finally round to the correct answer.

This didn't happen because anyone added 0.5; the FPU just did the best it could within the limitations of the format. Floating point is not, actually, inaccurate. It's perfectly accurate, but most of the numbers we expect to see in our base-10, rational-number world-view are not representable by the base-2 fraction of the format. In fact, very few are.

Answer 4

As for this specific example: I think compilers are too clever nowadays, and automatically make sure a constant result of primitive types will be exact if possible. I haven't managed to fool g++ into getting an easy calculation like this wrong.

However, it's easy to bypass such things by using non-const variables. Still,

int d = 3;
float a = 1./d;
std::cout << d*a;

will exactly yield 1, although this shouldn't really be expected. The reason, as was already said, is that operator<< rounds the error away.

As to why it can do this: when you add numbers of similar size or multiply a float by an int, you get pretty much all the precision the float type can maximally offer you. That means the ratio error/result is very small (in other words, the errors occur in a late decimal place, assuming you have a positive error).

So 3*(1./3), even though as a float it is not exactly ==1, has a big correct bias which keeps operator<< from showing the small errors. However, if you then remove this bias by just subtracting 1, the floating point will slip down right to the error, and suddenly it's not negligible at all any more. As I said, this doesn't happen if you just type 3*(1./3)-1 because the compiler is too clever, but try

int d = 3;
float a = 1./d;
std::cout << d*a << " - 1 = " <<  d*a - 1 << " ???\n";

What I get (g++, 32 bit Linux) is

1 - 1 = 2.98023e-08 ???

Answer 5

It works because the default precision is 6 digits, and rounding the result to 6 digits gives 1. See 27.5.4.1 basic_ios constructors in the C++ draft standard (n3092).

Answer 1: 28, 2011-05-17 15:28:49
Answer 2: 17, 2011-05-17 16:08:10
Answer 3: 17, accepted, 2011-05-19 06:54:14
Answer 4: 2, 2011-05-17 17:49:14
Answer 5: 0, 2011-05-18 20:07:41
