C++ floating-point division and precision

Question

I know that 511 divided by 512 actually equals 0.998046875. I also know that the precision of floats is 7 digits. My question is, when I do this math in C++ (GCC) the result I get is 0.998047, which is a rounded value. I'd prefer to just get the truncated value of 0.998046; how can I do that?

  float a = 511.0f;
  float b = 512.0f;
  float c = a / b;

Answer 1

Well, here's one problem. The value of 511/512, as a float, is exact. No rounding is done. You can check this by asking for more than seven digits:

#include <stdio.h>
int main(int argc, char *argv[])
{
    float x = 511.0f, y = 512.0f;
    printf("%.15f\n", x/y);
    return 0;
}

Output:

0.998046875000000

A float is stored not as a decimal number, but as a binary one. If you divide a number by a power of 2, such as 512, the result will almost always be exact. What's going on is that the precision of a float is not simply 7 digits; it is really 24 bits of precision (23 stored bits plus one implicit leading bit).

See What Every Computer Scientist Should Know About Floating-Point Arithmetic.

Answer 2

I also know that the precision of floats is 7 digits.

No. The most common floating-point format is binary and has a precision of 24 bits. That is somewhere between 6 and 7 decimal digits, but you can't think in decimal if you want to understand how rounding works.

As b is a power of 2, c is exactly representable. It is during the conversion to a decimal representation that rounding occurs. The standard ways of getting a decimal representation don't offer the possibility of using truncation instead of rounding. One way would be to ask for one more digit and ignore it.

But note that the fact that c is exactly representable is a property of its value. Some apparently simpler values (like 0.1) don't have an exact representation in binary FP formats.

Answer 3

That 'rounded' value is most likely what is displayed by some output method rather than what is actually stored. Check the actual value in your debugger.

With iostream and stdio, you can specify the precision of the output. If you specify 7 significant digits, convert the value to a string, then truncate the string before display, you will get the output without rounding.

I can't think of one reason why you would want to do that, however, and given the subsequent explanation of the application, you'd be better off using double precision, though that will most likely simply shove the problems somewhere else.

Answer 4

Your question is not unique; it has been answered numerous times before. This is not a simple topic, and just because answers are posted doesn't necessarily mean they'll be of good quality. If you browse a little you'll find the really good stuff. And it will take you less time.

I bet someone will -1 me for commenting and not answering.

_____ Edit _____

What is fundamental to understanding floating point is to realize that everything is represented in binary digits. Because most people have trouble grasping this, they try to see it from the point of view of decimal digits.

On the subject of 511/512 you can start by looking at the value 1.0. In floating point this could be expressed as i.000000... * 2^0, that is, the implicit bit (set to 1) multiplied by 2^0, which equals 1. Since 511/512 is less than 1 you need to start with the next lower power, -1, giving i.000000... * 2^-1, i.e. 0.5. Notice that the only thing that has changed is the exponent. If we want to express 511 in binary we get 9 ones, 111111111, or in floating point with the implicit bit, i.11111111, which we can divide by 512 and put together with the exponent of -1, giving i.1111111100... * 2^-1.

How does this translate to 0.998046875?

Well, to begin with, the implicit bit represents 0.5 (or 2^-1), the first explicit bit 0.25 (2^-2), the next explicit bit 0.125 (2^-3), then 0.0625, 0.03125 and so on until you've represented the ninth bit (the eighth explicit one). Sum them up and you get 0.998046875. From the i.11111111 we find that this number has 9 binary digits of precision and, coincidentally, 9 decimal digits of precision.

If you multiply 511/512 by 512 you will get i.1111111100... * 2^8. Here there are the same nine binary digits of precision but only three decimal digits (for 511).

Consider i.11111111111111111111111 (i + 23 ones) * 2^-1. We will get a fraction, (2^24 - 1)/(2^24), with 24 binary and 24 decimal digits of precision. Given an appropriate printf format, all 24 decimal digits will be displayed. Multiply it by 2^24 and you still have 24 binary digits of precision but only 8 decimal (for 16777215).

Now consider i.1111100... * 2^2, which comes out to 7.875. i11 is the integer part and 111 the fraction part (binary 111/1000, or 7/8ths). That is 6 binary digits of precision and 4 decimal.

Thinking in decimal when doing floating point is utterly detrimental to understanding it. Free yourself!

Answer 5

If you are just interested in the value, you could use double, multiply the result by 10^6 and floor it. Divide again by 10^6 and you will get the truncated value.

Answer 1: score 22, accepted, 2011-05-14 16:45:52
Answer 2: score 5, 2011-05-14 16:57:02
Answer 3: score 1, 2011-05-14 17:05:07
Answer 4: score 1, 2011-05-17 11:38:28
Answer 5: score 0, 2011-05-14 16:36:54
