
Floating point computations and rounding

I think I read somewhere that CPUs "do some floating point computations in 50 bits so that they can round down to 32 bits correctly". What I take this to mean is that the intermediate value in such a floating-point operation is computed in 50 bits so that "correct rounding to float size" can be achieved.

What does this statement mean? Is it true? If so, can you point me to some resources that explain why they would need to compute 18 extra bits? Why not 19 or 17? Or is it just plain false?

Edit: I found this link, which is quite helpful and exhaustive: http://www.gamasutra.com/view/news/167402/Indepth_Intermediate_floatingpoint_precision.php

Thanks

I can't guarantee it by any means, but I'd guess what you ran into was really 53 bits rather than 50. The reason 53 bits would be used is that it is the significand width of the next standard size of floating-point type. In the IEEE 754 standard, the smallest basic type is 32 bits total. The next size up is 64 bits total, which has a 53-bit significand (a.k.a. mantissa). Since hardware is already in place to deal specifically with that size, it's probably easiest (in most cases) to carry out the calculation at that size and then round the result to the smaller size.
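
For reference, C exposes these significand widths through <float.h>; a quick check on a typical IEEE 754 platform:

    #include <stdio.h>
    #include <float.h>

    int main(void) {
        /* Significand width in bits, counting the implicit leading 1 */
        printf("float  significand bits: %d\n", FLT_MANT_DIG);  /* 24 on IEEE 754 systems */
        printf("double significand bits: %d\n", DBL_MANT_DIG);  /* 53 on IEEE 754 systems */
        return 0;
    }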

It is common on modern computers that computing in double precision (1 sign bit, 11 exponent bits, 52 explicit significand bits) is as fast as computing in single precision (1 sign bit, 8 exponent bits, 23 explicit significand bits). Therefore, when you load float objects, calculate, and store float objects, the compiler may load the float values into double-precision registers, calculate in double precision, and store the single-precision results. This benefits you by providing extra precision at very little cost. The results may more often be "correctly rounded" (the result returned is the representable value nearest the mathematically exact result), but this is not guaranteed, because there are still rounding errors, which can interact in unexpected ways. The results may often be more accurate (closer to the exact result than float calculations would provide), but that is also not guaranteed; in rare cases, a double-precision calculation can return a result worse than the single-precision calculation would have.
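
As a small illustration of how a double-precision intermediate can change a single-precision result, here is a sketch, assuming a platform where float arithmetic really is performed in single precision (FLT_EVAL_METHOD of 0); on a platform that silently uses wider intermediates, even the "pure float" line may print 1, which is exactly the effect being described:

    #include <stdio.h>

    int main(void) {
        float a = 16777216.0f;  /* 2^24: the first power of two where adding 1 is no longer exact in float */

        /* Pure single precision: a + 1 rounds back to a, so the difference is 0 */
        float single = (a + 1.0f) - a;

        /* Double-precision intermediate: 2^24 + 1 is exactly representable
           in double, so the difference survives and rounds to 1 */
        float wide = (float)(((double)a + 1.0) - (double)a);

        printf("single-precision intermediate: %g\n", single);  /* 0 */
        printf("double-precision intermediate: %g\n", wide);    /* 1 */
        return 0;
    }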

There are times when double precision is more expensive than single precision, notably when performing SIMD programming.

Commonly, high-level languages leave the compiler free to decide how to evaluate floating-point expressions, so a compiler may use single precision or double precision depending on the vendor's choices (or the quality of the compiler), the optimization and target switches you have passed to the compiler, other aspects of the code being compiled (e.g., the availability of machine registers to do the calculations in), and other factors that may be effectively random for practical purposes. So this is not behavior you can rely on.
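
In C specifically, the standard exposes this implementation choice through the FLT_EVAL_METHOD macro in <float.h>, which you can inspect:

    #include <stdio.h>
    #include <float.h>

    int main(void) {
        /* FLT_EVAL_METHOD describes how floating-point expressions are evaluated:
            0 = each operation in the range and precision of its operand type
            1 = float and double operations evaluated in double
            2 = all operations evaluated in long double (e.g., classic x87)
           -1 = indeterminable */
        printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
        return 0;
    }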

Another meaning for what you heard might be that library routines for single-precision functions, such as sinf or logf, may be written in double precision so that it is easier for them to get the desired results than if they had to be written entirely in single precision. That is common. However, such library routines are carefully written by experts who analyze the errors that may occur during the calculations, so it is not simply a matter of assuming that more bits give better results.
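
A minimal sketch of that idea (my_sinf is a hypothetical name, not a real libm routine, and real implementations involve far more careful error analysis than this):

    #include <math.h>

    /* Hypothetical wrapper: do the work in double, where the ~29 extra
       significand bits absorb most of the intermediate rounding error,
       then round once when converting the result back to float. */
    float my_sinf(float x) {
        return (float)sin((double)x);
    }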

This has to do with epsilon values. For example, take the classic 0.1 + 0.2 problem: http://0.30000000000000004.com/

In most languages, 0.1 + 0.2 != 0.3. That is because while 0.1 and 0.2 are terminating decimals in base 10, in base 2, 0.1 looks like 0.0001100110011... and 0.2 looks like 0.001100110011..., so when you add the two values together you get a repeating binary number that only approaches 0.3 as you add infinite precision, similar to how 0.333333333... + 0.333333333... approaches 2/3 as you carry more and more digits.
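
You can reproduce this directly in C:

    #include <stdio.h>

    int main(void) {
        double sum = 0.1 + 0.2;
        printf("%.17g\n", sum);                               /* prints 0.30000000000000004 */
        printf("%s\n", sum == 0.3 ? "equal" : "not equal");   /* prints "not equal" */
        return 0;
    }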

As for why 18 extra bits rather than 19 or 17, that's a more complex discussion. See http://en.wikipedia.org/wiki/Machine_epsilon for more details.
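
The machine-epsilon values for both sizes are available in <float.h>, if you want to see the precision gap between float and double directly:

    #include <stdio.h>
    #include <float.h>

    int main(void) {
        /* Machine epsilon: the difference between 1.0 and the next
           representable value in each format */
        printf("FLT_EPSILON = %g\n", (double)FLT_EPSILON);  /* about 1.19e-07 */
        printf("DBL_EPSILON = %g\n", DBL_EPSILON);          /* about 2.22e-16 */
        return 0;
    }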
