Converting single-precision floats to double precision for division

Question

As someone who works in high-performance computing, I tend to default to single-precision floating-point numbers (float or real) whenever possible. This is because you can perform more operations per second if each operation is individually faster to perform.

One of the more senior people I work with, however, always insists that (when accuracy is required) you should temporarily convert your single-precision data to double precision in order to perform division. That is:

float a, b;
float ans = ((double)a)/((double)b);

or

real :: a, b, ans
ans = real(dble(a)/dble(b))

depending on the language you're working in. In my opinion, this looks really ugly, and to be honest I don't even know whether the answer held in ans will be more accurate than if you had simply written ans = a/b in single precision.

Can someone tell me whether converting your numbers prior to arithmetic, specifically for performing division, will actually result in a more accurate answer? Is this a language/compiler-specific question, or is it up to IEEE? For which number values would this accuracy improvement be most noticeable?

Any enlightening comments/answers would be much appreciated.

Answer 1

float ans = ((double)a)/((double)b);

This article demonstrates that ans is always the same as the result of a single-precision division, for IEEE 754 arithmetic and FLT_EVAL_METHOD=0.

When FLT_EVAL_METHOD=1, the same property is also trivially true.

When FLT_EVAL_METHOD=2, I am not sure. One might interpret the rules as meaning that the long double computation of a/b must first be rounded to double, then to float. In that case, it can be less accurate than rounding directly from long double to float (the latter produces the correctly rounded result, whereas the former could fail to do so in extremely rare cases, unless another theorem like Figueroa's applies and shows that this never happens).

Long story short, for any modern, reasonable floating-point computing platform (*), it is superstition that float ans = ((double)a)/((double)b); has any benefit. You should ask the senior people you refer to in your question to exhibit one pair of values a, b for which the result is different, not to mention more accurate. Surely, if they insist that this is better, it should be no trouble for them to provide a single pair of values for which it makes a difference.

(*) Remember to use -fexcess-precision=standard with GCC to preserve your sanity.

Answer 2

This depends greatly on what platform is being used.

An 80x86 (or a 1980s-era 8087) using non-SSE instructions performs all its arithmetic in 80-bit precision (long double or real*10). It is the "store" instruction, which moves results from the numeric processor to memory, that loses precision.

Unless it is a really bone-headed compiler, maximum precision should result from

float a = something, b = something_else;
float ans = a/b;

since, to perform the division, the single-precision operands are converted to extended precision when loaded, and the result is computed in extended precision.

If you are doing something more intricate and want to maintain maximum precision, don't store intermediate results in smaller-sized variables:

float a, b, c, d;

float prod_ad = a * d;
float prod_bc = b * c;
float sum_both = prod_ad + prod_bc;   // less accurate

That gives a less precise result than doing it all at once, since most compilers will produce code that keeps all the intermediate values at extended precision:

float a, b, c, d;

float sum_both = a * d + b * c;   // more accurate

Building on Eugeniu Rosca's example program:

#include <stdio.h>

int main(void)
{
    float a=73;
    float b=19;

    /* the same inputs, widened to long double */
    long double a1 = a;
    long double b1 = b;

    float ans1 = (a*a*a/b/b/b);   /* float operands throughout */
    float ans2 = ((double)a*(double)a*(double)a/(double)b/(double)b/(double)b);
    float ans3 = a1*a1*a1/b1/b1/b1;         /* long double, stored as float */
    long double ans4 = a1*a1*a1/b1/b1/b1;   /* long double, kept as long double */

    printf ("plain:  %.20g\n", ans1);
    printf ("cast:   %.20g\n", ans2);
    printf ("native: %.20g\n", ans3);
    printf ("full:   %.20Lg\n", ans4);
    return 0;
}

prints, regardless of the optimization level:

plain:  56.716281890869140625
cast:   56.71628570556640625
native: 56.71628570556640625
full:   56.716285172765709289

This shows that for trivial operations, there isn't much difference. However, changing the constants to be more of a precision challenge:

float a=0.333333333333333333333333;
float b=0.1;

prints

plain:  37.03704071044921875
cast:   37.037036895751953125
native: 37.037036895751953125
full:   37.037038692721614131

where the precision difference has a more pronounced effect.

Answer 3

Yes, converting to double precision will give you better accuracy (or, shall I say, precision) in division. One could say that this is up to IEEE, but only because IEEE defines the formats and standards. doubles are inherently more precise than floats, both for storing numbers and for division.

To answer your last question, this would be most noticeable with a large a and a small (less than 1) b, because then you end up with a very large quotient, in the range where floating-point numbers are more coarsely spaced.

Answer 4

Running this on x86 (GCC 4.9.3):

#include <stdio.h>
int main(int argc, char **argv)
{
    float a=73;
    float b=19;

    float ans1 = (a*a*a/b/b/b);
    float ans2 = ((double)a*(double)a*(double)a/(double)b/(double)b/(double)b);
    printf("plain: %f\n", ans1);
    printf("cast:  %f\n", ans2);
    return 0;
}

outputs:

plain: 56.716282
cast:  56.716286

The same operations in the Windows calculator return:

56.716285172765709287068085726782

Clearly, the second result has greater accuracy.

4 answers

Answer 1: 10, accepted, 2015-07-11 00:58:04

Answer 2: 4, 2015-07-10 23:20:18

Answer 3: 3, 2015-07-10 23:14:47

Answer 4: 1, 2015-07-10 23:20:01
