Rounding a floating-point value to single precision

Question

C and C++ provide floating-point data types of several widths, but they leave precision unspecified. The compiler is free to use idealized arithmetic to simplify expressions, to use double precision in computing an expression over float values, or to use a double-precision register to keep the value of a float variable or common subexpression.

Correct me if I'm wrong (this is wrong, see the edit below), but it's even legal to hoist a float in memory into a double-precision register, so storing a value and then loading it back doesn't necessarily truncate bits.

What is the safest, most portable way to convert a number to a lower precision? Ideally, it should be efficient too, compiling to cvtsd2ss on SSE2. (So, while volatile may be an answer, I'd prefer something better.)
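For reference, the volatile fallback mentioned above could be sketched roughly as follows. The helper name narrow_via_volatile is made up for illustration, and the expected values assume IEEE 754 float/double with the default round-to-nearest mode.

```c
#include <assert.h>

/* Hypothetical helper (not from the thread): forces a real store to a
   float object, so the value read back has float precision even if the
   compiler would otherwise keep excess precision in a register. */
float narrow_via_volatile(double d)
{
    volatile float f = (float)d;  /* the store must actually happen */
    return f;                     /* and so must this load */
}

/* Small self-check, assuming IEEE 754 and round-to-nearest-even. */
void narrow_via_volatile_selfcheck(void)
{
    assert(narrow_via_volatile(0.5) == 0.5f);          /* exactly representable */
    assert((double)narrow_via_volatile(0.1) != 0.1);   /* low bits are dropped */
}
```

The cost is a forced round trip through memory, which is why something cheaper than volatile is preferable.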

Edit: Summarizing some of the comments and findings:

Wider precision for intermediate results is always fair game.
Expression simplification is allowed in C++, and in C given FP_CONTRACT ON.
Using double precision to hold a single-precision float is not allowed (in C or C++).

However, some compilers (particularly GCC on x86-32) skip some of these precision conversions, which is nonconforming.
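One way to see which intermediate precision a given compiler and target claim to use is the standard FLT_EVAL_METHOD macro from <float.h>. The sketch below only reports that macro; the helper names are invented for illustration. Typically the value is 0 on x86-64 with SSE2 and 2 with x87 arithmetic on x86-32, but that depends on the compiler and flags.

```c
#include <assert.h>
#include <float.h>
#include <string.h>

/* Hypothetical helper: describes a FLT_EVAL_METHOD value in words. */
const char *eval_method_name(int method)
{
    switch (method) {
    case 0:  return "each operation is evaluated in its own type";
    case 1:  return "float and double operations are evaluated as double";
    case 2:  return "all operations are evaluated as long double";
    case -1: return "indeterminable";
    default: return "implementation-defined";
    }
}

/* Reports the evaluation method of the current translation unit. */
const char *current_eval_method(void)
{
    return eval_method_name(FLT_EVAL_METHOD);
}

/* Small self-check of the mapping above. */
void eval_method_selfcheck(void)
{
    assert(strcmp(eval_method_name(-1), "indeterminable") == 0);
    assert(current_eval_method() != NULL);
}
```

Note that the macro only describes intermediate results; casts and assignments are still required to narrow regardless of its value.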

Edit 2: Some folks are expressing doubt as to the conformance of failing to narrow intermediate results.

C11 §5.2.4.2.2/9 (same as the C99 ref cited in the answer) is specific about "remove all extra range and precision" because it specifies how other computations may be done in wider precision. Among several conforming alternative precisions is "indeterminable," which to me means no constraint whatsoever.
C11 §7.12.2 and §6.5/8 define #pragma STDC FP_CONTRACT ON, which lets the compiler use infinite precision where possible.

The intermediate operations in the contracted expression are evaluated as if to infinite range and precision, while the final operation is rounded to the format determined by the expression evaluation method. A contracted expression might also omit the raising of floating-point exceptions.
C++14 likewise specifically waives the constraints of finite precision and range on intermediate results. N4567 §5/12:

The values of the floating operands and the results of floating expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby.
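To make the effect of contraction concrete, here is a rough sketch comparing a multiply-add whose product is rounded to float first against one whose product is kept exact. It emulates the exact product with double arithmetic rather than relying on the compiler to contract anything; the standard library's fmaf is the real fused operation, and this emulation is not equivalent to it in general because the double addition can round as well. Function names and inputs are illustrative, and IEEE 754 float/double is assumed.

```c
#include <assert.h>

/* Product rounded to float first, then added: two roundings.
   The volatile store keeps the compiler from fusing the two steps. */
float mul_add_separate(float a, float b, float c)
{
    volatile float product = a * b;
    return product + c;
}

/* Product kept exact: float * float always fits in a double
   (24 + 24 significand bits <= 53), so only the sum is rounded. */
float mul_add_wide(float a, float b, float c)
{
    double exact_product = (double)a * (double)b;
    return (float)(exact_product + (double)c);
}

/* Self-check: a*a = 1 + 2^-11 + 2^-24, whose last term is lost
   when the product is rounded to float. */
void mul_add_selfcheck(void)
{
    float a = 1.0f + 0x1p-12f;
    float c = -(1.0f + 0x1p-11f);
    assert(mul_add_separate(a, a, c) == 0.0f);
    assert(mul_add_wide(a, a, c) == 0x1p-24f);
}
```

So a contracted a * b + c can legitimately return a different value than the uncontracted one, which is the point of the wording quoted above.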

Note that allowing the identity x - x = 0 to simplify a + b - b + c into a + c is not the same as making addition commutative or associative. a + b + c is still not the same as a + c + b or a + (b + c), when the CPU only provides addition with two addends and a rounded result.
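A small sketch of that last point, with each partial sum forced to float through a volatile store. The function names and the inputs are chosen for illustration, and IEEE 754 single precision is assumed.

```c
#include <assert.h>

/* (a + b) + c, with the partial sum rounded to float */
float sum_left(float a, float b, float c)
{
    volatile float partial = a + b;
    return partial + c;
}

/* a + (b + c), with the partial sum rounded to float */
float sum_right(float a, float b, float c)
{
    volatile float partial = b + c;
    return a + partial;
}

/* Self-check: near 1e8 adjacent floats are 8 apart, so 1e8 + 1
   rounds back to 1e8 and the 1 is lost in the left grouping. */
void sum_order_selfcheck(void)
{
    assert(sum_left(1.0f, 1e8f, -1e8f) == 0.0f);
    assert(sum_right(1.0f, 1e8f, -1e8f) == 1.0f);
}
```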

Answer 1

C99 5.2.4.2.2p8 explicitly says that

assignment and cast [..] remove all extra range and precision

So, if you want to limit the range and precision to that of a float, just cast to float, or assign to a float variable.

You can even write (double)((float)d) (with the extra parentheses to make sure humans read it correctly), which limits a variable d to float precision and range and then converts it back to double. (A conforming C compiler is NOT allowed to optimize that away even if d is a double; it must limit the precision and range to that of a float.)
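Wrapped in a function, the double cast might look like the sketch below. The name round_to_float is made up for illustration, and the expected values assume IEEE 754 with the default round-to-nearest-even mode and a compiler that honors the cast as the standard requires.

```c
#include <assert.h>

/* Hypothetical helper: rounds d to float precision and range, then
   widens the result back to double. The inner cast drops the extra bits. */
double round_to_float(double d)
{
    return (double)((float)d);
}

/* Small self-check, assuming IEEE 754 float and double. */
void round_to_float_selfcheck(void)
{
    assert(round_to_float(0.5) == 0.5);                 /* exact in float */
    assert(round_to_float(0.1) != 0.1);                 /* 0.1 is not */
    assert(round_to_float(16777217.0) == 16777216.0);   /* 2^24 + 1 ties to even */
}
```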

I've used this in practical implementations of, e.g., the Kahan summation algorithm, where it lets the C compiler optimize very aggressively without the risk of invalidating the algorithm.
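The answer does not show its code, so the following is only a minimal sketch of what Kahan summation with explicit narrowing casts might look like, not the answerer's implementation. The function names are invented, and the casts are redundant wherever intermediates are already evaluated in float; they matter when the compiler evaluates them in a wider format.

```c
#include <assert.h>
#include <stddef.h>

/* Compensated (Kahan) summation; every intermediate is explicitly
   narrowed to float so excess precision cannot change the result. */
float kahan_sum(const float *x, size_t n)
{
    float sum = 0.0f;
    float comp = 0.0f;  /* running compensation for lost low-order bits */
    for (size_t i = 0; i < n; i++) {
        float y = (float)(x[i] - comp);
        float t = (float)(sum + y);
        comp = (float)((float)(t - sum) - y);
        sum = t;
    }
    return sum;
}

/* Plain left-to-right summation, for comparison. */
float naive_sum(const float *x, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum = (float)(sum + x[i]);
    return sum;
}

/* Small self-check on values that sum exactly. */
void kahan_selfcheck(void)
{
    const float v[3] = { 1.0f, 2.0f, 3.0f };
    assert(kahan_sum(v, 3) == 6.0f);
    assert(naive_sum(v, 3) == 6.0f);
}
```

On long inputs, such as a million copies of 0.1f, the compensated sum stays close to the true total while the naive sum drifts noticeably.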

Answer 2

I'm not so sure I share your fear here... I tried this glorified cast-as-a-function:

float to_float(double x)
{
  return (float) x;
}

When I enter it into the Compiler Explorer, I get this:

to_float(double):
        push     rbp
        mov      rbp, rsp
        movsd    QWORD PTR [rbp-8], xmm0
        cvtsd2ss xmm0, QWORD PTR [rbp-8]
        pop      rbp
        ret

That seems to generate the requested opcode (cvtsd2ss) right away, and I didn't even enter any compiler options to force SSE2 or anything.

I'd say that a cast has to convert to the target type; as far as I know, the compiler isn't free to ignore casts.

Can you provide a case where you have actually seen the compiler ignore a cast? Perhaps there's undefined behavior of some kind lurking in the code, which makes the compiler take unexpected shortcuts.

Answer 1: score 4, accepted, 2016-11-24 14:39:09
Answer 2: score 1, 2016-11-24 10:45:06
