Is casting a signed integer to a binary floating point number cheaper than the inverse operation?

Asked 2015-12-03 16:55:24. Tags: c++, casting, floating-point, integer, cross-platform

I know from articles like "Why you should never cast floats to ints", and many others like it, that casting a float to a signed int is expensive. I'm also aware that dedicated conversion instructions or SIMD vector instructions on some architectures can speed up the process. I'm curious whether converting an integer to floating point is also expensive, since all the material I've found on the subject only discusses how expensive it is to convert from floating point to integer.

Before anyone says "Why don't you just test it?": I'm not talking about performance on a particular architecture. I'm interested in the algorithmic behavior of the conversion across multiple platforms adhering to the IEEE 754-2008 standard. Is there something inherent to the conversion algorithm that affects performance in general?

Intuitively, I would think that conversion from integer to floating point would be easier in general, for the following reasons:

Rounding is only necessary if the precision of the integer exceeds the precision of the binary floating point number, e.g. a 32-bit integer to a 32-bit float might require rounding, but a 32-bit integer to a 64-bit float won't, and neither will a 32-bit integer that only uses 24 bits of precision.
There is no need to check for NaN, +/- INF or +/- 0.
There is no danger of overflow or underflow.
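The first point can be checked with a small round-trip test. This is only a sketch: it assumes `float` is IEEE 754 binary32 and `double` is binary64, and the helper names are made up for illustration.

```cpp
#include <cstdint>

// True when int32 -> float needs no rounding, i.e. the float holds v exactly.
// The comparison is done in double, which (assuming binary64) represents
// every int32_t and every float exactly.
bool converts_exactly_to_float(std::int32_t v) {
    float f = static_cast<float>(v);
    return static_cast<double>(f) == static_cast<double>(v);
}

// True when int32 -> double needs no rounding (expected for every int32_t,
// because a binary64 significand has 53 bits).
bool converts_exactly_to_double(std::int32_t v) {
    double d = static_cast<double>(v);
    return static_cast<std::int64_t>(d) == static_cast<std::int64_t>(v);
}
```

For instance, 16777216 (2^24) survives the trip through `float`, while 16777217 needs 25 significant bits and does not; both survive the trip through `double`.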

What are the reasons, if any, that conversion from int to float could result in poor cross-platform performance (other than a platform emulating floating point numbers in software)? Is conversion from int to float generally cheaper than float to int?

2 answers

Intel specifies in its "Architectures Optimization Reference Manual" that CVTSI2SD has 3-4 cycles of latency (and 1 cycle throughput) on the mainstream desktop/server line since Core 2. This can be taken as a good example.

From the hardware point of view, such a conversion requires some assistance to fit into a reasonable number of cycles; otherwise it gets too expensive. A naive but reasonably good explanation follows. Throughout, I assume that a single CPU clock cycle is enough for an operation like a full-width integer addition (but not radically longer!), and that all results of the previous cycle are applied at the cycle boundary.

In the first clock cycle, appropriate hardware assistance (a priority encoder) gives the Count Leading Zeros (CLZ) result, along with detection of two special cases: 0 and INT_MIN (MSB set and all other bits clear). 0 and INT_MIN are better processed separately (load a constant into the destination register and finish). Otherwise, if the input integer is negative, it must be negated; this usually requires one more cycle (because negation is a combination of bit inversion and adding a carry bit). So, 1-2 cycles are spent.

At the same time, the hardware can compute a prediction of the biased exponent from the CLZ result. Notice that we need not take care of denormalized values or infinity. (Can we predict CLZ(-x) based on CLZ(x), if x < 0? If we can, this saves us 1 cycle.)
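One observation that may bear on the question in parentheses: for a negative x in two's complement, -x equals ~x + 1, so the leading-zero count of -x is either the number of leading one bits of x or one less than that (one less exactly when -x is a power of two). The sketch below only checks that relation in software; `clz32`, `clo32` and `clz_of_negation_is_predictable` are hypothetical helper names, and nothing here claims that real hardware uses this trick.

```cpp
#include <cassert>
#include <cstdint>

// Count leading zero bits of a nonzero 32-bit value (portable loop).
int clz32(std::uint32_t v) {
    assert(v != 0);
    int n = 0;
    while ((v & 0x80000000u) == 0) { v <<= 1; ++n; }
    return n;
}

// Count leading one bits of a 32-bit value (32 for all-ones).
int clo32(std::uint32_t v) {
    int n = 0;
    while (n < 32 && (v & 0x80000000u) != 0) { v <<= 1; ++n; }
    return n;
}

// For negative x: is CLZ(-x) equal to CLO(x) or CLO(x) - 1?
bool clz_of_negation_is_predictable(std::int32_t x) {
    std::uint32_t bits = static_cast<std::uint32_t>(x);
    std::uint32_t mag = 0u - bits;  // magnitude modulo 2^32, defined even for INT_MIN
    int lead_ones = clo32(bits);
    int lead_zeros = clz32(mag);
    return lead_zeros == lead_ones || lead_zeros == lead_ones - 1;
}
```

So the exponent could in principle be predicted from the un-negated input and corrected by at most one afterwards.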

Then a shift is applied (1 cycle again, with a barrel shifter) to place the integer value so that its highest 1 is at a fixed position (e.g. with the standard 3 extension bits and a 24-bit mantissa, this is bit number 26). This use of the barrel shifter must also combine all shifted-out low bits into the sticky bit (a separate custom barrel shifter instance may be needed, but this is far cheaper than megabytes of cache or an OoO dispatcher). Now, up to 3 cycles.

Then rounding is applied. In our case, rounding means analyzing the 4 lowest bits of the current value (mantissa LSB, guard, round and sticky) and, on the other hand, the current rounding mode and the target sign (extracted at cycle 1). Rounding to zero (RZ) results in ignoring the guard/round/sticky bits. Rounding to -∞ (RMI) for a positive value and to +∞ (RPI) for a negative one is the same as rounding to zero. In the remaining two cases (RPI for a positive value, RMI for a negative one), 1 is added to the main mantissa if any of the guard/round/sticky bits is set. Finally, round-to-nearest, ties-to-even (RNE): x000...x011 -> discard; x101...x111 -> add 1; 0100 -> discard; 1100 -> add 1. If the hardware is fast enough to add this result in the same cycle (I guess it's likely), we have up to 4 cycles now.
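The rounding decision described above boils down to a small boolean function. A minimal sketch follows; the enum and function names are invented for illustration, and the mantissa is treated as a magnitude with the sign passed separately.

```cpp
// Rounding modes named as in the text: RZ, RMI, RPI, RNE.
enum class Rounding { ToZero, TowardNegInf, TowardPosInf, NearestEven };

// Returns true when 1 must be added to the kept mantissa (a magnitude).
// lsb is the lowest kept mantissa bit; guard, round_bit and sticky are the
// three extension bits below it.
constexpr bool round_increments(Rounding mode, bool negative, bool lsb,
                                bool guard, bool round_bit, bool sticky) {
    const bool inexact = guard || round_bit || sticky;
    switch (mode) {
    case Rounding::ToZero:
        return false;                        // always truncate
    case Rounding::TowardNegInf:
        return negative && inexact;          // magnitude grows only for negatives
    case Rounding::TowardPosInf:
        return !negative && inexact;         // magnitude grows only for positives
    case Rounding::NearestEven:
        if (!guard) return false;            // x000..x011: below one half
        if (round_bit || sticky) return true; // x101..x111: above one half
        return lsb;                          // tie: 0100 -> discard, 1100 -> add 1
    }
    return false;
}
```

In hardware this would be a handful of gates evaluated in parallel rather than a switch; the function form is only meant to make the table of cases easy to check.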

The addition in the previous step can produce a carry (like 1111 -> 10000), so the exponent can increase. The final cycle packs the sign (from cycle 1), the mantissa (the "significand") and the biased exponent (calculated at cycle 2 from the CLZ result and possibly adjusted with the carry from cycle 4). So, 5 cycles now.
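Put together, the int -> float steps can be sketched in software roughly as follows. This is an illustration under assumptions, not how any particular CPU does it: it covers only int32 -> binary32 with round-to-nearest-even, returns the raw bit pattern, and the names `int32_to_f32_bits` and `float_bits` are made up.

```cpp
#include <cstdint>
#include <cstring>

// Raw bit pattern of a float (assumes float is IEEE 754 binary32).
std::uint32_t float_bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// int32 -> binary32 bit pattern, round-to-nearest-even only.
std::uint32_t int32_to_f32_bits(std::int32_t x) {
    // Step 1: special cases, sign extraction, negation.
    if (x == 0) return 0u;                    // +0.0f
    if (x == INT32_MIN) return 0xCF000000u;   // -2^31, exactly representable
    const std::uint32_t sign = x < 0 ? 0x80000000u : 0u;
    const std::uint32_t bits = static_cast<std::uint32_t>(x);
    const std::uint32_t mag = x < 0 ? 0u - bits : bits;   // 1 .. 2^31 - 1

    // Step 1 (continued): count leading zeros; here 1 <= clz <= 31.
    int clz = 0;
    while (((mag << clz) & 0x80000000u) == 0) ++clz;

    // Step 2: predicted biased exponent; the top set bit is at position 31 - clz.
    std::uint32_t biased_exp = 158u - static_cast<std::uint32_t>(clz);

    // Step 3: move the top set bit to bit 26 (24 mantissa bits + guard,
    // round, sticky), folding any shifted-out bits into the sticky bit.
    std::uint32_t norm;
    if (clz >= 5) {
        norm = mag << (clz - 5);              // left shift, nothing is lost
    } else {
        const int sh = 5 - clz;               // 1 .. 4
        const std::uint32_t lost = mag & ((1u << sh) - 1u);
        norm = (mag >> sh) | (lost != 0 ? 1u : 0u);
    }

    // Step 4: round to nearest, ties to even.
    std::uint32_t sig = norm >> 3;            // 24 bits, hidden bit at bit 23
    const bool lsb = (sig & 1u) != 0;
    const bool guard = (norm & 4u) != 0;
    const bool round_bit = (norm & 2u) != 0;
    const bool sticky = (norm & 1u) != 0;
    if (guard && (round_bit || sticky || lsb)) ++sig;

    // Step 5: a rounding carry (0xFFFFFF -> 0x1000000) bumps the exponent.
    if (sig == (1u << 24)) { sig >>= 1; ++biased_exp; }
    return sign | (biased_exp << 23) | (sig & 0x007FFFFFu);
}
```

On a platform whose default rounding mode is RNE, `int32_to_f32_bits(v)` should match `float_bits(static_cast<float>(v))`; for example, 16777217 and INT32_MAX both exercise the rounding path, and INT32_MAX also exercises the carry.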

Is conversion from int to float generally cheaper than float to int?

We can estimate the inverse conversion in the same way, e.g. from binary32 to int32 (signed). Let's assume that conversion of NaN, INF or a too-large value results in a fixed value, say, INT_MIN (-2147483648). In that case:

Split and analyze the input value: S - sign; BE - biased exponent; M - mantissa (significand); also take the rounding mode into account. A "conversion impossible" (overflow or invalid) signal is generated if BE >= 158 (this includes NaN and INF). A "zero" signal is generated if BE < 127 (abs(x) < 1) and {RZ, or (x > 0 and RMI), or (x < 0 and RPI)}; or if BE < 126 (abs(x) < 0.5) with RNE; or if BE = 126, significand = 0 (without the hidden bit) and RNE. Otherwise, signals for a final +1 or -1 can be generated for the cases BE < 127 and: x < 0 and RMI; x > 0 and RPI; BE = 126 and RNE. All these signals can be calculated within one cycle using boolean logic circuitry, and they allow the result to be finalized at the first cycle. In parallel and independently, calculate 157-BE using a separate adder for use at cycle 2.

If the result is not finalized yet, we have abs(x) >= 1, so BE >= 127, but BE <= 157 (so abs(x) < 2**31). Take 157-BE from cycle 1; this is the required shift amount. Apply a right shift by this amount, using the same barrel shifter as in the int -> float algorithm, to a value with (again) 3 additional bits and sticky bit gathering. Here, 2 cycles are spent.

Apply rounding (see above). 3 cycles are spent, and a carry can be produced. Here, we can again detect integer overflow and produce the respective result value. Drop the additional bits; only 31 bits are significant now.

Finally, negate the resulting value if x was negative (sign = 1). Up to 4 cycles are spent.
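The float -> int steps above can be sketched the same way. Again this is only an illustration under assumptions: binary32 input given as a raw bit pattern, round-to-nearest-even only, INT_MIN returned for NaN, INF and out-of-range values as assumed in the text, and a 64-bit working value standing in for the barrel shifter with its extension bits. The function name is hypothetical.

```cpp
#include <cstdint>

// binary32 bit pattern -> int32, round-to-nearest-even only.
std::int32_t f32_bits_to_int32(std::uint32_t bits) {
    // Step 1: split the input and resolve the cases that finish immediately.
    const bool negative = (bits >> 31) != 0;
    const std::uint32_t be = (bits >> 23) & 0xFFu;   // biased exponent
    const std::uint32_t frac = bits & 0x007FFFFFu;   // mantissa without hidden bit

    if (be >= 158u) return INT32_MIN;   // abs(x) >= 2^31, INF or NaN
    if (be < 126u) return 0;            // abs(x) < 0.5 (includes zero, denormals)
    if (be == 126u) {                   // 0.5 <= abs(x) < 1
        if (frac == 0u) return 0;       // exactly 0.5: tie goes to even, i.e. 0
        return negative ? -1 : 1;
    }

    // Step 2: 127 <= be <= 157, so the shift amount 157 - be is 0 .. 30.
    const unsigned shift = 157u - be;
    // Hidden bit at bit 33: bits 3 and up hold the integer part after the
    // shift, bits 2..0 are guard, round and sticky.
    std::uint64_t wide = static_cast<std::uint64_t>(frac | 0x00800000u) << 10;
    const std::uint64_t lost = wide & ((std::uint64_t{1} << shift) - 1u);
    wide = (wide >> shift) | (lost != 0 ? 1u : 0u);

    // Step 3: round to nearest, ties to even.
    std::uint32_t mag = static_cast<std::uint32_t>(wide >> 3);
    const bool lsb = (mag & 1u) != 0;
    const bool guard = (wide & 4u) != 0;
    const bool round_bit = (wide & 2u) != 0;
    const bool sticky = (wide & 1u) != 0;
    if (guard && (round_bit || sticky || lsb)) ++mag;
    // Overflow check kept to mirror the step above; with a 24-bit significand
    // a rounding carry should never reach 2^31.
    if (mag >= 0x80000000u) return INT32_MIN;

    // Step 4: apply the sign.
    const std::int32_t result = static_cast<std::int32_t>(mag);
    return negative ? -result : result;
}
```

With this sketch the bit patterns of 2.5f and 3.5f map to 2 and 4 (ties to even), and that of 1e10f maps to INT_MIN.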

I'm not an experienced binary logic developer, so I may have missed some chance to compact this sequence, but it looks rather close to Intel's values. So the conversions themselves are quite cheap, provided hardware assistance is present (again, it takes no more than a few thousand gates, which is tiny for contemporary chip production).

You can also take a look at the Berkeley SoftFloat library: it implements virtually the same approach with minor modifications. Start with the ui32_to_f32.c source file. It uses more additional bits for intermediate values, but this is not essential.

See @Netch's excellent answer regarding the algorithm, but it's not just the algorithm. The FPU runs asynchronously, so the int->FP operation can start and the CPU can then execute the next instruction. But when storing FP to integer, there has to be an FWAIT (Intel).