When a 64-bit int is cast to a 64-bit float in C/C++ and doesn't have an exact match, will it always land on a non-fractional number?

Question

When an int64_t is cast to double and doesn't have an exact match, to my knowledge I get a best-effort nearest-value equivalent in double. For example, 9223372036854775000 in int64_t appears to end up as 9223372036854774784.0 in double:

#include <stdio.h>

int main(int argc, const char **argv) {
    printf("Corresponding double: %f\n", (double)9223372036854775000LL);
    // Outputs: 9223372036854774784.000000
    return 0;
}

It appears to me as if an int64_t cast to a double always ends up as a clean non-fractional number, even in this higher number range where double has really low precision. However, I have only observed this from random attempts. Is this guaranteed to happen for any value of int64_t cast to a double?

And if I cast this non-fractional double back to int64_t, will I always get the exact corresponding 64-bit int with the .0 chopped off? (Assuming it doesn't overflow during the conversion back.) Like here:

#include <inttypes.h>
#include <stdio.h>

int main(int argc, const char **argv) {
    printf("Corresponding double: %f\n", (double)9223372036854775000LL);
    // Outputs: 9223372036854774784.000000
    printf("Corresponding int to corresponding double: %" PRId64 "\n",
           (int64_t)((double)9223372036854775000LL));
    // Outputs: 9223372036854774784
    return 0;
}

Or can it be imprecise and get me the "wrong" int in some corner cases?

Intuitively and from my tests the answer to both points appears to be "yes", but if somebody with a good formal understanding of the floating point standards and the maths behind them could confirm this, that would be really helpful to me. I would also be curious whether more aggressive optimizations like gcc's -Ofast are known to break any of this.

Answer 1

In the general case, yes, both should be true. The floating point base needs to be, if not 2, then at least an integer. Given that, an integer converted to the nearest floating point value can never produce a non-zero fraction: either the precision suffices, or the lowest-order integer digits in the base of the floating type are zeroed. For example, in your case your system uses ISO/IEC/IEEE 60559 binary floating point numbers. When inspected in base 2, it can be seen that the trailing digits of the value are indeed zeroed:

>>> bin(9223372036854775000)
'0b111111111111111111111111111111111111111111111111111110011011000'
>>> bin(9223372036854774784)
'0b111111111111111111111111111111111111111111111111111110000000000'

The conversion of a double without a fractional part to an integer type should be exact, given that the value of the double falls within the range of the integer type.

You might still encounter a quality-of-implementation issue or an outright bug, though. For example, MSVC currently has a compiler bug where a round-trip conversion of an unsigned 32-bit value with the MSB set (or just a double value between 2³¹ and 2³²-1 converted to unsigned int) would "overflow" in the conversion and always result in exactly 2³¹.

Answer 2

The following assumes the value being converted is positive. The behavior of negative numbers is analogous.

C 2018 6.3.1.4 2 specifies conversions from integer to real floating types and says:

… If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner.

This tells us that some integer value x being converted to floating-point can produce a non-integer only if one of the two representable values bounding x is not an integer and x is not representable.

5.2.4.2.2 specifies the model used for floating-point numbers. Each finite floating-point number is represented by a sequence of digits in a certain base b scaled by b^e for some exponent e. (b is an integer greater than 1.) Then, if one of the two values bounding x, say p, is not an integer, the scaling must be such that the lowest digit in that floating-point number represents a fraction. But if this is the case, then setting all of the digits in p that represent fractions to 0 must produce a new floating-point number that is an integer. If x < p, this integer must be x, and therefore x is representable in the floating-point format. On the other hand, if p < x, we can add enough to each digit that represents a fraction to make it 0 (and produce a carry to the next higher digit). This will also produce an integer representable in the floating-point type¹, and it must be x.

Therefore, if conversion of an integer x to the floating-point type would produce a non-integer, x must be representable in the type. But then conversion to the floating-point type must produce x. So it is never possible to produce a non-integer.

Footnote

¹ It is possible this will carry out of all the digits, as when applying it to a three-digit decimal number 9.99, which produces 10.00. In this case, the value produced is the next power of b, if it is in range of the floating-point format. If it is not, the C standard does not define the behavior. Also note the C standard sets minimum requirements on the range that floating-point formats must support which preclude any format from not being able to represent 1, which avoids a degenerate case in which a conversion could produce a number like .999 because it was the largest representable finite value.

Answer 3

When a 64-bit int is cast to 64-bit float... and doesn't have an exact match, will it always land on a non-fractional number?
Is this guaranteed to happen for any value of int64_t cast to a double?

For common double: yes, it always lands on a non-fractional number.

When there is no match, the result is the closest representable floating point value above or below, depending on the rounding mode. Given the characteristics of common double, these 2 bounding values are also whole numbers. So when the value is not representable, there is always a nearby representable whole number for it to land on.

... if I cast this non-fractional double back to int64_t, will I always get the exact corresponding 64-bit int with the .0 chopped off?

No. Edge cases near INT64_MAX fail, as the converted value could become an FP value above INT64_MAX. Converting that value back to the integer type is then out of range: "If the value of the integral part cannot be represented by the integer type, the behavior is undefined." C17dr § 6.3.1.4 1

#include <limits.h>
#include <stdio.h>

int main(void) {
  long long imaxm1 = LLONG_MAX - 1;
  // Rounds up to 2^63, which is above LLONG_MAX.
  double max = (double) imaxm1;
  printf("%lld\n%f\n", imaxm1, max);
  // Out-of-range conversion back to the integer type.
  long long imax = (long long) max;
  printf("%lld\n", imax);
}

9223372036854775806
9223372036854775808.000000
9223372036854775807  // Behavior is undefined here; this is only what one implementation printed.

Deeper exceptions

(Question variation) When an N-bit integer type is cast to a floating point type and doesn't have an exact match, will it always land on a non-fractional number?

Integer type range exceeds finite floating point range

Conversion to infinity: with common float and uint128_t, UINT128_MAX converts to infinity. This is readily possible with extra-wide integer types.

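#include <stdio.h>

int main(void) {
  // Build the maximum 128-bit unsigned value from two 64-bit halves.
  unsigned __int128  imaxm1 = 0xFFFFFFFFFFFFFFFF;
  imaxm1 <<= 64;
  imaxm1 |= 0xFFFFFFFFFFFFFFFF;
  double fmax = (float) imaxm1;  // exceeds the finite range of float
  double max = (double) imaxm1;  // rounds to 2^128, within double's range
  printf("%llde27\n%f\n%f\n", (long long) (imaxm1/1000000000/1000000000/1000000000), 
    fmax, max);
}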

340282366920e27
inf
340282366920938463463374607431768211456.000000

Floating point precision deeper than range

On some unicorn implementation with very wide FP precision and a small range, the largest finite value could, in theory though not in practice, be a non-whole number. Then, with an even wider integer type, the conversion could result in this non-whole number value. I do not see this as a legitimate concern for the OP.

Answer 1: 5, accepted, 2021-01-16 11:27:34
Answer 2: 4, 2021-01-16 12:58:13
Answer 3: 1, 2021-01-17 22:52:49
