What happens when you cast a big integer to float

Question

This is a general question about what precisely happens when I cast a very big/small SIGNED integer to a floating point value using gcc 4.4.

I see some weird behaviour when doing the cast. Here are some examples:

MUST BE is obtained with this method:

float f = (float)x;
unsigned int r;
memcpy(&r, &f, sizeof(unsigned int));

./btest -f float_i2f -1 0x80800001
input:          10000000100000000000000000000001
absolute value: 01111111011111111111111111111111

exponent:       10011101
mantissa:       00000000011111101111111111111111  (right shifted absolute value)

EXPECT:         11001110111111101111111111111111  (sign|exponent|mantissa)
MUST BE:        11001110111111110000000000000000  (sign ok, exponent ok,
                                                     mantissa???)

./btest -f float_i2f -1 0x3f7fffe0

EXPECT:    01001110011111011111111111111111
MUST BE:   01001110011111100000000000000000

./btest -f float_i2f -1 0x80004999                                                                  


EXPECT:    11001110111111111111111101101100
MUST BE:   11001110111111111111111101101101    (<- 1 added at the end)

So what bothers me is that the mantissa in these examples is different from what I get if I just shift my integer value to the right. The zeros at the end, for instance. Where do they come from?

I only see this behaviour on big/small values. Values in the range -2^24 to 2^24 work fine.

I wonder if someone can enlighten me about what happens here. What steps are taken for very big/small values?

This is an add-on question to: function to convert float to int (huge integers), which is not as general as this one.

EDIT Code:

unsigned float_i2f(int x) {
  if (x == 0) return 0;
  /* get sign of x */
  int sign = (x>>31) & 0x1;

  /* absolute value of x */
  int a = sign ? ~x + 1 : x;

  /* calculate exponent */
  int e = 158;
  int t = a;
  while (!(t >> 31) & 0x1) {
    t <<= 1;
    e--;
  };

  /* calculate mantissa */
  int m = (t >> 8) & ~(((0x1 << 31) >> 8 << 1));
  m &= 0x7fffff;

  int res = sign << 31;
  res |= (e << 23);
  res |= m;

  return res;
}

EDIT 2:

After Adam's remarks and the reference to the book Write Great Code, I updated my routine with rounding. I still get some rounding errors (now fortunately only 1 bit off).

Now if I do a test run, I get mostly good results but a couple of rounding errors like this:

input:  0xfefffff5
result: 11001011100000000000000000000101
GOAL:   11001011100000000000000000000110  (1 too low)

input:  0x7fffff
result: 01001010111111111111111111111111
GOAL:   01001010111111111111111111111110  (1 too high)

unsigned float_i2f(int x) {
  if (x == 0) return 0;
  /* get sign of x */
  int sign = (x>>31) & 0x1;

  /* absolute value of x */
  int a = sign ? ~x + 1 : x;

  /* calculate exponent */
  int e = 158;
  int t = a;
  while (!(t >> 31) & 0x1) {
    t <<= 1;
    e--;
  };

  /* mask to check which bits get shifted out when rounding */
  static unsigned masks[24] = {
    0, 1, 3, 7, 
    0xf, 0x1f, 
    0x3f, 0x7f, 
    0xff, 0x1ff, 
    0x3ff, 0x7ff, 
    0xfff, 0x1fff, 
    0x3fff, 0x7fff, 
    0xffff, 0x1ffff, 
    0x3ffff, 0x7ffff, 
    0xfffff, 0x1fffff, 
    0x3fffff, 0x7fffff
  };

  /* mask to check whether to round up or down */
  static unsigned HOmasks[24] = {
    0,
    1, 2, 4, 0x8, 0x10, 0x20, 0x40, 0x80,
    0x100, 0x200, 0x400, 0x800, 0x1000, 0x2000, 0x4000, 0x8000, 0x10000, 0x20000, 0x40000, 0x80000, 0x100000, 0x200000, 0x400000
  };

  int S = a & masks[8];
  int m = (t >> 8) & ~(((0x1 << 31) >> 8 << 1));
  m &= 0x7fffff;

  if (S > HOmasks[8]) {
    /* round up */
    m += 1;
  } else if (S == HOmasks[8]) {
    /* round down */
    m = m + (m & 1);
  }

  /* special case where last bit of exponent is also set in mantissa
   * and mantissa itself is 0 */
  if (m & (0x1 << 23)) {
    e += 1;
    m = 0;
  }

  int res = sign << 31;
  res |= (e << 23);
  res |= m;
  return res;
}

Does anyone have any idea where the problem lies?

Answer 1

A 32-bit float uses some of the bits for the exponent and therefore cannot represent all 32-bit integer values exactly.

A 64-bit double can store any 32-bit integer value exactly.
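
As a rough illustration of the two statements above, the sketch below round-trips an int through float and through double and reports whether the value survives. The helper names are made up for this example, and it assumes a 32-bit int with IEEE 754 float and double.

```c
/* Hypothetical helpers, assuming 32-bit int and IEEE 754 float/double. */

/* Returns 1 if x survives an int -> float -> integer round trip unchanged. */
static int float_roundtrip_ok(int x)
{
    volatile float f = (float)x;          /* volatile: force a real 32-bit float */
    return (long long)f == (long long)x;  /* long long: (float)INT_MAX is 2^31 */
}

/* Returns 1 if x survives an int -> double -> integer round trip unchanged. */
static int double_roundtrip_ok(int x)
{
    volatile double d = (double)x;
    return (long long)d == (long long)x;
}
```

Under those assumptions, float_roundtrip_ok(16777217) is 0 while double_roundtrip_ok(16777217) is 1.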

Wikipedia has an abbreviated entry on IEEE 754 floating point, and lots of details of the internals of floating point numbers at IEEE 754-1985; the current standard is IEEE 754-2008. It notes that a 32-bit float uses one bit for the sign and 8 bits for the exponent, leaving 23 explicit bits and 1 implicit bit for the mantissa, which is why absolute values up to 2²⁴ can be represented exactly.

I thought that it was clear that a 32-bit integer can't be stored exactly in a 32-bit float. My question is: what happens IF I store an integer bigger than 2^24 or smaller than -2^24? And how can I replicate it?

Once the absolute values are larger than 2²⁴, not every integer value can be represented exactly in the 24 effective digits of the mantissa of a 32-bit float, so only the leading 24 digits are reliably available. Floating point rounding also kicks in.

You can demonstrate with code similar to this:

#include <inttypes.h>
#include <stdio.h>

typedef union Ufloat
{
    uint32_t    i;
    float       f;
} Ufloat;

static void dump_value(uint32_t i, uint32_t v)
{
    Ufloat u = { .i = v };
    printf("0x%.8" PRIX32 ": 0x%.8" PRIX32 " = %15.7e = %15.6A\n", i, v, u.f, u.f);
}

int main(void)
{
    uint32_t lo = 1 << 23;
    uint32_t hi = 1 << 28;
    Ufloat u;

    for (uint32_t v = lo; v < hi; v <<= 1)
    {
        u.f = v;
        dump_value(v, u.i);
    }

    lo = (1 << 24) - 16;
    hi = lo + 64;

    for (uint32_t v = lo; v < hi; v++)
    {
        u.f = v;
        dump_value(v, u.i);
    }

    return 0;
}

Sample output:

0x00800000: 0x4B000000 =   8.3886080e+06 =  0X1.000000P+23
0x01000000: 0x4B800000 =   1.6777216e+07 =  0X1.000000P+24
0x02000000: 0x4C000000 =   3.3554432e+07 =  0X1.000000P+25
0x04000000: 0x4C800000 =   6.7108864e+07 =  0X1.000000P+26
0x08000000: 0x4D000000 =   1.3421773e+08 =  0X1.000000P+27
0x00FFFFF0: 0x4B7FFFF0 =   1.6777200e+07 =  0X1.FFFFE0P+23
0x00FFFFF1: 0x4B7FFFF1 =   1.6777201e+07 =  0X1.FFFFE2P+23
0x00FFFFF2: 0x4B7FFFF2 =   1.6777202e+07 =  0X1.FFFFE4P+23
0x00FFFFF3: 0x4B7FFFF3 =   1.6777203e+07 =  0X1.FFFFE6P+23
0x00FFFFF4: 0x4B7FFFF4 =   1.6777204e+07 =  0X1.FFFFE8P+23
0x00FFFFF5: 0x4B7FFFF5 =   1.6777205e+07 =  0X1.FFFFEAP+23
0x00FFFFF6: 0x4B7FFFF6 =   1.6777206e+07 =  0X1.FFFFECP+23
0x00FFFFF7: 0x4B7FFFF7 =   1.6777207e+07 =  0X1.FFFFEEP+23
0x00FFFFF8: 0x4B7FFFF8 =   1.6777208e+07 =  0X1.FFFFF0P+23
0x00FFFFF9: 0x4B7FFFF9 =   1.6777209e+07 =  0X1.FFFFF2P+23
0x00FFFFFA: 0x4B7FFFFA =   1.6777210e+07 =  0X1.FFFFF4P+23
0x00FFFFFB: 0x4B7FFFFB =   1.6777211e+07 =  0X1.FFFFF6P+23
0x00FFFFFC: 0x4B7FFFFC =   1.6777212e+07 =  0X1.FFFFF8P+23
0x00FFFFFD: 0x4B7FFFFD =   1.6777213e+07 =  0X1.FFFFFAP+23
0x00FFFFFE: 0x4B7FFFFE =   1.6777214e+07 =  0X1.FFFFFCP+23
0x00FFFFFF: 0x4B7FFFFF =   1.6777215e+07 =  0X1.FFFFFEP+23
0x01000000: 0x4B800000 =   1.6777216e+07 =  0X1.000000P+24
0x01000001: 0x4B800000 =   1.6777216e+07 =  0X1.000000P+24
0x01000002: 0x4B800001 =   1.6777218e+07 =  0X1.000002P+24
0x01000003: 0x4B800002 =   1.6777220e+07 =  0X1.000004P+24
0x01000004: 0x4B800002 =   1.6777220e+07 =  0X1.000004P+24
0x01000005: 0x4B800002 =   1.6777220e+07 =  0X1.000004P+24
0x01000006: 0x4B800003 =   1.6777222e+07 =  0X1.000006P+24
0x01000007: 0x4B800004 =   1.6777224e+07 =  0X1.000008P+24
0x01000008: 0x4B800004 =   1.6777224e+07 =  0X1.000008P+24
0x01000009: 0x4B800004 =   1.6777224e+07 =  0X1.000008P+24
0x0100000A: 0x4B800005 =   1.6777226e+07 =  0X1.00000AP+24
0x0100000B: 0x4B800006 =   1.6777228e+07 =  0X1.00000CP+24
0x0100000C: 0x4B800006 =   1.6777228e+07 =  0X1.00000CP+24
0x0100000D: 0x4B800006 =   1.6777228e+07 =  0X1.00000CP+24
0x0100000E: 0x4B800007 =   1.6777230e+07 =  0X1.00000EP+24
0x0100000F: 0x4B800008 =   1.6777232e+07 =  0X1.000010P+24
0x01000010: 0x4B800008 =   1.6777232e+07 =  0X1.000010P+24
0x01000011: 0x4B800008 =   1.6777232e+07 =  0X1.000010P+24
0x01000012: 0x4B800009 =   1.6777234e+07 =  0X1.000012P+24
0x01000013: 0x4B80000A =   1.6777236e+07 =  0X1.000014P+24
0x01000014: 0x4B80000A =   1.6777236e+07 =  0X1.000014P+24
0x01000015: 0x4B80000A =   1.6777236e+07 =  0X1.000014P+24
0x01000016: 0x4B80000B =   1.6777238e+07 =  0X1.000016P+24
0x01000017: 0x4B80000C =   1.6777240e+07 =  0X1.000018P+24
0x01000018: 0x4B80000C =   1.6777240e+07 =  0X1.000018P+24
0x01000019: 0x4B80000C =   1.6777240e+07 =  0X1.000018P+24
0x0100001A: 0x4B80000D =   1.6777242e+07 =  0X1.00001AP+24
0x0100001B: 0x4B80000E =   1.6777244e+07 =  0X1.00001CP+24
0x0100001C: 0x4B80000E =   1.6777244e+07 =  0X1.00001CP+24
0x0100001D: 0x4B80000E =   1.6777244e+07 =  0X1.00001CP+24
0x0100001E: 0x4B80000F =   1.6777246e+07 =  0X1.00001EP+24
0x0100001F: 0x4B800010 =   1.6777248e+07 =  0X1.000020P+24
0x01000020: 0x4B800010 =   1.6777248e+07 =  0X1.000020P+24
0x01000021: 0x4B800010 =   1.6777248e+07 =  0X1.000020P+24
0x01000022: 0x4B800011 =   1.6777250e+07 =  0X1.000022P+24
0x01000023: 0x4B800012 =   1.6777252e+07 =  0X1.000024P+24
0x01000024: 0x4B800012 =   1.6777252e+07 =  0X1.000024P+24
0x01000025: 0x4B800012 =   1.6777252e+07 =  0X1.000024P+24
0x01000026: 0x4B800013 =   1.6777254e+07 =  0X1.000026P+24
0x01000027: 0x4B800014 =   1.6777256e+07 =  0X1.000028P+24
0x01000028: 0x4B800014 =   1.6777256e+07 =  0X1.000028P+24
0x01000029: 0x4B800014 =   1.6777256e+07 =  0X1.000028P+24
0x0100002A: 0x4B800015 =   1.6777258e+07 =  0X1.00002AP+24
0x0100002B: 0x4B800016 =   1.6777260e+07 =  0X1.00002CP+24
0x0100002C: 0x4B800016 =   1.6777260e+07 =  0X1.00002CP+24
0x0100002D: 0x4B800016 =   1.6777260e+07 =  0X1.00002CP+24
0x0100002E: 0x4B800017 =   1.6777262e+07 =  0X1.00002EP+24
0x0100002F: 0x4B800018 =   1.6777264e+07 =  0X1.000030P+24

The first part of the output demonstrates that some integer values can still be stored exactly; specifically, powers of 2 can be stored exactly. In fact, more precisely (but less concisely), any integer where the binary representation of the absolute value has no more than 24 significant digits (any trailing digits are zeros) can be represented exactly. The values can't necessarily be printed exactly, but that's a separate issue from storing them exactly.
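
The "no more than 24 significant digits" condition can be checked directly with bit operations. The following is a minimal sketch; fits_in_float is a name invented for this example, and it assumes the 32-bit IEEE 754 float layout described above.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if x can be stored exactly in a 32-bit
 * IEEE 754 float, i.e. |x| has at most 24 significant binary digits once
 * trailing zeros are ignored. */
static int fits_in_float(int32_t x)
{
    /* |x| computed in unsigned arithmetic so INT32_MIN does not overflow. */
    uint32_t a = x < 0 ? 0u - (uint32_t)x : (uint32_t)x;

    if (a == 0)
        return 1;
    while ((a & 1u) == 0)      /* drop trailing zero bits */
        a >>= 1;
    return a < (1u << 24);     /* what remains must fit in 24 bits */
}
```

For example, fits_in_float(1 << 27) is 1 because a power of 2 has a single significant digit, while fits_in_float(16777217) is 0 because 2²⁴+1 needs 25 significant digits.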

The second (larger) part of the output demonstrates that up to 2²⁴-1, the integer values can be represented exactly. The value of 2²⁴ itself is also exactly representable, but 2²⁴+1 is not, so it appears the same as 2²⁴. By contrast, 2²⁴+2 can be represented with just 24 binary digits followed by 1 zero and hence can be represented exactly. Repeat ad nauseam for increments larger than 2. It looks as though 'round even' mode is in effect; that's why the results show 1 value then 3 values.

(I note in passing that there isn't a way to stipulate that the double passed to printf(), converted from float by the rules of default argument promotions (ISO/IEC 9899:2011 §6.5.2.2 Function calls, ¶6), be printed as a float; the h modifier would logically be used, but it is not defined.)

Answer 2

C/C++ floats tend to be compatible with the IEEE 754 floating point standard (e.g. in gcc). The zeros come from the rounding rules.

Shifting a number to the right makes some bits from the right-hand side go away. Let's call them guard bits. Now let's call HO the highest bit and LO the lowest bit of our number. Now suppose that the guard bits are still a part of our number. If, for example, we have 3 guard bits, it means that the value of our LO bit is 8 (if it is set). Now if:

value of guard bits > 0.5 * value of LO
- round the number to the smallest possible greater value, ignoring the sign
value of guard bits == 0.5 * value of LO
- use current number value if LO == 0
- number += 1 otherwise
value of guard bits < 0.5 * value of LO
- use current number value

Why do 3 guard bits mean the LO value is 8?

Suppose we have a binary 8 bit number:

weights:    128 64 32 16 8 4 2 1
binary num:   0  0  0  0 1 1 1 1

Let's shift it right by 3 bits:

weights:      x x x 128 64 32 16 8 | 4 2 1
binary num:   0 0 0   0  0  0  0 1 | 1 1 1

As you see, with 3 guard bits the LO bit ends up at the 4th position and has a weight of 8. This is true only for the purpose of rounding. The weights have to be 'normalized' afterwards, so that the weight of the LO bit becomes 1 again.
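
The 8 bit example can be traced with a small helper that shifts right and rounds on the guard bits using the rules listed earlier. This is only a sketch: shift_right_round_even is a name invented here, and it computes the half-way value with a shift instead of a lookup table.

```c
/* Hypothetical helper: n >> bits, rounded to nearest with ties going to the
 * even result. Assumes 0 < bits < the width of unsigned. */
static unsigned shift_right_round_even(unsigned n, unsigned bits)
{
    unsigned guard = n & ((1u << bits) - 1u);  /* the bits that get shifted out */
    unsigned half  = 1u << (bits - 1);         /* 0.5 * weight of the LO bit */
    unsigned q     = n >> bits;                /* truncated result, LO weight 1 again */

    /* Round up when above half, or exactly half with an odd LO bit. */
    if (guard > half || (guard == half && (q & 1u)))
        q++;
    return q;
}
```

For the number 00001111 shifted by 3, the guard bits are 111 (value 7), which is more than half of 8, so the truncated result 1 is rounded up to 2.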

And how can I check with bit operations whether guard bits > 0.5 * value?

The fastest way is to employ lookup tables. Suppose we're working on an 8 bit number:

unsigned number;          //our number
unsigned bitsToShift;     //number of bits to shift

assert(bitsToShift < 8);  //8 bits

unsigned guardMasks[8] = {0, 1, 3, 7, 0xf, 0x1f, 0x3f, 0x7f};
unsigned LOvalues[8] = {0, 1, 2, 4, 0x8, 0x10, 0x20, 0x40}; //already divided by 2 for faster comparison

unsigned guardBits = number & guardMasks[bitsToShift]; //value of the guard bits
number = number >> bitsToShift;

if(guardBits > LOvalues[bitsToShift]) {
...
} else if (guardBits == LOvalues[bitsToShift]) {
...
} else { //guardBits < LOvalues[bitsToShift]
...
}
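
Applied to the 32-bit int to float case from the question, the rounding rules could be put together roughly as follows. This is a sketch, not the asker's routine: int_to_float_bits is a hypothetical name, unsigned arithmetic replaces the signed shifts, and it assumes IEEE 754 single precision with round-to-nearest-even. The guard bits are taken from the normalized (left-shifted) value, since those are the bits that actually get dropped.

```c
#include <stdint.h>

/* Hypothetical helper: bit pattern of (float)x, assuming IEEE 754 binary32
 * and round-to-nearest-even. */
static uint32_t int_to_float_bits(int32_t x)
{
    if (x == 0)
        return 0;

    uint32_t sign = x < 0 ? 0x80000000u : 0u;
    /* |x| in unsigned arithmetic, so INT32_MIN does not overflow. */
    uint32_t a = x < 0 ? 0u - (uint32_t)x : (uint32_t)x;

    /* Normalize: shift left until bit 31 is set. 158 = bias 127 + 31. */
    uint32_t e = 158;
    while (!(a & 0x80000000u)) {
        a <<= 1;
        e--;
    }

    uint32_t m     = a >> 8;     /* 24 bits: implicit 1 plus 23 stored bits */
    uint32_t guard = a & 0xFFu;  /* the 8 bits that were shifted out */

    /* Round up when above half, or exactly half with an odd LO bit. */
    if (guard > 0x80u || (guard == 0x80u && (m & 1u)))
        m++;

    /* A carry out of the 24-bit mantissa bumps the exponent. */
    if (m & 0x01000000u) {
        m >>= 1;
        e++;
    }

    return sign | (e << 23) | (m & 0x007FFFFFu);
}
```

Traced by hand, the question's input 0x80800001 (that is, -0x7F7FFFFF) gives 0xCEFF0000, which matches the MUST BE pattern 11001110111111110000000000000000.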

Reference: Write Great Code, Volume 1 by Randall Hyde
