
What happens in the background when converting an int to a float?

I don't quite understand how an int is cast to a float, step by step. Assume I have a signed integer in binary format, and I want to convert it to a float by hand; however, I can't work out how. Can someone show me that conversion step by step?

I do that conversion in C many times, like:

  int a = foo();
  float f = (float) a;

But I haven't figured out what happens behind the scenes. To understand it well, I want to do the conversion by hand.

EDIT: If you know a lot about conversions, you could also cover float-to-double conversion, and float to int.

Floating point values (IEEE754 ones, anyway) basically have three components:

  • a sign s ;
  • a series of exponent bits e ; and
  • a series of mantissa bits m .

The precision dictates how many bits are available for the exponent and mantissa. Let's examine the value 0.1 for single-precision floating point:

s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm    1/n
0 01111011 10011001100110011001101
           ||||||||||||||||||||||+- 8388608
           |||||||||||||||||||||+-- 4194304
           ||||||||||||||||||||+--- 2097152
           |||||||||||||||||||+---- 1048576
           ||||||||||||||||||+-----  524288
           |||||||||||||||||+------  262144
           ||||||||||||||||+-------  131072
           |||||||||||||||+--------   65536
           ||||||||||||||+---------   32768
           |||||||||||||+----------   16384
           ||||||||||||+-----------    8192
           |||||||||||+------------    4096
           ||||||||||+-------------    2048
           |||||||||+--------------    1024
           ||||||||+---------------     512
           |||||||+----------------     256
           ||||||+-----------------     128
           |||||+------------------      64
           ||||+-------------------      32
           |||+--------------------      16
           ||+---------------------       8
           |+----------------------       4
           +-----------------------       2

The sign is positive, that's pretty easy.

The exponent is 64+32+16+8+2+1 = 123 - 127 bias = -4, so the multiplier is 2^-4 or 1/16. The bias is there so that you can get really small numbers (like 10^-30) as well as large ones.

The mantissa is chunky. It consists of 1 (the implicit base) plus, for all the bits that are set (with each bit being worth 1/2^n as n starts at 1 and increases to the right), {1/2, 1/16, 1/32, 1/256, 1/512, 1/4096, 1/8192, 1/65536, 1/131072, 1/1048576, 1/2097152, 1/8388608}.

When you add all these up, you get 1.60000002384185791015625 .

When you multiply that by the 2^-4 multiplier, you get 0.100000001490116119384765625, which is why they say you cannot represent 0.1 exactly as an IEEE754 float.

In terms of converting integers to floats, if the mantissa (including the implicit 1) has at least as many bits as the integer, you can just transfer the integer bit pattern over and select the correct exponent. There will be no loss of precision. For example, a double-precision IEEE754 value (64 bits, 52/53 of those being mantissa) has no problem taking on a 32-bit integer.

If there are more bits in your integer (such as a 32-bit integer and a 32-bit single precision float, which only has 23/24 bits of mantissa) then you need to scale the integer.

This involves stripping off the least significant bits (rounding, actually) so that the value fits into the mantissa bits. That loses precision, of course, but it's unavoidable.


Let's have a look at a specific value, 123456789 . The following program dumps the bits of each data type.

#include <stdio.h>

static void dumpBits (char *desc, unsigned char *addr, size_t sz) {
    unsigned char mask;
    printf ("%s:\n  ", desc);
    while (sz-- != 0) {
        putchar (' ');
        for (mask = 0x80; mask > 0; mask >>= 1)
            if ((addr[sz] & mask) == 0)
                putchar ('0');
            else
                putchar ('1');
    }
    putchar ('\n');
}

int main (void) {
    int intNum = 123456789;
    float fltNum = intNum;
    double dblNum = intNum;

    printf ("%d %f %f\n",intNum, fltNum, dblNum);
    dumpBits ("Integer", (unsigned char *)(&intNum), sizeof (int));
    dumpBits ("Float", (unsigned char *)(&fltNum), sizeof (float));
    dumpBits ("Double", (unsigned char *)(&dblNum), sizeof (double));

    return 0;
}

The output on my system is as follows:

123456789 123456792.000000 123456789.000000
Integer:
   00000111 01011011 11001101 00010101
Float:
   01001100 11101011 01111001 10100011
Double:
   01000001 10011101 01101111 00110100 01010100 00000000 00000000 00000000

And we'll look at these one at a time. First the integer, simple powers of two:

   00000111 01011011 11001101 00010101
        |||  | || || ||  || |    | | +->          1
        |||  | || || ||  || |    | +--->          4
        |||  | || || ||  || |    +----->         16
        |||  | || || ||  || +---------->        256
        |||  | || || ||  |+------------>       1024
        |||  | || || ||  +------------->       2048
        |||  | || || |+---------------->      16384
        |||  | || || +----------------->      32768
        |||  | || |+------------------->      65536
        |||  | || +-------------------->     131072
        |||  | |+---------------------->     524288
        |||  | +----------------------->    1048576
        |||  +------------------------->    4194304
        ||+---------------------------->   16777216
        |+----------------------------->   33554432
        +------------------------------>   67108864
                                         ==========
                                          123456789

Now let's look at the single precision float. Notice the bit pattern of the mantissa matching the integer as a near-perfect match:

mantissa:       11 01011011 11001101 00011    (spaced out).
integer:  00000111 01011011 11001101 00010101 (untouched).

There's an implicit 1 bit to the left of the mantissa and it's also been rounded at the other end, which is where that loss of precision comes from (the value changing from 123456789 to 123456792 as in the output from that program above).

Working out the values:

s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm    1/n
0 10011001 11010110111100110100011
           || | || ||||  || |   |+- 8388608
           || | || ||||  || |   +-- 4194304
           || | || ||||  || +------  262144
           || | || ||||  |+--------   65536
           || | || ||||  +---------   32768
           || | || |||+------------    4096
           || | || ||+-------------    2048
           || | || |+--------------    1024
           || | || +---------------     512
           || | |+-----------------     128
           || | +------------------      64
           || +--------------------      16
           |+----------------------       4
           +-----------------------       2

The sign is positive. The exponent is 128+16+8+1 = 153 - 127 bias = 26, so the multiplier is 2^26 or 67108864.

The mantissa is 1 (the implicit base) plus (as explained above), {1/2, 1/4, 1/16, 1/64, 1/128, 1/512, 1/1024, 1/2048, 1/4096, 1/32768, 1/65536, 1/262144, 1/4194304, 1/8388608} . When you add all these up, you get 1.83964955806732177734375 .

When you multiply that by the 2^26 multiplier, you get 123456792, the same as the program output.

The double bitmask output is:

s eeeeeeeeeee mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
0 10000011001 1101011011110011010001010100000000000000000000000000

I am not going to go through the process of figuring out the value of that beast :-) However, I will show the mantissa next to the integer format to show the common bit representation:

mantissa:       11 01011011 11001101 00010101 000...000 (spaced out).
integer:  00000111 01011011 11001101 00010101           (untouched).

You can once again see the commonality with the implicit bit on the left and the vastly greater bit availability on the right, which is why there's no loss of precision in this case.


In terms of converting between floats and doubles, that's also reasonably easy to understand.

You first have to check for the special values such as NaN and the infinities. These are indicated by special exponent/mantissa combinations, and it's probably easier to detect them up front and generate the equivalent in the new format.

Then, in the case where you're going from double to float, you obviously have a smaller range available since there are fewer bits in the exponent. If your double is outside the range of a float, you need to handle that.

Assuming it will fit, you then need to:

  • rebase the exponent (the bias is different for the two types);
  • copy as many bits from the mantissa as will fit (rounding if necessary); and
  • pad out the rest of the target mantissa (if any) with zero bits.

Conceptually this is quite simple. A float (in IEEE 754-1985) has the following representation:

  • 1 bit sign
  • 8 bits exponent (0 means denormalized numbers, 1 means -126, 127 means 0, 255 means infinity or NaN)
  • 23 bits mantissa (the part that follows the "1.")

So basically it's roughly:

  • determine the sign and the magnitude of the number
  • find the 24 most significant bits, properly rounded
  • adjust the exponent
  • encode these three parts into the 32 bits form

When implementing your own conversion, it's easy to test, since you can just compare the results to the builtin type conversion operator.
