16bit Float Multiplication in C

I'm working on a small project where I need multiplication of 16-bit floats (half precision). Unfortunately, I'm facing some problems with the algorithm:

Example Output

1 * 5 = 5
2 * 5 = 10
3 * 5 = 14.5
4 * 5 = 20
5 * 5 = 24.5

100 * 4 = 100
100 * 5 = 482

The Source Code

const int bits = 16;
const int exponent_length = 5;
const int fraction_length = 10;

const int bias = (1 << (exponent_length - 1)) - 1;  // 15; avoids pow() and <math.h>
const int exponent_mask = ((1 << exponent_length) - 1) << fraction_length;
const int fraction_mask = (1 << fraction_length) - 1;
const int hidden_bit = (1 << 10);  // Was 1 << 11 before update 1

int float_mul(int f1, int f2) {
    int res_exp = 0;
    int res_frac = 0;
    int result = 0;

    int exp1 = (f1 & exponent_mask) >> fraction_length;
    int exp2 = (f2 & exponent_mask) >> fraction_length;
    int frac1 = (f1 & fraction_mask) | hidden_bit;
    int frac2 = (f2 & fraction_mask) | hidden_bit;

    // Add exponents
    res_exp = exp1 + exp2 - bias;  // Remove double bias

    // Multiply significands
    res_frac = frac1 * frac2;   // 11 bit * 11 bit → 21 or 22 bit!
    // Shift the product right so that 11 bits (hidden bit + fraction) remain
    if (highest_bit_pos(res_frac) == 21) {
        res_frac >>= 11;
        res_exp += 1;
    } else {
        res_frac >>= 10;
    }
    res_frac &= ~hidden_bit;    // Remove hidden bit

    // Construct float
    return (res_exp << bits - exponent_length - 1) | res_frac;
}
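
The helper highest_bit_pos is not part of the listing above. A minimal sketch, assuming it is meant to return the zero-based index of the most significant set bit (so 21 means the product occupies 22 bits), could look like this:

```c
/* Hypothetical helper, not from the original post: returns the zero-based
   index of the most significant set bit of x, or -1 if x is zero. */
int highest_bit_pos(unsigned int x) {
    int pos = -1;
    while (x != 0) {
        x >>= 1;   /* drop the lowest bit */
        pos++;     /* count how many bits were present */
    }
    return pos;
}
```

With this definition, a product of 0x240000 (3 * 3 with both hidden bits set) yields 21, and 0x190000 (5 * 5) yields 20.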

By the way: I'm storing the floats in ints because I'll try to port this code to some kind of assembler without floating-point operations later.

The Question

Why does the code work for some values only? Did I forget some normalization or similar? Or does it work only by accident?

Disclaimer: I'm not a CompSci student, it's a leisure project ;)

Update #1

Thanks to the comment by Eric Postpischil I noticed one problem with the code: the hidden_bit flag was off by one (it should be 1 << 10). With that change I don't get decimal places any more, but some calculations are still off (e.g. 3 * 3 = 20). I assume it's the res_frac shift, as described in the answers.

Update #2

The second problem with the code was indeed the res_frac shift. After update #1 I got wrong results whenever frac1 * frac2 produced a 22-bit result. I've updated the code above with the corrected shift statement. Thanks to all for every comment and answer! :)

From a cursory look:

  • No attempt is made to determine the location of the high bit in the product. Two 11-bit numbers, each with its high bit set, may produce a 21- or 22-bit number. (Example with two-bit numbers: 10₂ • 10₂ is 100₂, three bits, but 11₂ • 11₂ is 1001₂, four bits.)
  • The result is truncated instead of rounded.
  • Signs are ignored.
  • Subnormal numbers are not handled, on input or output.
  • 11 is hardcoded as a shift amount in one place. This is likely incorrect; the correct amount will depend on how the significand is handled for normalization and rounding.
  • In decoding, the exponent field is shifted right by fraction_length. In encoding, it is shifted left by bits - exponent_length - 1. To avoid bugs, the same expression should be used in both places.
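
To illustrate the first point: with the hidden bit set, each significand lies between 0x400 and 0x7ff, so the product lies between 2^20 and just under 2^22 and needs 21 or 22 bits. The sketch below picks the shift accordingly. The name normalize_product is made up for this example, and the result is still truncated rather than rounded.

```c
#include <stdint.h>

/* Hypothetical helper: multiplies two 11-bit significands (hidden bit
   included) and returns the 11-bit truncated significand of the product.
   *exp is incremented when the product needed 22 bits. */
uint32_t normalize_product(uint32_t sig1, uint32_t sig2, int *exp) {
    uint32_t product = sig1 * sig2;           /* 21 or 22 bits wide */
    if (product & (UINT32_C(1) << 21)) {      /* high bit at position 21 */
        *exp += 1;
        return product >> 11;
    }
    return product >> 10;                     /* high bit at position 20 */
}
```

For 3 * 3 (significand 0x600, exponent 1 each) the product 0x240000 has bit 21 set, so the result is 0x480 with the exponent raised from 2 to 3, that is 1.125 * 2^3 = 9. For 5 * 5 (significand 0x500) the product stays at 21 bits and the exponent is unchanged.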

From a more detailed look by chux:

  • res_frac = frac1 * frac2 fails if int is less than 23 bits (22 for the product and one for the sign).

One problem is that you are truncating instead of rounding:

res_frac >>= 11;            // Shift 22bit int right to fit into 10 bit

You should compute res_frac & 0x7ff first, the part of the 22-bit result that your algorithm is about to discard, and compare it to 0x400. If it is below, truncate. If it is above, round away from zero. If it is equal to 0x400, round to the even alternative.
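
A minimal sketch of that comparison, assuming the product really is 22 bits wide so that 11 bits are discarded (the function name is made up for this example). Since only the magnitude is handled here, rounding away from zero means rounding up:

```c
#include <stdint.h>

/* Shift a 22-bit product right by 11 bits, rounding to nearest with ties
   going to the even result. Rounding can carry out (result 0x800); the
   caller then has to shift right once more and bump the exponent. */
uint32_t shift11_round_nearest_even(uint32_t product) {
    uint32_t discarded = product & 0x7ff;   /* the 11 bits about to be dropped */
    uint32_t kept = product >> 11;

    if (discarded > 0x400) {
        kept += 1;                          /* above one half: round up */
    } else if (discarded == 0x400) {
        kept += kept & 1;                   /* exactly one half: go to even */
    }
    return kept;                            /* below one half: plain truncation */
}
```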

This is more a suggestion for how to make it easier to get your code right, rather than analysis of what is wrong with the existing code.

There are a number of steps that are common to some or all of the floating-point arithmetic operations. I suggest extracting each into a function that can be written with a focus on one issue and tested separately. Then, when you come to write e.g. multiplication, you only have to deal with the specifics of that operation.

All the operations will be easier working with a structure that has the actual signed exponent, and the full significand in a wider unsigned integer field. If you were dealing with signed numbers, it would also have a boolean for the sign bit.

Here are some sample operations that could be separate functions, at least until you get it working:

unpack: Take a 16 bit float and extract the exponent and significand into a struct.

pack: Undo unpack - deal with dropping the hidden bit, applying the bias to the exponent, and combining them into a float.
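
As a rough sketch of these two steps, assuming normal, positive numbers only (no sign, zero, subnormals, infinities or NaNs). The struct layout and all names are illustrative choices, not a fixed design:

```c
#include <stdint.h>

/* Hypothetical unpacked form: the true (unbiased) exponent plus the 11-bit
   significand with the hidden bit made explicit. */
typedef struct {
    int exponent;          /* unbiased exponent */
    uint32_t significand;  /* 1.fraction scaled by 2^10: 0x400..0x7ff */
} unpacked_half;

enum { FRACTION_BITS = 10, EXPONENT_BIAS = 15 };

/* Split a half-precision bit pattern into exponent and significand. */
unpacked_half unpack(uint16_t h) {
    unpacked_half u;
    u.exponent = (int)((h >> FRACTION_BITS) & 0x1f) - EXPONENT_BIAS;
    u.significand = (h & 0x3ffu) | 0x400u;        /* restore the hidden bit */
    return u;
}

/* Inverse of unpack: re-apply the bias and drop the hidden bit.
   Assumes the significand is already normalized to 0x400..0x7ff. */
uint16_t pack(unpacked_half u) {
    uint32_t biased = (uint32_t)(u.exponent + EXPONENT_BIAS);
    uint32_t fraction = u.significand & 0x3ffu;   /* drop the hidden bit */
    return (uint16_t)((biased << FRACTION_BITS) | fraction);
}
```

For example, 5.0 is stored as 0x4500, which unpacks to exponent 2 and significand 0x500 (1.25 * 2^2), and packing that pair gives 0x4500 again. Using FRACTION_BITS for the shift in both directions keeps decoding and encoding consistent.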

normalize: Shift the significand and adjust the exponent to bring the most significant 1-bit to a specified bit position.
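
One possible shape for normalize, treating the value as significand * 2^exponent so that every shift is compensated in the exponent. The struct and names are hypothetical, target_bit is assumed to be between 0 and 30, and bits shifted out on the right are simply lost, so rounding has to happen separately:

```c
#include <stdint.h>

/* Hypothetical unpacked form used only for this sketch. */
typedef struct {
    int exponent;
    uint32_t significand;
} unpacked_half;

/* Shift the significand until its most significant 1-bit sits at
   target_bit, adjusting the exponent so the represented value is kept.
   A zero significand cannot be normalized and is returned unchanged. */
unpacked_half normalize(unpacked_half u, int target_bit) {
    if (u.significand == 0) {
        return u;
    }
    while (u.significand >> (target_bit + 1)) {               /* too wide */
        u.significand >>= 1;
        u.exponent += 1;
    }
    while (!(u.significand & (UINT32_C(1) << target_bit))) {  /* too narrow */
        u.significand <<= 1;
        u.exponent -= 1;
    }
    return u;
}
```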

round: Apply your rounding rules to drop low significance bits. If you want to do IEEE 754 style round-to-nearest, you need a guard digit that is the most significant bit that will be dropped, and an additional bit indicating if there are any one bits of lower significance than the guard bit.
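
The guard and "any lower bits" (often called sticky) idea can be sketched as follows. This is an illustration for unsigned magnitudes only, with a made-up function name, and it assumes 1 <= drop <= 31:

```c
#include <stdint.h>

/* Drop the low `drop` bits of value, rounding to nearest with ties to even.
   guard  = most significant dropped bit
   sticky = 1 if any bit below the guard bit is set
   Rounding may carry into a new top bit; the caller must renormalize. */
uint32_t round_nearest_even(uint32_t value, int drop) {
    uint32_t kept = value >> drop;
    uint32_t guard = (value >> (drop - 1)) & 1;
    uint32_t sticky = (value & ((UINT32_C(1) << (drop - 1)) - 1)) != 0;

    /* Round up when above one half (guard and sticky), or exactly one
       half (guard only) with an odd kept value. */
    if (guard && (sticky || (kept & 1))) {
        kept += 1;
    }
    return kept;
}
```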
