16-bit floating-point multiplication in C

Question

I'm working on a small project where I need multiplication of 16-bit (half-precision) floats. Unfortunately, I'm facing some problems with the algorithm:

Example Output

1 * 5 = 5
2 * 5 = 10
3 * 5 = 14.5
4 * 5 = 20
5 * 5 = 24.5

100 * 4 = 100
100 * 5 = 482

The Source Code

const int bits = 16;
const int exponent_length = 5;
const int fraction_length = 10;

const int bias = (1 << (exponent_length - 1)) - 1;  // 15 for half precision; integer shift instead of pow()
const int exponent_mask = ((1 << 5) - 1) << fraction_length;
const int fraction_mask = (1 << fraction_length) - 1;
const int hidden_bit = (1 << 10);  // Was 1 << 11 before update 1

int float_mul(int f1, int f2) {
    int res_exp = 0;
    int res_frac = 0;
    int result = 0;

    int exp1 = (f1 & exponent_mask) >> fraction_length;
    int exp2 = (f2 & exponent_mask) >> fraction_length;
    int frac1 = (f1 & fraction_mask) | hidden_bit;
    int frac2 = (f2 & fraction_mask) | hidden_bit;

    // Add exponents
    res_exp = exp1 + exp2 - bias;  // Remove double bias

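    // Multiply significands
    res_frac = frac1 * frac2;   // 11 bit * 11 bit → 21 or 22 bit
    // Shift the product right so that the hidden bit plus 10 fraction bits remain.
    // highest_bit_pos() is not shown in the question; it is assumed to return
    // the zero-based index of the highest set bit.
    if (highest_bit_pos(res_frac) == 21) {
        res_frac >>= 11;
        res_exp += 1;
    } else {
        res_frac >>= 10;
    }
    res_frac &= ~hidden_bit;    // Remove hidden bit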

    // Construct float
    return (res_exp << bits - exponent_length - 1) | res_frac;
}

By the way: I'm storing the floats in ints because I'll try to port this code to some kind of assembler without floating-point operations later.

The Question

Why does the code work for some values only? Did I forget some normalization or similar? Or does it work only by accident?

Disclaimer: I'm not a CompSci student, it's a leisure project ;)

Update #1

Thanks to the comment by Eric Postpischil, I noticed one problem with the code: the hidden_bit flag was off by one (it should be 1 << 10). With that change, I no longer get decimal places, but some calculations are still off (e.g. 3•3=20). I assume it's the res_frac shift, as described in the answers.

Update #2

The second problem with the code was indeed the res_frac shifting. After update #1, I got wrong results whenever frac1 * frac2 produced a 22-bit result. I've updated the code above with the corrected shift statement. Thanks to all for every comment and answer! :)

Answer 1

From a cursory look:

No attempt is made to determine the location of the high bit in the product. Two 11-bit numbers, each with its high bit set, may produce a 21- or 22-bit number. (Example with two-bit numbers: 10₂ • 10₂ is 100₂, three bits, but 11₂ • 11₂ is 1001₂, four bits.)
The result is truncated instead of rounded.
Signs are ignored.
Subnormal numbers are not handled, on input or output.
11 is hardcoded as a shift amount in one place. This is likely incorrect; the correct amount will depend on how the significand is handled for normalization and rounding.
In decoding, the exponent field is shifted right by fraction_length. In encoding, it is shifted left by bits - exponent_length - 1. To avoid bugs, the same expression should be used in both places.
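
Two of the points above, locating the product's high bit and using one shift amount for both decoding and encoding, might look like the following sketch. It assumes positive, normal inputs and a result in the normal range, and it still truncates instead of rounding. The name half_mul_trunc and the enum constants are made up for this illustration and do not come from the question.

```c
#include <stdint.h>

/* Illustrative constants (hypothetical names, not from the question). */
enum {
    FRACTION_LENGTH = 10,
    EXPONENT_LENGTH = 5,
    BIAS            = (1 << (EXPONENT_LENGTH - 1)) - 1,   /* 15 */
    HIDDEN_BIT      = 1 << FRACTION_LENGTH,
    FRACTION_MASK   = HIDDEN_BIT - 1,
    EXPONENT_MASK   = ((1 << EXPONENT_LENGTH) - 1) << FRACTION_LENGTH
};

/* Multiply two positive, normal half-precision bit patterns.
 * Assumption: no signs, subnormals, infinities, NaNs, overflow or underflow.
 * The result is truncated, not rounded. */
uint16_t half_mul_trunc(uint16_t f1, uint16_t f2)
{
    int exp1 = (f1 & EXPONENT_MASK) >> FRACTION_LENGTH;
    int exp2 = (f2 & EXPONENT_MASK) >> FRACTION_LENGTH;
    uint32_t sig1 = (uint32_t)((f1 & FRACTION_MASK) | HIDDEN_BIT);
    uint32_t sig2 = (uint32_t)((f2 & FRACTION_MASK) | HIDDEN_BIT);

    int res_exp = exp1 + exp2 - BIAS;      /* remove the doubled bias */
    uint32_t product = sig1 * sig2;        /* 21 or 22 significant bits */

    /* If bit 21 is set, the product is in [2, 4): shift one extra place
     * and bump the exponent. Otherwise it is in [1, 2). */
    if (product & (UINT32_C(1) << (2 * FRACTION_LENGTH + 1))) {
        product >>= FRACTION_LENGTH + 1;
        res_exp += 1;
    } else {
        product >>= FRACTION_LENGTH;
    }

    /* Encoding uses the same FRACTION_LENGTH shift as decoding. */
    return (uint16_t)(((uint32_t)res_exp << FRACTION_LENGTH)
                      | (product & FRACTION_MASK));
}
```

Under these assumptions, half_mul_trunc(0x4200, 0x4200) (3 • 3) gives 0x4880 (9), and half_mul_trunc(0x5640, 0x4500) (100 • 5) gives 0x5FD0 (500).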

From a more detailed look by chux:

res_frac = frac1 * frac2 fails if int is less than 23 bits (22 for the product and one for the sign).
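
One way to sidestep the int-width concern, sketched here as an assumption rather than as part of the original code, is to do the multiplication in a fixed-width unsigned type. The helper name significand_product is hypothetical.

```c
#include <stdint.h>

/* Multiply two 11-bit significands (hidden bit included) in 32-bit
 * unsigned arithmetic, so the 22-bit product fits regardless of how
 * wide plain int is on the target. */
uint32_t significand_product(uint16_t sig1, uint16_t sig2)
{
    return (uint32_t)sig1 * (uint32_t)sig2;
}
```

The largest case, 0x7FF • 0x7FF, yields 0x3FF001, which needs 22 bits. On a target with 16-bit registers this would have to be built from partial products, which is worth keeping in mind for the planned assembler port.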

Answer 2

One problem is that you are truncating instead of rounding:

res_frac >>= 11;            // Shift 22bit int right to fit into 10 bit

You should first compute res_frac & 0x7ff, the part of the 22-bit result that your algorithm is about to discard, and compare it to 0x400. If it is below, truncate. If it is above, round away from zero. If it is equal to 0x400, round to the even alternative.
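
The comparison described above can be generalized to any shift amount. The following is a minimal sketch, assuming the value is an unsigned magnitude (so "away from zero" means adding 1) and the shift is between 1 and 31; the function name shift_right_rne is made up for this example.

```c
#include <stdint.h>

/* Shift value right by `shift` bits (1..31), rounding to nearest with
 * ties going to the even result. For shift == 11 the discarded part is
 * value & 0x7ff and the halfway point is 0x400. */
uint32_t shift_right_rne(uint32_t value, unsigned shift)
{
    uint32_t half      = UINT32_C(1) << (shift - 1);
    uint32_t discarded = value & ((UINT32_C(1) << shift) - 1);
    uint32_t kept      = value >> shift;

    /* Above half: round up. Exactly half: round up only if kept is odd. */
    if (discarded > half || (discarded == half && (kept & 1u)))
        kept += 1;

    return kept;
}
```

Rounding up can carry out of the significand (for example 0x7FF becomes 0x800), so the caller would still need to renormalize and adjust the exponent in that case.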

Answer 3

This is more a suggestion for how to make it easier to get your code right than an analysis of what is wrong with the existing code.

There are a number of steps that are common to some or all of the floating-point arithmetic operations. I suggest extracting each into a function that can be written with a focus on one issue and tested separately. Then, when you come to write e.g. multiplication, you only have to deal with the specifics of that operation.

All the operations will be easier when working with a structure that has the actual signed exponent and the full significand in a wider unsigned integer field. If you were dealing with signed numbers, it would also have a boolean for the sign bit.

Here are some sample operations that could be separate functions, at least until you get it working:

unpack: Take a 16-bit float and extract the exponent and significand into a struct.

pack: Undo unpack. Deal with dropping the hidden bit, applying the bias to the exponent, and combining them into a float.

normalize: Shift the significand and adjust the exponent to bring the most significant 1 bit to a specified bit position.

round: Apply your rounding rules to drop low-significance bits. If you want to do IEEE 754 style round-to-nearest, you need a guard bit, which is the most significant bit that will be dropped, and an additional bit indicating whether there are any 1 bits of lower significance than the guard bit.
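
The unpack, normalize and pack steps listed above could be sketched as follows. This is only an illustration under stated assumptions: positive, normal values only, no sign field, a nonzero significand, and a normalize step that truncates (a separate round step would go before the final shift). The struct layout and all names (unpacked_half, half_unpack, half_normalize, half_pack, half_mul_sketch) are hypothetical.

```c
#include <stdint.h>

/* Value represented: significand * 2^(exponent - 10),
 * i.e. the binary point sits just right of bit 10. */
typedef struct {
    int      exponent;     /* actual (unbiased) exponent */
    uint32_t significand;  /* full significand, hidden bit included */
} unpacked_half;

unpacked_half half_unpack(uint16_t h)
{
    unpacked_half u;
    u.exponent    = (int)((h >> 10) & 0x1Fu) - 15;   /* remove bias */
    u.significand = (h & 0x3FFu) | 0x400u;           /* add hidden bit */
    return u;
}

/* Bring the highest set bit to bit 10. Assumes significand != 0.
 * Right shifts drop low bits without rounding. */
unpacked_half half_normalize(unpacked_half u)
{
    while (u.significand >= 0x800u) { u.significand >>= 1; u.exponent += 1; }
    while (u.significand <  0x400u) { u.significand <<= 1; u.exponent -= 1; }
    return u;
}

/* Assumes a normalized value whose biased exponent is in 1..30. */
uint16_t half_pack(unpacked_half u)
{
    return (uint16_t)(((uint32_t)(u.exponent + 15) << 10)
                      | (u.significand & 0x3FFu));   /* drop hidden bit */
}

/* Multiplication then only deals with its own specifics. */
uint16_t half_mul_sketch(uint16_t a, uint16_t b)
{
    unpacked_half x = half_unpack(a), y = half_unpack(b), r;
    r.significand = x.significand * y.significand;   /* scaled by 2^20 */
    r.exponent    = x.exponent + y.exponent - 10;    /* rescale to 2^10 */
    return half_pack(half_normalize(r));
}
```

With this split, half_pack(half_unpack(x)) can be tested on its own as a round trip for any normal x, before any arithmetic is written. For example, half_mul_sketch(0x4200, 0x4200) (3 • 3) gives 0x4880 (9).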

3 answers

Answer 1
3 (accepted) 2013-08-28 16:08:17

Answer 2
1 2013-08-28 16:10:53

Answer 3
1 2013-08-28 18:44:39
