Float32 to Float16

Can someone explain to me how to convert a 32-bit floating point value to a 16-bit floating point value?

(s = sign, e = exponent, m = mantissa)

If 32-bit float is 1s7e24m
And 16-bit float is 1s5e10m

Then is it as simple as doing this?

int     fltInt32;
short   fltInt16;
memcpy( &fltInt32, &flt, sizeof( float ) );

fltInt16 = (fltInt32 & 0x00FFFFFF) >> 14;
fltInt16 |= ((fltInt32 & 0x7f000000) >> 26) << 10;
fltInt16 |= ((fltInt32 & 0x80000000) >> 16);

I'm assuming it ISN'T that simple ... so can anyone tell me what you DO need to do?

Edit: I can see I've got my exponent shift wrong ... so would THIS be better?

fltInt16 =  (fltInt32 & 0x007FFFFF) >> 13;
fltInt16 |= (fltInt32 & 0x7c000000) >> 13;
fltInt16 |= (fltInt32 & 0x80000000) >> 16;

I'm hoping this is correct. Apologies if I'm missing something obvious that has been said. It's almost midnight on a Friday night ... so I'm not "entirely" sober ;)

Edit 2: Ooops. Buggered it again. I want to lose the top 3 bits, not the lower! So how about this:

fltInt16 =  (fltInt32 & 0x007FFFFF) >> 13;
fltInt16 |= (fltInt32 & 0x0f800000) >> 13;
fltInt16 |= (fltInt32 & 0x80000000) >> 16;

Final code should be:

fltInt16    =  ((fltInt32 & 0x7fffffff) >> 13) - (0x38000000 >> 13);
fltInt16    |= ((fltInt32 & 0x80000000) >> 16);

The exponent needs to be unbiased, clamped and rebiased. This is the fast code I use:

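unsigned int fltInt32;      /* input: bit pattern of the float32 */
unsigned short fltInt16;    /* output: bit pattern of the float16 */

fltInt16 = (fltInt32 >> 31) << 5;                /* sign bit, parked at bit 5 for now */
unsigned short tmp = (fltInt32 >> 23) & 0xff;    /* biased float32 exponent */
/* Rebias by 127 - 15 = 0x70. The mask is 0x1f when tmp > 0x70, otherwise 0.
   This relies on >> of a negative int being an arithmetic shift. */
tmp = (tmp - 0x70) & ((unsigned int)((int)(0x70 - tmp) >> 4) >> 27);
fltInt16 = (fltInt16 | tmp) << 10;               /* move sign and exponent into place */
fltInt16 |= (fltInt32 >> 13) & 0x3ff;            /* top 10 mantissa bits, truncated */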

This code would be even faster with a lookup table for the exponent, but I use this version because it is easily adapted to a SIMD workflow.

Limitations of the implementation:

  • Overflowing values that cannot be represented in float16 will give undefined values.
  • Underflowing values will return an undefined value below 2^-14 instead of zero (the exponent field is zeroed, but the mantissa bits are kept).
  • Denormals will give undefined values.
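One way to avoid those undefined results, at the cost of a few branches, is to clamp the exponent explicitly. What follows is only a minimal sketch, assuming the input is an IEEE 754 binary32 value. The function name float_to_half_clamped and the policy it applies (flush anything below the smallest normal float16 to signed zero, saturate overflow to infinity, map NaN to a quiet NaN, truncate the mantissa) are choices made for this example and are not part of the code above.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: float32 -> float16 bit pattern with explicit clamping. */
uint16_t float_to_half_clamped(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);           /* assumes float is IEEE 754 binary32 */

    uint16_t sign  = (uint16_t)((bits >> 16) & 0x8000u);
    uint32_t exp32 = (bits >> 23) & 0xffu;    /* biased float32 exponent */
    uint32_t man32 = bits & 0x007fffffu;      /* 23-bit mantissa */

    if (exp32 == 0xffu)                       /* infinity or NaN */
        return (uint16_t)(sign | 0x7c00u | (man32 ? 0x0200u : 0u));
    if (exp32 <= 0x70u)                       /* below the smallest normal float16 */
        return sign;                          /* flush to signed zero */
    if (exp32 >= 0x8fu)                       /* rebiased exponent would be >= 31 */
        return (uint16_t)(sign | 0x7c00u);    /* saturate to infinity */

    /* Normal range: rebias by 0x70 and keep the top 10 mantissa bits. */
    return (uint16_t)(sign | ((exp32 - 0x70u) << 10) | (man32 >> 13));
}
```

With this sketch, 1.0f maps to 0x3c00, 65504.0f (the largest finite float16) maps to 0x7bff, and 1.0e10f saturates to 0x7c00 (infinity). The branches make it less suited to SIMD than the branch-free version.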

Be careful with denormals. If your architecture uses them, they may slow down your program tremendously.

The exponents in your float32 and float16 representations are probably biased, and biased differently. You need to unbias the exponent you got from the float32 representation to get the actual exponent, and then bias it again for the float16 representation.
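As a rough illustration of that unbias/rebias step, here is a small sketch. It assumes the IEEE 754 layouts (binary32: 1 sign bit, 8 exponent bits, 23 mantissa bits, bias 127; binary16: 1 sign bit, 5 exponent bits, 10 mantissa bits, bias 15). The helper name rebias_exponent is made up for this example, and it only reports whether the exponent fits instead of handling the out-of-range cases.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns the float16 exponent field for f,
   or -1 if the exponent does not fit a normal float16. */
int rebias_exponent(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);             /* assumes float is IEEE 754 binary32 */

    int biased32 = (int)((bits >> 23) & 0xffu); /* 8-bit exponent field */
    int actual   = biased32 - 127;              /* remove the float32 bias */
    int biased16 = actual + 15;                 /* apply the float16 bias */

    if (biased16 < 1 || biased16 > 30)          /* fields 0 and 31 are reserved */
        return -1;
    return biased16;
}
```

For 1.0f the float32 field is 127, the actual exponent is 0, and the float16 field comes out as 15. For 65536.0f the actual exponent is 16, which would need a field of 31, so the sketch returns -1.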

Apart from this detail, I do think it's as simple as that, but I still get surprised by floating-point representations from time to time.

EDIT:

  1. While you're adjusting the exponents, check for overflow as well.

  2. Your algorithm truncates the last bits of the mantissa a little abruptly. That may be acceptable, but you may want to implement, say, round-to-nearest by looking at the bits that are about to be discarded: "0..." -> round down, "100..001..." -> round up, "100..00" -> round to even.
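The rounding rule in item 2 might look like the following in C. This is a sketch, not a complete converter: it assumes IEEE 754 binary32 input whose magnitude already lies in the normal float16 range (the overflow check from item 1 is left to the caller), and the name float_to_half_rne is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: round-to-nearest-even conversion.
   Only valid when |f| is in the normal float16 range (caller checks this). */
uint16_t float_to_half_rne(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);        /* assumes float is IEEE 754 binary32 */

    uint16_t sign = (uint16_t)((bits >> 16) & 0x8000u);
    /* Rebias the exponent (127 - 15 = 112) while it still sits at bit 23. */
    uint32_t mag = (bits & 0x7fffffffu) - (112u << 23);

    uint32_t kept      = mag >> 13;        /* exponent plus 10 mantissa bits */
    uint32_t discarded = mag & 0x1fffu;    /* the 13 bits about to be dropped */

    if (discarded > 0x1000u)               /* above the halfway point: round up */
        kept += 1;
    else if (discarded == 0x1000u)         /* exactly halfway: round to even */
        kept += kept & 1u;
    /* discarded < 0x1000 (leading bit 0): round down, nothing to add */

    return (uint16_t)(sign | kept);
}
```

Because exponent and mantissa are rounded as one integer, a mantissa that rounds up past 0x3ff carries into the exponent on its own. As a check, 1.00048828125f (exactly halfway between two float16 values) rounds to the even pattern 0x3c00, while 1.00146484375f (also a tie) rounds up to 0x3c02.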

Here's the link to an article on IEEE 754, which gives the bit layouts and biases.

http://en.wikipedia.org/wiki/IEEE_754-2008
