Packing 32-bit floats into 30 bits (C++)

Question

Here are the goals I'm trying to achieve:

I need to pack 32-bit IEEE floats into 30 bits.
I want to do this by decreasing the size of the mantissa by 2 bits.
The operation itself should be as fast as possible.
I'm aware that some precision will be lost, and this is acceptable.
It would be an advantage if this operation did not ruin special cases like SNaN, QNaN, infinities, etc. But I'm ready to sacrifice this for speed.

I guess this question consists of two parts:

1) Can I simply clear the least significant bits of the mantissa? I've tried this, and so far it works, but maybe I'm asking for trouble... Something like:

float f;
int packed = (*(int*)&f) & ~3;
// later
f = *(float*)&packed;

2) If there are cases where 1) will fail, then what would be the fastest way to achieve this?

Thanks in advance

Answer 1

You actually violate the strict aliasing rules (section 3.10 of the C++ standard) with these reinterpret casts. This will probably blow up in your face when you turn on compiler optimizations.

C++ standard, section 3.10 paragraph 15 says:

If a program attempts to access the stored value of an object through an lvalue of other than one of the following types the behavior is undefined:

the dynamic type of the object,

a cv-qualified version of the dynamic type of the object,

a type similar to the dynamic type of the object,

a type that is the signed or unsigned type corresponding to the dynamic type of the object,

a type that is the signed or unsigned type corresponding to a cv-qualified version of the dynamic type of the object,

an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union),

a type that is a (possibly cv-qualified) base class type of the dynamic type of the object,

a char or unsigned char type.

Specifically, 3.10/15 doesn't allow us to access a float object via an lvalue of type unsigned int. I actually got bitten by this myself: a program I wrote stopped working after I turned on optimizations. Apparently, GCC didn't expect an lvalue of type float to alias an lvalue of type int, which is a fair assumption under 3.10/15. The optimizer reordered the instructions under the as-if rule, exploiting 3.10/15, and the program broke.

Under the following assumptions

float really corresponds to a 32-bit IEEE float,
sizeof(float)==sizeof(int)
unsigned int has no padding bits or trap representations

you should be able to do it like this:

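#include <cstring>  // std::memcpy

/// returns a 30 bit number (the two lowest mantissa bits are dropped)
unsigned int pack_float(float x) {
    unsigned r;
    std::memcpy(&r,&x,sizeof r);  // copy the bits, no aliasing violation
    return r >> 2;
}

/// expects a 30 bit number as produced by pack_float
float unpack_float(unsigned int x) {
    x <<= 2;  // the two dropped mantissa bits come back as zeros
    float r;
    std::memcpy(&r,&x,sizeof r);
    return r;
}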

This doesn't suffer from the "3.10 violation" and is typically very fast. At least GCC treats memcpy as an intrinsic function. If you don't need the functions to work with NaNs, infinities, or numbers of extremely high magnitude, you can even improve accuracy by replacing "r >> 2" with "(r+1) >> 2":

unsigned int pack_float(float x) {
    unsigned r;
    std::memcpy(&r,&x,sizeof r);
    return (r+1) >> 2;
}

This works even if it changes the exponent due to a mantissa overflow, because the IEEE-754 encoding maps consecutive floating-point values to consecutive integers (ignoring +/- zero). This mapping actually approximates a logarithm quite well.
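To make the carry into the exponent concrete, here is a minimal sketch that compares truncation with the "(r+1) >> 2" variant on the largest float below 2.0. It relies on the same assumptions as above (32-bit IEEE float); the helper names float_to_bits, bits_to_float, pack_truncate, pack_round and example_rounding are made up for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes a 32-bit float");

// Copy the float's bits into an integer (no aliasing violation).
inline std::uint32_t float_to_bits(float x) {
    std::uint32_t r;
    std::memcpy(&r, &x, sizeof r);
    return r;
}

// Copy an integer bit pattern back into a float.
inline float bits_to_float(std::uint32_t r) {
    float x;
    std::memcpy(&x, &r, sizeof x);
    return x;
}

// Drop the two lowest mantissa bits.
inline std::uint32_t pack_truncate(float x) { return float_to_bits(x) >> 2; }

// Add 1 before dropping the bits; not valid for NaNs, infinities or huge values.
inline std::uint32_t pack_round(float x) { return (float_to_bits(x) + 1u) >> 2; }

inline float unpack_float(std::uint32_t packed) {
    assert(packed < (1u << 30));  // only 30 bits may be set
    return bits_to_float(packed << 2);
}

inline void example_rounding() {
    // Adjacent floats have adjacent bit patterns.
    assert(float_to_bits(std::nextafter(1.0f, 2.0f)) == float_to_bits(1.0f) + 1u);

    // 0x3FFFFFFF is the largest float below 2.0f (mantissa all ones).
    float below_two = bits_to_float(0x3FFFFFFFu);
    // Truncation moves the value down by three steps of the original float...
    assert(unpack_float(pack_truncate(below_two)) < below_two);
    // ...while the +1 variant carries into the exponent and lands on 2.0f.
    assert(unpack_float(pack_round(below_two)) == 2.0f);
}
```

Calling example_rounding() runs the checks. Values whose two lowest bits are already zero, such as 1.0f, round-trip exactly with either variant.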

Answer 2

Blindly dropping the 2 LSBs of the float may fail for a small number of unusual NaN encodings.

A NaN is encoded as exponent=255, mantissa!=0, but IEEE-754 doesn't say anything about which mantissa values should be used. If the mantissa value is <= 3, you could turn a NaN into an infinity!
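The failure case can be checked directly on the bit patterns. The following is only a sketch for 32-bit IEEE floats; is_nan_bits, is_inf_bits, drop_two_lsbs and example_nan_collapse are names invented here, and the checks work on integers so that no NaN has to pass through a floating-point register.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kExponentMask = 0x7F800000u;  // 8 exponent bits
constexpr std::uint32_t kMantissaMask = 0x007FFFFFu;  // 23 mantissa bits

// NaN: exponent all ones, mantissa non-zero.
constexpr bool is_nan_bits(std::uint32_t b) {
    return (b & kExponentMask) == kExponentMask && (b & kMantissaMask) != 0;
}

// Infinity: exponent all ones, mantissa zero.
constexpr bool is_inf_bits(std::uint32_t b) {
    return (b & kExponentMask) == kExponentMask && (b & kMantissaMask) == 0;
}

// What survives of a bit pattern after packing to 30 bits and unpacking again.
constexpr std::uint32_t drop_two_lsbs(std::uint32_t b) {
    return b & ~std::uint32_t{3};
}

inline void example_nan_collapse() {
    // NaNs with mantissa 1, 2 or 3 turn into infinity.
    for (std::uint32_t mantissa = 1; mantissa <= 3; ++mantissa) {
        std::uint32_t nan_bits = kExponentMask | mantissa;
        assert(is_nan_bits(nan_bits));
        assert(is_inf_bits(drop_two_lsbs(nan_bits)));
    }
    // A NaN with any higher mantissa bit set (here 0x7FC00000) stays a NaN.
    assert(is_nan_bits(drop_two_lsbs(0x7FC00000u)));
}
```

Only three mantissa values per sign are affected, which is why simple tests with ordinary numbers do not reveal the problem.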

Answer 3

You should encapsulate it in a struct, so that you don't accidentally mix up the tagged float with a regular "unsigned int":

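#include <cstring>
#include <iostream>
using namespace std;

struct TypedFloat {
    private:
        union {
            unsigned int raw : 32;
            struct {
                unsigned int num  : 30;
                unsigned int type : 2;
            };
        };
    public:

        TypedFloat(unsigned int type=0) : num(0), type(type) {}

        operator float() const {
            // cast first so the shift is done on unsigned int
            unsigned int tmp = static_cast<unsigned int>(num) << 2;
            // copy the bits with memcpy instead of a reinterpret_cast,
            // which would violate the aliasing rules
            float result;
            std::memcpy(&result, &tmp, sizeof result);
            return result;
        }
        void operator=(float newnum) {
            unsigned int bits;
            std::memcpy(&bits, &newnum, sizeof bits);
            num = bits >> 2;
        }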
        unsigned int getType() const {
            return type;
        }
        void setType(unsigned int type) {
            this->type = type;
        }
};

int main() { 
    const unsigned int TYPE_A = 1;
    TypedFloat a(TYPE_A);

    a = 3.4;
    cout << a + 5.4 << endl;
    float b = a;
    cout << a << endl;
    cout << b << endl;
    cout << a.getType() << endl;
    return 0;
}

I can't guarantee its portability though.

Answer 4

How much precision do you need? If a 16-bit float is enough (it is for some types of graphics), then ILM's 16-bit float ("half"), part of OpenEXR (http://www.openexr.com/), is great and obeys all kinds of rules, and you'll have plenty of space left over after you pack it into a struct.

On the other hand, if you know the approximate range of values they're going to take, you should consider fixed point. Fixed-point numbers are more useful than most people realize.
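As a rough illustration of the fixed-point suggestion, the sketch below maps a value from a known range onto a 30-bit unsigned code and back. The range [-1000, 1000] and the names kLo, kHi, kMaxCode, to_fixed30 and from_fixed30 are assumptions made up for this example; pick the range to match your data.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical value range; everything outside it cannot be represented.
constexpr double kLo = -1000.0;
constexpr double kHi = 1000.0;
constexpr std::uint32_t kMaxCode = (1u << 30) - 1;  // largest 30-bit code

// Map x in [kLo, kHi] linearly onto 0..kMaxCode, rounding to the nearest code.
inline std::uint32_t to_fixed30(float x) {
    assert(x >= kLo && x <= kHi);
    double t = (static_cast<double>(x) - kLo) / (kHi - kLo);  // 0.0 .. 1.0
    return static_cast<std::uint32_t>(std::llround(t * kMaxCode));
}

// Inverse mapping from a 30-bit code back to a float.
inline float from_fixed30(std::uint32_t code) {
    assert(code <= kMaxCode);
    double t = static_cast<double>(code) / kMaxCode;
    return static_cast<float>(kLo + t * (kHi - kLo));
}
```

With this range the step between two codes is about 1.9e-6 everywhere, so the absolute error is uniform. A float, in contrast, is finer near zero and coarser near the ends of the range.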

Answer 5

I can't select any of the answers as the definitive one, because most of them contain valid information but not quite what I was looking for. So I'll just summarize my conclusions.

The conversion method I posted in part 1) of my question is clearly wrong according to the C++ standard, so other methods of extracting a float's bits should be used.

And most importantly: as far as I understand from the responses and other sources on IEEE 754 floats, it's OK to drop the least significant bits of the mantissa. It mostly affects only precision, with one exception: NaNs whose mantissa is <= 3. A NaN is represented by an exponent of 255 and a mantissa != 0, so when only the two lowest mantissa bits are set, dropping them converts the NaN to +/-Infinity. On most hardware such values are signaling NaNs (the quiet bit is the top mantissa bit), and since sNaNs are not generated by floating-point operations on the CPU, it's safe in a controlled environment.
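If that one exception matters, a possible workaround is to force a mantissa bit back on whenever a NaN would otherwise lose all of its mantissa bits. This is only a sketch under the same 32-bit IEEE assumptions; it costs an extra test per value, and the names pack_bits_keep_nan, pack_float_keep_nan, unpack_float30 and example_keep_nan are invented for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes a 32-bit float");

// Pack a raw 32-bit float pattern into 30 bits without turning a NaN into infinity.
inline std::uint32_t pack_bits_keep_nan(std::uint32_t r) {
    std::uint32_t packed = r >> 2;
    // NaN: exponent all ones and mantissa non-zero.
    bool is_nan = (r & 0x7F800000u) == 0x7F800000u && (r & 0x007FFFFFu) != 0;
    // After the shift the mantissa occupies the low 21 bits. If they are all
    // zero, the NaN would read back as infinity, so set the lowest bit again.
    if (is_nan && (packed & 0x001FFFFFu) == 0) {
        packed |= 1u;
    }
    return packed;
}

inline std::uint32_t pack_float_keep_nan(float x) {
    std::uint32_t r;
    std::memcpy(&r, &x, sizeof r);
    return pack_bits_keep_nan(r);
}

inline float unpack_float30(std::uint32_t packed) {
    assert(packed < (1u << 30));  // only 30 bits may be set
    std::uint32_t r = packed << 2;
    float x;
    std::memcpy(&x, &r, sizeof x);
    return x;
}

inline void example_keep_nan() {
    // 0x7F800001 is a NaN whose only mantissa bit is the lowest one.
    assert(pack_bits_keep_nan(0x7F800001u) == 0x1FE00001u);
    // Infinity itself is left alone.
    assert(pack_bits_keep_nan(0x7F800000u) == 0x1FE00000u);
    float inf = std::numeric_limits<float>::infinity();
    assert(std::isinf(unpack_float30(pack_float_keep_nan(inf))));
}
```

The repaired NaN comes back with a different payload (mantissa 4 instead of 1, 2 or 3), which is acceptable as long as only "is it a NaN" matters.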

5 answers

Answer 1  10  2010-10-02 15:52:04

Answer 2  8  2010-10-02 16:14:06

Answer 3  2  2010-10-02 17:14:42

Answer 4  2  2011-11-02 23:38:05

Answer 5  1  accepted  2010-10-11 15:00:01
