Fast float quantize, scaled by precision?

Question

Since float precision decreases for larger values, in some cases it may be useful to quantize a value relative to its magnitude, instead of quantizing by an absolute step.

A naive approach could be to detect the precision and scale it up:

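#include <math.h>

float quantize(float value, float quantize_scale) {
    /* Spacing to the next representable float above |value| (one ulp), scaled up.
       nextafterf takes two arguments: the start value and the direction. */
    float factor = (nextafterf(fabsf(value), INFINITY) - fabsf(value)) * quantize_scale;
    /* Round value to the nearest multiple of factor. */
    return floorf((value / factor) + 0.5f) * factor;
}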

However, this seems too heavy.

Instead, it should be possible to mask out bits in the float's mantissa to simulate something like casting to a 16-bit float and back, for example.

Not being an expert in float bit twiddling, I couldn't say whether the resulting float would be valid (or would need normalizing).

For speed, when the exact rounding behavior isn't important, what is a fast way to quantize floats, taking their magnitude into account?
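To make the masking idea concrete, here is a minimal sketch, assuming float is an IEEE 754 32-bit binary float with 23 explicit mantissa bits. The name quantize_mask and the keep_bits parameter are hypothetical, not from any library. Clearing low mantissa bits leaves the sign and exponent fields untouched, so a finite input yields a valid float that needs no renormalizing; the value is simply truncated toward zero.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: keep only the top keep_bits explicit mantissa bits
   of a 32-bit IEEE 754 float (0 <= keep_bits <= 23). Truncates toward zero. */
static float quantize_mask(float value, int keep_bits)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);          /* well-defined way to read the bit pattern */
    uint32_t drop = 23u - (uint32_t)keep_bits;   /* number of low mantissa bits to clear */
    bits &= ~((UINT32_C(1) << drop) - 1u);       /* sign and exponent bits are preserved */
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

With keep_bits set to 10 this matches the mantissa width of a 16-bit float, though not its exponent range or its rounding. One caveat: a NaN whose payload lies entirely in the cleared bits would turn into an infinity.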

Answer 1

The Veltkamp-Dekker splitting algorithm will split a floating-point number into high and low parts. Sample code is below.

If there are s bits in the significand (53 in IEEE 754 64-bit binary), and the value Scale in the code below is 2^b, then *x0 receives the high s - b bits of x, and *x1 receives the remaining bits, which you may discard (or remove from the code below, so it is never calculated). If b is known at compile time, e.g., the constant 43, you can replace Scale with the appropriate constant, such as 0x1p43. Otherwise, you must produce 2^b in some way.

This requires round-to-nearest mode. IEEE 754 arithmetic suffices, but other reasonable arithmetic may be okay too. It rounds ties to even.

This assumes that x * (Scale + 1) does not overflow. The operations must be evaluated in the same precision as the value being separated (double for double, float for float, and so on). If the compiler evaluates float expressions with double, this would break. A workaround would be to convert the inputs to the widest floating-point type supported, perform the split in that type (with Scale adjusted correspondingly), and then convert back.

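/* Scale must be defined as 2^b, e.g. #define Scale 0x1p43 */
void Split(double *x0, double *x1, double x)
{
    double d = x * (Scale + 1);  /* must not overflow */
    double t = d - x;
    *x0 = d - t;                 /* high s - b bits of x, rounded to nearest */
    *x1 = x - *x0;               /* remaining low bits; drop this line if unused */
}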
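As a sketch of the widening workaround described above, applied to the float case from the question: the split can be done in double, with Scale enlarged by the 53 - 24 = 29 extra significand bits that double carries. The name quantize_split and its b parameter are hypothetical; the code assumes IEEE 754 binary32 and binary64, round-to-nearest mode, and 0 <= b < 24.

```c
#include <stdint.h>

/* Hypothetical helper: round a float to its top (24 - b) significand bits
   by performing the Veltkamp-Dekker split in double. */
static float quantize_split(float value, int b)
{
    double x = value;                                   /* float to double is exact */
    double scale = (double)(UINT64_C(1) << (b + 29));   /* 2^(b + 29), exactly representable */
    double d = x * (scale + 1.0);
    double t = d - x;
    return (float)(d - t);                              /* high part has at most 24 - b bits */
}
```

Since the largest float is below 2^128 and the scale is at most 2^52, the product cannot overflow a double, so the overflow assumption holds for every finite float input. Unlike bit masking, this rounds to nearest rather than truncating; for example, with b = 13 (11 significand bits, as in a 16-bit float) 1000.7f becomes 1000.5f.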

Answer 1
1 2018-02-09 04:54:58
