
Fast float quantize, scaled by precision?

Since float precision decreases for larger values, in some cases it may be useful to quantize a value relative to its magnitude instead of by an absolute step.

A naive approach could be to detect the precision and scale it up:

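#include <math.h>

float quantize(float value, float quantize_scale) {
    /* Gap between |value| and the next float above it (1 ULP), scaled up.
       nextafterf takes two arguments: the start value and a direction. */
    float magnitude = fabsf(value);
    float factor = (nextafterf(magnitude, INFINITY) - magnitude) * quantize_scale;
    return floorf((value / factor) + 0.5f) * factor;
}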

However, this seems too heavy.

Instead, it should be possible to mask out bits in the float's mantissa to simulate something like casting to a 16-bit float and back, for example.

Not being an expert in float bit twiddling, I couldn't say whether the resulting float would be valid (or would need normalizing).
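For illustration, the masking idea might look like the sketch below. It assumes IEEE 754 binary32 floats; quantize_mask and drop_bits are hypothetical names, not part of any library. Clearing low mantissa bits leaves the sign and exponent fields untouched, so for finite inputs the result is still a well-formed float that needs no renormalizing; it is simply truncated toward zero rather than rounded.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: clear the lowest drop_bits bits of the 23-bit
   stored mantissa of an IEEE 754 binary32 value (0 <= drop_bits <= 23).
   The result is truncated toward zero, not rounded to nearest. */
static float quantize_mask(float value, unsigned drop_bits)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);         /* well-defined type pun */
    bits &= ~((UINT32_C(1) << drop_bits) - 1);  /* zero the low mantissa bits */
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

With drop_bits = 13, ten stored mantissa bits remain, which matches the mantissa width of a 16-bit half float (though not its exponent range). One caveat: a NaN whose only set mantissa bits are among those cleared would turn into an infinity.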


For speed, when the exact rounding behavior isn't important, what is a fast way to quantize floats, taking their magnitude into account?

The Veltkamp-Dekker splitting algorithm will split a floating-point number into high and low parts. Sample code is below.

If there are s bits in the significand (53 in IEEE 754 64-bit binary), and the value Scale in the code below is 2^b, then *x0 receives the high s - b bits of x, and *x1 receives the remaining bits, which you may discard (or remove from the code below, so it is never calculated). If b is known at compile time, e.g., the constant 43, you can replace Scale with the appropriate constant, such as 0x1p43. Otherwise, you must produce 2^b in some way.

This requires round-to-nearest mode. IEEE 754 arithmetic suffices, but other reasonable arithmetic may be okay too. It rounds ties to even.

This assumes that x * (Scale + 1) does not overflow. The operations must be evaluated in the same precision as the value being separated (double for double, float for float, and so on). If the compiler evaluates float expressions with double, this would break. A workaround would be to convert the inputs to the widest floating-point type supported, perform the split in that type (with Scale adjusted correspondingly), and then convert back.
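As a rough sketch of that workaround for float inputs, the split can be done in double with the scale widened to match. The helper name quantize_float_via_double and the parameter b are hypothetical, and the code assumes IEEE 754 binary32/binary64, round-to-nearest, and that double expressions really are evaluated in double. Since double has 53 significand bits, keeping the top 24 - b bits of a float means dropping 53 - (24 - b) = 29 + b bits.

```c
#include <math.h>

/* Hypothetical helper: round a float to its (24 - b) most significant
   bits by doing the split in double. Assumes 1 <= b <= 22. */
static float quantize_float_via_double(float x, int b)
{
    double scale = ldexp(1.0, 29 + b);   /* adjusted scale: 2^(29 + b) */
    double xd = x;
    /* volatile forces the product to be rounded to double before use, so
       the multiply cannot be fused with the subtractions into an FMA;
       it can be dropped if contraction is off (e.g. -ffp-contract=off). */
    volatile double d = xd * (scale + 1.0);
    double t = d - xd;
    return (float)(d - t);   /* fits in 24 bits, so this conversion is exact */
}
```

For example, with b = 13 the value 1.2349f comes back as 1.2353515625f, the nearest value with 11 significant bits.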

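/* Scale must be 2^b; b = 43 is only the example value from the text. */
static const double Scale = 0x1p43;

void Split(double *x0, double *x1, double x)
{
    double d = x * (Scale + 1);
    double t = d - x;
    *x0 = d - t;        /* high s - b bits of x */
    *x1 = x - *x0;      /* remaining low bits; drop this line if unused */
}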
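When b is only known at run time, one way to produce 2^b is the standard ldexp function. The sketch below uses the same steps as Split but computes only the high part; quantize_double is a hypothetical name, and the code assumes IEEE 754 binary64, round-to-nearest, and no overflow in x * (scale + 1).

```c
#include <math.h>

/* Hypothetical helper: return x rounded to its (53 - b) most
   significant bits. Assumes 1 <= b <= 51. */
static double quantize_double(double x, int b)
{
    double scale = ldexp(1.0, b);   /* 2^b, computed exactly */
    /* volatile forces the product to be rounded to double before use, so
       the multiply cannot be fused with the subtractions into an FMA;
       it can be dropped if contraction is off (e.g. -ffp-contract=off). */
    volatile double d = x * (scale + 1.0);
    double t = d - x;
    return d - t;                   /* high part only; low part never computed */
}
```

Because the step size follows the exponent of x, scaling the input by a power of two scales the result by the same factor: quantize_double(1.2349, 43) gives 1.234375, and quantize_double(ldexp(1.2349, 20), 43) gives ldexp(1.234375, 20).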
