Fast float quantize, scaled by precision?
Since float precision decreases for larger values, in some cases it may be useful to quantize a value relative to its magnitude, instead of quantizing by an absolute step.
A naive approach could be to detect the precision and scale it up:
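    #include <math.h>

    float quantize(float value, float quantize_scale) {
        /* Gap to the next representable float above |value| (one ULP), scaled up. */
        float factor = (nextafterf(fabsf(value), INFINITY) - fabsf(value)) * quantize_scale;
        return floorf((value / factor) + 0.5f) * factor;
    }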
However, this seems too heavy.
Instead, it should be possible to mask out bits in the float's mantissa to simulate something like casting to a 16-bit float and back, for example.
Not being an expert in float bit twiddling, I couldn't say whether the resulting float would be valid (or would need normalizing).
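As a rough sketch of the masking idea, assuming `float` is an IEEE 754 binary32 value with 23 stored mantissa bits: the helper name `quantize_mask` and its `keep_bits` parameter are made up for illustration. Clearing low mantissa bits of a finite value yields another valid float with no renormalizing needed, but it truncates toward zero rather than rounding, and a NaN whose payload sits only in the cleared bits would turn into an infinity.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: keep the top keep_bits of the 23 stored mantissa
   bits and zero the rest. Assumes IEEE 754 binary32 floats.
   Truncates toward zero; it does not round to nearest. */
float quantize_mask(float value, int keep_bits)
{
    uint32_t bits;
    uint32_t mask;

    if (keep_bits < 0)  keep_bits = 0;
    if (keep_bits > 23) keep_bits = 23;

    /* All ones except the low (23 - keep_bits) mantissa bits. */
    mask = ~((UINT32_C(1) << (23 - keep_bits)) - 1u);

    memcpy(&bits, &value, sizeof bits);   /* copy out the bit pattern */
    bits &= mask;                         /* sign and exponent are untouched */
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

With `keep_bits` set to 10 the mantissa width matches that of a 16-bit half float, although the exponent range stays that of a 32-bit float.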
When exact rounding behavior isn't important, what is a fast way to quantize floats that takes their magnitude into account?
The Veltkamp-Dekker splitting algorithm will split a floating-point number into high and low parts. Sample code is below.
If there are s bits in the significand (53 in IEEE 754 64-bit binary), and the value `Scale` in the code below is 2^b, then `*x0` receives the high s - b bits of `x`, and `*x1` receives the remaining bits, which you may discard (or remove from the code below, so it is never calculated). If b is known at compile time, e.g., the constant 43, you can replace `Scale` with the appropriate constant, such as `0x1p43`. Otherwise, you must produce 2^b in some way.
This requires round-to-nearest mode. IEEE 754 arithmetic suffices, but other reasonable arithmetic may be okay too. It rounds ties to even.
This assumes that `x * (Scale + 1)` does not overflow. The operations must be evaluated in the same precision as the value being separated. (`double` for `double`, `float` for `float`, and so on. If the compiler evaluates `float` expressions with `double`, this would break. A workaround would be to convert the inputs to the widest floating-point type supported, perform the split in that type [with `Scale` adjusted correspondingly], and then convert back.)
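To illustrate the precision caveat, here is a hedged sketch of a `float` version of the high-part computation. The name `split_high_float` and the choice b = 12 are assumptions made for illustration. The `volatile` qualifiers are one common way to push a compiler into rounding each intermediate to `float`; whether that is necessary, or sufficient, depends on the compiler and target.

```c
/* Example scale: 2^b with b = 12, so the result keeps the high
   24 - 12 = 12 significand bits of a float. Illustrative value only. */
#define SCALE_F 0x1p12f

/* Hypothetical float variant; returns only the high part.
   Assumes round-to-nearest and that x * (SCALE_F + 1) does not overflow. */
float split_high_float(float x)
{
    /* Each volatile store asks the compiler to round the intermediate
       to float instead of carrying it in a wider register. */
    volatile float d = x * (SCALE_F + 1.0f);
    volatile float t = d - x;
    volatile float hi = d - t;
    return hi;
}
```

On targets where `float` expressions are already evaluated in `float` (`FLT_EVAL_METHOD` equal to 0 in `<float.h>`), the `volatile` qualifiers should be unnecessary and can be dropped for speed.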
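    static const double Scale = 0x1p43;  /* 2^b; example value for b = 43 */

    void Split(double *x0, double *x1, double x)
    {
        double d = x * (Scale + 1);
        double t = d - x;
        *x0 = d - t;      /* high s - b bits of x */
        *x1 = x - *x0;    /* remaining low bits */
    }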
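When b is only known at run time, one way to produce 2^b is `ldexp` from `<math.h>`. The wrapper below is a minimal sketch and not part of the original answer: the name `quantize_double`, the accepted range for `b`, and the choice to return `x` unchanged when the multiplication could overflow are all assumptions. It computes only the high part and skips the low part entirely.

```c
#include <float.h>
#include <math.h>

/* Hypothetical wrapper: round x to its high (53 - b) significand bits.
   Assumes IEEE 754 doubles, round-to-nearest mode, and double expressions
   evaluated in double precision. */
double quantize_double(double x, int b)
{
    double scale, d, t;

    /* Conservative choice for this sketch: outside 2..51, do nothing. */
    if (b < 2 || b > 51)
        return x;

    scale = ldexp(1.0, b);   /* 2^b, exactly representable */

    /* Return x unchanged if x * (scale + 1) could overflow.
       The negated comparison also passes infinities and NaNs through. */
    if (!(fabs(x) < DBL_MAX / (scale + 1.0)))
        return x;

    d = x * (scale + 1.0);
    t = d - x;
    return d - t;            /* high part; the low part is never computed */
}
```

For example, `quantize_double(3.141592653589793, 43)` keeps 10 significand bits and returns 3.140625.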