Integer Conversion in Floating Point Arithmetic

I currently face the following dilemma:

1.0f * INT32_MAX != INT32_MAX

Evaluating 1.0f * INT32_MAX and storing the result in an int32_t actually gives me INT32_MIN.

I'm not completely surprised by this; I know floating point to integer conversions aren't always exact.

What is the best way to fix this problem?

The code I'm writing scales an array of floating point values from the range -1.0f <= x <= 1.0f to the range INT32_MIN <= x <= INT32_MAX.

Here's what the code looks like:

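#include <stddef.h>
#include <stdint.h>

void convert(int32_t * dst, const float * src, size_t count){
    size_t i = 0;
    for (i = 0; i < count; i++){
        /* problem: for src[i] == 1.0f this stores INT32_MIN, not INT32_MAX */
        dst[i] = src[i] * INT32_MAX;
    }
}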

Here's what I ended up with:

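void convert(int32_t * dst, const float * src, size_t count){
    size_t i = 0;
    for (i = 0; i < count; i++){
        /* scale in double, which can hold every int32_t value exactly */
        double tmp = src[i];
        if (src[i] > 0.0f){
            tmp *= INT32_MAX;   /* 1.0f maps to INT32_MAX */
        } else {
            tmp *= INT32_MIN;   /* -1.0f becomes 2147483648.0 ... */
            tmp *= -1.0;        /* ... and the sign flip gives INT32_MIN */
        }
        dst[i] = tmp;
    }
}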

In IEEE 754, 2147483647 is not representable in a single precision float, which has only a 24-bit significand. A quick test shows that the result of 1.0f * INT32_MAX is rounded to 2147483648.0f, which can't be represented in a 32-bit int. Converting an out-of-range floating point value to an integer type is undefined behavior in C, so INT32_MIN is simply what your platform happens to produce.
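The rounding can be checked without ever touching the problematic integer conversion. The following is a minimal sketch, assuming IEEE 754 single and double precision; the helper names are made up for illustration and are not part of the original post.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: INT32_MAX after rounding to single precision,
   widened to double so the value can be inspected exactly. */
double int32_max_as_float(void) {
    float f = (float)INT32_MAX;   /* nearest float is 2^31 */
    return (double)f;
}

/* Hypothetical helper: INT32_MAX stored in a double (no rounding needed). */
double int32_max_as_double(void) {
    return (double)INT32_MAX;
}

/* Self-check, valid under the IEEE 754 assumption. */
void check_rounding(void) {
    assert(int32_max_as_float() == 2147483648.0);
    assert(int32_max_as_double() == 2147483647.0);
    assert(int32_max_as_float() - int32_max_as_double() == 1.0);
}
```

Calling check_rounding() should pass silently on such a platform: the float value overshoots INT32_MAX by exactly 1, while the double holds it unchanged.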

In other words, it is actually the conversion to int that causes the problem, not the float calculation, which happens to be off by only 1!
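Since the conversion is the risky step, one option is to test the range before converting. This is only a sketch with hypothetical helper names (fits_in_int32 and checked_float_to_int32 are not from the original post); it relies on both bounds being powers of two, which are exactly representable as float.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if converting x to int32_t is well defined.
   The comparisons are false for NaN, so NaN is rejected as well. */
bool fits_in_int32(float x) {
    return x >= -2147483648.0f && x < 2147483648.0f;
}

/* Hypothetical helper: converts only when it is safe to do so.
   Returns false and leaves *out untouched otherwise. */
bool checked_float_to_int32(float x, int32_t *out) {
    assert(out != NULL);
    if (!fits_in_int32(x)) {
        return false;
    }
    *out = (int32_t)x;   /* in range, truncates toward zero */
    return true;
}
```

With this check, the value produced by 1.0f * INT32_MAX is rejected instead of being converted, which makes the failure visible rather than silent.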

Anyway, the solution is to use double for the intermediate calculation. 2147483647.0 is exactly representable as a double precision number.
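A minimal sketch of the double-based approach follows. The names scale_sample and convert_clamped, the clamping of out-of-range input, and the choice to map NaN to 0 are assumptions added for illustration; the original answer only says to use double for the intermediate value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: scale one sample from [-1, 1] to the int32_t range. */
int32_t scale_sample(float x) {
    double tmp = (double)x * INT32_MAX;   /* INT32_MAX is exact as a double */
    if (tmp != tmp) {
        return 0;                         /* NaN input: assumed policy */
    }
    if (tmp >= (double)INT32_MAX) {
        return INT32_MAX;                 /* clamp input above 1.0f */
    }
    if (tmp <= (double)INT32_MIN) {
        return INT32_MIN;                 /* clamp input far below -1.0f */
    }
    return (int32_t)tmp;                  /* in range, truncates toward zero */
}

/* Hypothetical variant of convert() built on scale_sample(). */
void convert_clamped(int32_t *dst, const float *src, size_t count) {
    assert(count == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < count; i++) {
        dst[i] = scale_sample(src[i]);
    }
}
```

Note that this sketch scales symmetrically, so -1.0f maps to -INT32_MAX rather than INT32_MIN. If the full negative range is needed, the branch on the sign used in the question's final version is one way to get it.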
