
Integer Conversion in Floating Point Arithmetic

I currently face the following dilemma:

1.0f * INT32_MAX != INT32_MAX

Evaluating 1.0f * INT32_MAX actually gives me INT32_MIN

I'm not completely surprised by this; I know floating-point to integer conversions aren't always exact.

What is the best way to fix this problem?

The code I'm writing scales an array of floating-point samples from the range -1.0f <= x <= 1.0f to the range INT32_MIN <= x <= INT32_MAX.

Here's what the code looks like:

#include <stdint.h>
#include <stddef.h>

void convert(int32_t * dst, const float * src, size_t count){
    size_t i = 0;
    for (i = 0; i < count; i++){
        dst[i] = src[i] * INT32_MAX;
    }
}

Here's what I ended up with:

#include <stdint.h>
#include <stddef.h>

void convert(int32_t * dst, const float * src, size_t count){
    size_t i = 0;
    for (i = 0; i < count; i++){
        /* Widen to double, where both scale factors are exact. */
        double tmp = src[i];
        if (src[i] > 0.0f){
            tmp *= INT32_MAX;   /* positives scaled by 2^31 - 1 */
        } else {
            tmp *= INT32_MIN;   /* negatives scaled by 2^31 ... */
            tmp *= -1.0;        /* ... with the sign restored */
        }
        dst[i] = tmp;
    }
}

In IEEE 754, 2147483647 is not representable as a single-precision float: the significand has only 24 bits, so values near 2^31 round to the nearest multiple of 128. A quick test shows that 1.0f * INT32_MAX rounds to 2147483648.0f, which is out of range for an int32_t.

In other words, it is actually the conversion to int that causes the problem, not the float calculation, whose result is only 1 away from INT32_MAX. Converting an out-of-range floating-point value to a signed integer is undefined behavior in C; here it happened to come out as INT32_MIN.

The solution is to use double for the intermediate calculation: with its 53-bit significand, 2147483647.0 is exactly representable as a double-precision number.
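As a sketch of that suggestion: do the whole calculation in double and clamp before the cast (the clamp is my addition, to keep the conversion defined even for inputs slightly outside ±1.0):

```c
#include <stdint.h>
#include <stddef.h>

/* Scale [-1.0, 1.0] floats to int32_t, symmetrically by INT32_MAX.
 * The multiply is done in double, where 2147483647.0 is exact, and
 * the result is clamped so the cast to int32_t can never overflow. */
void convert(int32_t * dst, const float * src, size_t count) {
    size_t i;
    for (i = 0; i < count; i++) {
        double tmp = (double)src[i] * INT32_MAX;
        if (tmp > (double)INT32_MAX) tmp = INT32_MAX;
        if (tmp < (double)INT32_MIN) tmp = INT32_MIN;
        dst[i] = (int32_t)tmp;
    }
}
```

Unlike the asymmetric version in the question, this scales both signs by the same factor, so -1.0f maps to -INT32_MAX (one above INT32_MIN); which behavior is preferable depends on the application.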
