
Float and double rounding in C

I came across some behavior that seems very strange to me:

    int generate_scenario_one_pass(FILE *out, double freq_mhz) {
        unsigned int d_freq, d_freq_test;
        d_freq              = (int)(freq_mhz * 20);         /* truncates toward zero */
        d_freq_test         = (int)(float)(freq_mhz * 20);  /* goes through float first */
        printf("when freq_mhz = %.1f, d_freq = 0x%04X, d_freq_test = 0x%04X\n", freq_mhz, d_freq, d_freq_test);
        return 0;
    }

The whole code is not here, but it's not relevant. This function is called several times with increasing values, starting from 2110.0 with an increment of 0.1.

when freq_mhz = 2110.0, d_freq = 0xA4D8, d_freq_test = 0xA4D8
when freq_mhz = 2110.1, d_freq = 0xA4DA, d_freq_test = 0xA4DA
when freq_mhz = 2110.2, d_freq = 0xA4DC, d_freq_test = 0xA4DC
when freq_mhz = 2110.3, d_freq = 0xA4DD, d_freq_test = 0xA4DE

At the last iteration, d_freq is wrong (2110.3 * 20 should be 42206, i.e. 0xA4DE), but d_freq_test has the correct value. So my issue was solved by casting from double to float, then from float to int. I wanted to know why.

This was compiled using MSVC++ 6.0 on an x86 CPU.

Many numbers cannot be represented exactly as a binary floating-point number, and 0.1 is among them: it is rounded to the closest value that can be represented (as a double, roughly 0.1000000000000000055). Each time 0.1 is added, the sum is rounded again, so after three additions starting from 2110.0 the double holds a value slightly smaller than 2110.3. Multiplying that by 20 gives a number just below 42206, and casting to int truncates toward zero, giving the "wrong" result 42205 (0xA4DD). Converting the product to float first rounds it to the nearest float, which is exactly 42206, thus giving the expected outcome.

Actually, casting twice (to float, then to int) was not the solution.

#include <stdio.h>

int main(int argc, char **argv) {
    int d_freq, d_freq_test;
    double freq_mhz = 2110.0;
    double step = 0.1;

    while (freq_mhz < 2111.0) {
        d_freq = (int)(freq_mhz * 20.0);
        d_freq_test = (int)(float)(freq_mhz * 20.0);
        printf("freq: %.1f, d_freq: 0x%04X, d_freq_test: 0x%04X\n", freq_mhz, d_freq, d_freq_test);
        freq_mhz += step;
    }

    return 0;
}

This produces the wrong result:

freq: 2110.0, d_freq: 0xA4D8, d_freq_test: 0xA4D8
freq: 2110.1, d_freq: 0xA4DA, d_freq_test: 0xA4DA
freq: 2110.2, d_freq: 0xA4DC, d_freq_test: 0xA4DC
freq: 2110.3, d_freq: 0xA4DD, d_freq_test: 0xA4DD <-- :(
freq: 2110.4, d_freq: 0xA4DF, d_freq_test: 0xA4DF
freq: 2110.5, d_freq: 0xA4E1, d_freq_test: 0xA4E1
freq: 2110.6, d_freq: 0xA4E3, d_freq_test: 0xA4E3
freq: 2110.7, d_freq: 0xA4E5, d_freq_test: 0xA4E5
freq: 2110.8, d_freq: 0xA4E7, d_freq_test: 0xA4E7
freq: 2110.9, d_freq: 0xA4E9, d_freq_test: 0xA4E9
freq: 2111.0, d_freq: 0xA4EB, d_freq_test: 0xA4EB

While this code:

#include <stdio.h>

int main(int argc, char **argv) {
    int d_freq, d_freq_test;
    double freq_mhz = 2110.0;
    double step = 0.1;

    while (freq_mhz < 2111.0) {
        d_freq = (int)(freq_mhz * 20.0);
        d_freq_test = (int)(float)(freq_mhz * 20.0 + 0.5);
        printf("freq: %.1f, d_freq: 0x%04X, d_freq_test: 0x%04X\n", freq_mhz, d_freq, d_freq_test);
        freq_mhz += step;
    }

    return 0;
}

produces:

freq: 2110.0, d_freq: 0xA4D8, d_freq_test: 0xA4D8
freq: 2110.1, d_freq: 0xA4DA, d_freq_test: 0xA4DA
freq: 2110.2, d_freq: 0xA4DC, d_freq_test: 0xA4DC
freq: 2110.3, d_freq: 0xA4DD, d_freq_test: 0xA4DE <-- :)
freq: 2110.4, d_freq: 0xA4DF, d_freq_test: 0xA4E0
freq: 2110.5, d_freq: 0xA4E1, d_freq_test: 0xA4E2
freq: 2110.6, d_freq: 0xA4E3, d_freq_test: 0xA4E4
freq: 2110.7, d_freq: 0xA4E5, d_freq_test: 0xA4E6
freq: 2110.8, d_freq: 0xA4E7, d_freq_test: 0xA4E8
freq: 2110.9, d_freq: 0xA4E9, d_freq_test: 0xA4EA
freq: 2111.0, d_freq: 0xA4EB, d_freq_test: 0xA4EC

which is right.

So it was indeed a rounding (precision) issue, which was solved by adding 0.5 to the result of the x20 multiplication before converting to int.

When you convert from double to int, you get truncation.

The value of freq_mhz*20 at 2110.3 is represented by 0x40E49BBFFFFFFFFF, which is 42205.9999999999927240423858166. When you truncate that to an int, the .999999 gets chopped off and you get 42205 (or 0xA4DD; why choose to represent these in hex?).

If you convert to a float in the meantime, a rounding operation is performed. What you actually want to do is explicitly call round on the value and then convert to an int.

Because 0.1 cannot be exactly represented in binary floating-point. What you are seeing are approximations, exacerbated by the truncation that casting causes, and the rounding that printf causes.

One way to solve this is to explicitly round instead of truncating when casting to int (you could use round()).

A tenth cannot be represented exactly in binary. It's like 1/3 in base ten: the more places after the decimal point you use, the closer you get, but you never get there. There are all sorts of coping strategies, but basically, if you want exact representation, binary floating-point formats won't do it. Fixed-point (decimal) formats are required.
