
c++ float subtraction rounding error

I have a float value between 0 and 1. I need to convert it to the range -120 to 80. To do this, I first multiply by 200, then subtract 120. The subtraction introduces a rounding error. Let's look at my example.

    float val = 0.6050f;
    val *= 200.f;

Now val is 121.0, as I expected.

    val -= 120.0f;    

Now val is 0.99999992.

I thought maybe I could avoid this problem with multiplication and division.

    float val = 0.6050f;
    val *= 200.f;
    val *= 100.f;
    val -= 12000.0f;    
    val /= 100.f;

But it didn't help. I still have 0.99 on my hands.

Is there a solution for this?

Edit: After adding detailed logging, I understand there is no problem with this part of the code. Before, my log showed me "0.605"; with detailed logging I saw "0.60499995946884155273437500000000000000000000000000", so the problem is in a different place.

Edit2: I think I found the culprit. The initialised value is 0.5750.

#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>

std::string floatToStr(double d)
{
    std::stringstream ss;
    ss << std::fixed << std::setprecision(15) << d;
    return ss.str();
}

int main()
{    
    float val88 = 0.57500000000f;
    std::cout << floatToStr(val88) << std::endl;
}

The result is 0.574999988079071.

Actually I need to add and subtract 0.0025 from this value every time. Normally I expected 0.575, 0.5775, 0.5800, 0.5825, ...

Edit3: Actually I tried all of this with double, and it works for my example.

#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>

std::string doubleToStr(double d)
{
    std::stringstream ss;
    ss << std::fixed << std::setprecision(15) << d;
    return ss.str();
}

int main()
{    
    double val88 = 0.575;
    std::cout << doubleToStr(val88) << std::endl;
    val88 += 0.0025;
    std::cout << doubleToStr(val88) << std::endl;
    val88 += 0.0025;
    std::cout << doubleToStr(val88) << std::endl;
    val88 += 0.0025;
    std::cout << doubleToStr(val88) << std::endl;

    return 0;
}

The results are:

0.575000000000000
0.577500000000000
0.580000000000000
0.582500000000000

But I am bound to float, unfortunately. I would need to change lots of things.

Thank you all for the help.

Edit4: I have found my solution with strings. I use ostringstream's rounding and convert to double after that, which gives me 4 correct decimal places.

#include <cstdlib>   // for atof
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>

std::string doubleToStr(double d, int precision)
{
    std::stringstream ss;
    ss << std::fixed << std::setprecision(precision) << d;
    return ss.str();
}

    double val945 = (double)0.575f;
    std::cout << doubleToStr(val945, 4) << std::endl;
    std::cout << doubleToStr(val945, 15) << std::endl;
    std::cout << atof(doubleToStr(val945, 4).c_str()) << std::endl;

and results are:

0.5750
0.574999988079071
0.575

Let us assume that your compiler implements IEEE 754 binary32 and binary64 exactly for float and double values and operations.

First, you must understand that 0.6050f does not represent the mathematical quantity 6050 / 10000. It is exactly 0.605000019073486328125, the nearest float to that value. Even if you compute perfectly from there, you have to remember that these computations start from 0.605000019073486328125 and not from 0.6050.

Second, you can solve nearly all your accumulated roundoff problems by computing with double and converting to float only in the end:

$ cat t.c
#include <stdio.h>

int main(){
  printf("0.6050f is %.53f\n", 0.6050f);
  printf("%.53f\n", (float)((double)0.605f * 200. - 120.));
}

$ gcc t.c && ./a.out 
0.6050f is 0.60500001907348632812500000000000000000000000000000000
1.00000381469726562500000000000000000000000000000000000

In the above code, all computations and intermediate values are double-precision.

This 1.0000038… is a very good answer if you remember that you started with 0.605000019073486328125 and not 0.6050 (which doesn't exist as a float).

If you really care about the difference between 0.99999992 and 1.0, float is not precise enough for your application. You need to at least change to double.

If you need an answer in a specific range, and you are getting answers slightly outside that range but within rounding error of one of the ends, replace the answer with the appropriate range end.

The point everybody is making can be summarised: in general, floating point is precise but not exact.

How precise is governed by the number of bits in the mantissa -- which is 24 for float and 53 for double (assuming IEEE 754 binary formats, which is pretty safe these days! [1]).

If you are looking for an exact result, you have to be ready to deal with values that differ (ever so slightly) from that exact result, but...


(1) The Exact Binary Fraction Problem

...the first issue is whether the exact value you are looking for can be represented exactly in binary floating point form...

...and that is rare -- which is often a disappointing surprise.

The binary floating point representation of a given value can be exact, but only under the following, restricted circumstances:

  • the value is an integer, < 2^24 (float) or < 2^53 (double).

    this is the simplest case, and perhaps obvious. Since you are looking for a result >= -120 and <= 80, this is sufficient.

or:

  • the value is an integer which divides exactly by 2^n and is then (as above) < 2^24 or < 2^53.

    this includes the first rule, but is more general.

or:

  • the value has a fractional part, but when the value is multiplied by the smallest 2^n necessary to produce an integer, that integer is < 2^24 (float) or < 2^53 (double).

    This is the part which may come as a surprise.

    Consider 27.01, which is a simple enough decimal value, and clearly well within the ~7 decimal digit precision of a float. Unfortunately, it does not have an exact binary floating point form -- you can multiply 27.01 by any 2^n you like, for example:

      27.01 * (2^ 6) =      1728.64     (multiply by 64)
      27.01 * (2^ 7) =      3457.28     (multiply by 128)
      ...
      27.01 * (2^10) =     27658.24
      ...
      27.01 * (2^20) =  28322037.76
      ...
      27.01 * (2^25) = 906305208.32    (> 2^24 !)

    and you never get an integer, let alone one < 2^24 or < 2^53.

    Actually, all these rules boil down to one rule... if you can find an 'n' (positive or negative integer) such that y = value * (2^n), where y is an exact, odd integer, then value has an exact representation if y < 2^24 (float) or y < 2^53 (double) -- assuming no under- or over-flow, which is another story.

This looks complicated, but the rule of thumb is simply: " very few decimal fractions can be represented exactly as binary fractions ".

To illustrate how few, let us consider all the 4 digit decimal fractions, of which there are 10000, that is 0.0000 up to 0.9999 -- including the trivial, integer case 0.0000. We can enumerate how many of those have exact binary equivalents:

   1: 0.0000 =  0/16 or 0/1
   2: 0.0625 =  1/16
   3: 0.1250 =  2/16 or 1/8
   4: 0.1875 =  3/16
   5: 0.2500 =  4/16 or 1/4
   6: 0.3125 =  5/16
   7: 0.3750 =  6/16 or 3/8
   8: 0.4375 =  7/16
   9: 0.5000 =  8/16 or 1/2
  10: 0.5625 =  9/16
  11: 0.6250 = 10/16 or 5/8
  12: 0.6875 = 11/16
  13: 0.7500 = 12/16 or 3/4
  14: 0.8125 = 13/16
  15: 0.8750 = 14/16 or 7/8
  16: 0.9375 = 15/16

That's it! Just 16/10000 possible 4 digit decimal fractions (including the trivial 0 case) have exact binary fraction equivalents, at any precision. All the other 9984/10000 possible decimal fractions give rise to recurring binary fractions. So, for 'n' digit decimal fractions only (2^n) / (10^n) can be represented exactly -- that's 1/(5^n)!

This is, of course, because your decimal fraction is actually the rational x / (10^n) [2] and your binary fraction is y / (2^m) (for integer x, y, n and m), and for a given binary fraction to be exactly equal to a decimal fraction we must have:

  y = (x / (10^n)) * (2^m)   
    = (x / ( 5^n)) * (2^(m-n))

which is only the case when x is an exact multiple of (5^n) -- for otherwise y is not an integer. (Noting that n <= m, assuming that x has no (spurious) trailing zeros, and hence n is as small as possible.)


(2) The Rounding Problem

The result of a floating point operation may need to be rounded to the precision of the destination variable. IEEE 754 requires that the operation is done as if there were no limit to the precision, and the ("true") result is then rounded to the nearest value at the precision of the destination. So, the final result is as precise as it can be... given the limitations on how precise the arguments are, and how precise the destination is... but not exact!

(With floats and doubles, 'C' may promote float arguments to double (or long double) before performing an operation, and the result of that will be rounded to double. The final result of an expression may then be a double (or long double), which is then rounded (again) if it is to be stored in a float variable. All of this adds to the fun ! See FLT_EVAL_METHOD for what your system does -- noting the default for a floating point constant is double.)

So, the other rules to remember are:

  • floating point values are not reals (they are, in fact, rationals with a limited denominator).

    The precision of a floating point value may be large, but there are lots of real numbers that cannot be represented exactly !

  • floating point expressions are not algebra.

    For example, converting from degrees to radians involves multiplying by π. Any arithmetic with π has a problem ('cos it's irrational), and with floating point the value for π is rounded to whatever floating precision we are using. So, the conversion of (say) 27 (which is exact) degrees to radians involves division by 180 (which is exact) and multiplication by our "π". However exact the arguments, the division and the multiplication may round, so the result may be only approximate. Taking:

      float pi = 3.14159265358979 ;   /* plenty for float */
      float x  = 27.0 ;
      float y  = (x / 180.0) * pi ;
      float z  = (y / pi) * 180.0 ;
      printf("z - x = %+6.3e\n", z - x) ;

    my (pretty ordinary) machine gave: "z - x = +1.907e-06"... so, for our floating point:

     x != (((x / 180.0) * pi) / pi) * 180 ; 

    at least, not for all x. In the case shown, the relative difference is small -- ~ 1.2 / (2^24) -- but not zero, which simple algebra might lead us to expect.

  • hence: floating point equality is a slippery notion.

    For all the reasons above, the test x == y for two floating values is problematic. Depending on how x and y have been calculated, if you expect the two to be exactly the same, you may very well be sadly disappointed.


[1] There exists a standard for decimal floating point, but generally binary floating point is what people use.

[2] For any decimal fraction you can write down with a finite number of digits !

Even with double precision, you'll run into issues such as:

200. * .60499999999999992 = 120.99999999999997

It appears that you want some type of rounding so that 0.99999992 is rounded to 1.00000000.

If the goal is to produce values to the nearest multiple of 1/1000, try:

#include <math.h>

    val = (float) floor((200000.0f*val)-119999.5f)/1000.0f;

If the goal is to produce values to the nearest multiple of 1/200, try:

    val = (float) floor((40000.0f*val)-23999.5f)/200.0f;

If the goal is to produce values to the nearest integer, try:

    val = (float) floor((200.0f*val)-119.5f);
