
Precision loss with java.lang.Double

Say I have 2 double values. One of them is very large and one of them is very small.

double x = 99....9;  // I don't know the possible max and min values,
double y = 0.00..1;  // so just assume these values are near max and min.

If I add those values together, do I lose precision?

In other words, does the max possible double value increase if I assign an int value to it? And does the min possible double value decrease if I choose a small integer part?

double z = x + y;    // Real result is something like 999999999999999.00000000000001

double values are not evenly distributed over the number line. A double uses a floating-point representation (IEEE 754 binary64 in Java), which means a fixed number of bits (11) is used for the exponent and a fixed number of bits (52) is used for the significant digits, the mantissa. That gives roughly 15 to 17 significant decimal digits, no matter how large or small the value is.

So in your example, adding a large and a small value drops the smaller value, since its digits cannot be expressed at the larger value's exponent.
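A minimal sketch of that effect follows. The class name, the helper isAbsorbed, and the sample values 1e16, 1e15 and 1.0 are illustrative assumptions, not values from the question. Math.ulp reports the gap between a double and the next larger one, which shows how big an addend has to be before it can change the sum.

```java
public class AbsorptionDemo {
    // Returns true when adding 'small' to 'large' leaves 'large' unchanged.
    static boolean isAbsorbed(double large, double small) {
        return large + small == large;
    }

    public static void main(String[] args) {
        double x = 1e16;  // illustrative large value (above 2^53)
        double y = 1.0;   // illustrative small value

        // Gap between x and the next representable double above it.
        System.out.println("ulp(1e16) = " + Math.ulp(x));

        // 1.0 is only half of that gap, so the sum rounds back to x.
        System.out.println("1e16 + 1.0 absorbed? " + isAbsorbed(x, y));

        // Below 2^53 every integer is representable, so nothing is lost here.
        System.out.println("1e15 + 1.0 absorbed? " + isAbsorbed(1e15, 1.0));
    }
}
```

So the loss depends on the ratio between the two operands, not on whether either of them is close to the largest or smallest double.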

To avoid losing precision, use a number type whose precision can grow as needed, such as BigDecimal, which is not limited to a fixed number of bits.
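As a rough sketch of the BigDecimal route, the example below adds the two values suggested by the comment in the question. The class name and the helper exactSum are made up for illustration, and the operands are passed as strings so that no double rounding happens before BigDecimal sees them.

```java
import java.math.BigDecimal;

public class BigDecimalSum {
    // Adds two decimal strings exactly; BigDecimal grows as many digits as needed.
    static BigDecimal exactSum(String a, String b) {
        return new BigDecimal(a).add(new BigDecimal(b));
    }

    public static void main(String[] args) {
        // Values taken from the "real result" comment in the question.
        BigDecimal z = exactSum("999999999999999", "0.00000000000001");
        System.out.println(z.toPlainString());

        // The same addition with double drops the small addend entirely.
        double d = 999999999999999.0 + 0.00000000000001;
        System.out.println(d == 999999999999999.0);
    }
}
```

The price is speed and memory: BigDecimal arithmetic is considerably slower than double arithmetic, so it tends to be used where exact decimal results matter more than throughput.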

To illustrate, I'll use a decimal floating-point arithmetic with a precision of three decimal digits and (roughly) the same features as the typical binary floating-point arithmetic. Say you have 123.0 and 4.56. These numbers are represented by a mantissa (0<=m<1) and an exponent: 0.123*10^3 and 0.456*10^1, which I'll write as <.123e3> and <.456e1>. Two such numbers cannot be added directly unless their exponents are equal, so the smaller one is first shifted to the larger exponent, and the addition proceeds as follows:

 <.123e3>   <.123e3>
 <.456e1>   <.004e3>
            --------
            <.127e3>

You see that the necessary alignment of the decimal digits according to a common exponent produces a loss of precision. In the extreme case, the entire addend could be shifted into nothingness. (Think of summing an infinite series where the terms get smaller and smaller but would still contribute considerably to the sum being computed.)
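The three-digit arithmetic above can be imitated with BigDecimal and a MathContext of precision 3. This is only a sketch: the class and method names are invented, and RoundingMode.DOWN (truncation) is an assumption chosen to match the <.004e3> step, whereas real binary hardware normally rounds to nearest.

```java
import java.math.BigDecimal;
import java.math.MathContext;
import java.math.RoundingMode;

public class ThreeDigitDecimal {
    // Three significant digits; extra digits are truncated (assumption, see lead-in).
    static final MathContext THREE_DIGITS = new MathContext(3, RoundingMode.DOWN);

    // Adds two decimal strings and keeps only three significant digits.
    static BigDecimal add(String a, String b) {
        return new BigDecimal(a).add(new BigDecimal(b), THREE_DIGITS);
    }

    // Adds 'term' to 'start' repeatedly, rounding after every step.
    static BigDecimal addRepeatedly(String start, String term, int times) {
        BigDecimal sum = new BigDecimal(start);
        BigDecimal t = new BigDecimal(term);
        for (int i = 0; i < times; i++) {
            sum = sum.add(t, THREE_DIGITS);
        }
        return sum;
    }

    public static void main(String[] args) {
        // 123.0 + 4.56: the exact sum 127.56 is cut to three digits.
        System.out.println(add("123.0", "4.56").toPlainString());

        // Extreme case: the addend is shifted out completely.
        System.out.println(add("123.0", "0.0456").toPlainString());

        // Ten small terms that should add 4 in total contribute nothing.
        System.out.println(addRepeatedly("100", "0.4", 10).toPlainString());
    }
}
```

The last call is the series problem in miniature: each term is too small to survive alignment, even though the terms together would have changed the sum noticeably.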

Other sources of imprecision result from differences between binary and decimal fractions, where an exact fraction in one base cannot be represented without error using the other one.
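A small sketch of the base-conversion issue: 0.1 has no finite binary expansion, so the double literal 0.1 is only the nearest representable value. The class name is made up; new BigDecimal(double) is used here because it exposes the exact binary value that the double holds.

```java
import java.math.BigDecimal;

public class BinaryFractionDemo {
    // Returns the exact decimal expansion of the value stored in a double.
    static BigDecimal exactValueOf(double d) {
        return new BigDecimal(d);
    }

    public static void main(String[] args) {
        // Exact value stored for the literal 0.1; it is not exactly one tenth.
        System.out.println(exactValueOf(0.1));

        // Each operand carries a tiny error, so the sum misses 0.3.
        System.out.println(0.1 + 0.2 == 0.3);

        // Fractions that are sums of powers of two are stored exactly.
        System.out.println(0.5 + 0.25 == 0.75);
    }
}
```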

So, in short, addition and subtraction between numbers from rather different orders of magnitude are bound to cause a loss of precision.

If you try to assign a literal that is too large or too small to a double, the compiler will give an error. Try this:

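    double d1 = 1e-1000;  // does not compile: nonzero literal rounds to zero
    double d2 = 1e+1000;  // does not compile: literal is too large for a double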
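The compiler check only applies to literals. When a computation goes out of range at run time, there is no error, just a special result. The sketch below (class and method names are illustrative) shows what happens at both ends of the range, and also that adding a small number to Double.MAX_VALUE does not raise the maximum.

```java
public class RangeDemo {
    // Overflow: a result above the largest finite double becomes Infinity.
    static double overflow() {
        return Double.MAX_VALUE * 2;
    }

    // Underflow: half of the smallest positive double rounds to zero.
    static double underflow() {
        return Double.MIN_VALUE / 2;
    }

    public static void main(String[] args) {
        System.out.println(Double.MAX_VALUE);   // largest finite double
        System.out.println(Double.MIN_VALUE);   // smallest positive double
        System.out.println(overflow());
        System.out.println(underflow());

        // The addend is far below the spacing of doubles near MAX_VALUE,
        // so the sum is still exactly MAX_VALUE.
        System.out.println(Double.MAX_VALUE + 1.0 == Double.MAX_VALUE);
    }
}
```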


 