
Floating-Point Arithmetic: What is the worst precision difference from decimal to binary?

As we all know, decimal fractions (like 0.1), when stored as floating point (like double or float), are represented internally in binary format (IEEE 754), and some decimal fractions cannot be represented exactly in that binary format.

What I have not understood is the precision of this "conversion":

1.) A floating-point value itself has a limited precision (that is, the "significand")?

2.) But the conversion from a decimal fraction to a binary fraction also incurs a precision loss?

Question:

What is the worst-case precision loss (over "all" possible decimal fractions) when converting from decimal fractions to floating-point fractions?

(The reason I want to know this is that when comparing decimal fractions with binary/floating-point fractions, I need to take the precision into account to determine whether both figures are identical. And I want this precision to be as tight as possible: decimal fraction == binary fraction +/- precision.)

Example (only hypothetical)

0.1 dec => 0.10000001212121212121212 (binary fraction double) => precision loss 0.00000001212121212121212
0.3 dec => 0.300000282828282 (binary fraction double) => precision loss 0.000000282828282
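
For comparison with the real (not hypothetical) values: the Java sketch below (the class name ConversionError is mine) uses the fact that new BigDecimal(double) captures the exact binary value a double holds, while new BigDecimal(String) captures the exact decimal value that was intended, so their difference is the exact conversion error.

    import java.math.BigDecimal;

    public class ConversionError {
        public static void main(String[] args) {
            // new BigDecimal(double) captures the exact binary value the double holds;
            // new BigDecimal(String) captures the exact decimal value we intended.
            BigDecimal stored   = new BigDecimal(0.1);
            BigDecimal intended = new BigDecimal("0.1");
            System.out.println("stored = " + stored);
            // 0.1000000000000000055511151231257827021181583404541015625
            System.out.println("error  = " + stored.subtract(intended));
            // 5.5511151231257827021181583404541015625E-18
        }
    }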

It is not entirely clear to me what you are after, but you may be interested in the following paper which discusses many of the accuracy issues involved in binary/decimal conversion, including lists of hard cases.

Vern Paxson and William Kahan, "A Program for Testing IEEE Decimal-Binary Conversion", May 22, 1991. http://www.icir.org/vern/papers/testbase-report.pdf

Floating point becomes less and less accurate (in absolute terms) the larger the magnitude gets, in both the positive and negative directions. This is because floating-point values are an exponential format: the spacing between representable values grows with the exponent.

However, a decimal representation becomes more and more exact the more decimal places it uses, regardless of how large the number is.

Therefore, the worst precision difference would be towards the numerical limits of whatever floating point type you're using.
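
You can observe this directly with Java's Math.ulp, which returns the gap between a double and the next larger double; a small sketch:

    public class UlpDemo {
        public static void main(String[] args) {
            // The gap between adjacent doubles grows with the magnitude of the value.
            System.out.println(Math.ulp(1.0));     // 2.220446049250313E-16
            System.out.println(Math.ulp(1.0e6));   // 1.1641532182693481E-10
            System.out.println(Math.ulp(1.0e15));  // 0.125
            System.out.println(Math.ulp(1.0e300)); // roughly 1.5e284
        }
    }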

Due to the way we are taught to count as children, it is difficult to fully appreciate the precision characteristics of binary fractions. The problem is that a fraction can only be expressed in terms of powers of the base of the counting system. It seems obvious to say, but the basic problem is that decimal divides things into tens whilst binary divides things into twos (halves).

Broadly speaking, there are two situations in which you want a fractional value in computing: when it is a currency value and when it is not. The latter could range from an input from an encoder on a spinning shaft to a position in a virtual space handed to a graphics engine. There is no problem with the fractional value being in binary because it truly is a fractional value. This is partly why FPUs became popular for 3D graphics years ago.

The problem comes with representing currency, where the fractional part is actually made of discrete decimal units. You can have 0.01 of a dollar (depending on which dollar it is!) in the real world, but 0.01 cannot be represented exactly in binary. This is why you should never use binary floating point for currency.
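
A minimal Java sketch of the classic symptom, and of the usual remedy (BigDecimal built from strings, so the decimal cents stay exact; the class name is mine):

    import java.math.BigDecimal;

    public class CurrencyDemo {
        public static void main(String[] args) {
            // Binary floating point cannot hold 0.10 or 0.20 exactly:
            System.out.println(0.10 + 0.20);   // 0.30000000000000004
            // BigDecimal built from strings keeps the decimal amounts exact:
            BigDecimal sum = new BigDecimal("0.10").add(new BigDecimal("0.20"));
            System.out.println(sum);           // 0.30
        }
    }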

If you are converting between decimal and binary floating point and trying to make comparisons, I'd be looking at why you're doing conversions and what the comparisons are supposed to achieve.

Provided that the decimal value falls into the range of representable floating-point values, and your language/implementation has correctly-rounded conversions (many do, some don't), the error from such a conversion is bounded by 1/2 of the distance between consecutive floating-point numbers, or "ulp" (Unit in the Last Place).

The relative size of an ulp is biggest between an exact power of two and the next larger number, so the largest relative error of conversion between decimal and double is achieved when the input is just barely smaller than 1 + 1/2 ulp, or that value scaled by a power of two. An example of such a value is:

1.0000000000000001110223024625156540423631668090820312

(That's almost infinitesimally smaller than 1 + 2^-53).
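
You can check that boundary behavior yourself; in Java, Double.parseDouble performs a correctly-rounded conversion, so the value above lands on exactly 1.0 (a quick sketch):

    public class BoundaryDemo {
        public static void main(String[] args) {
            // This literal is just below the midpoint between 1.0 and the next
            // double (1 + 2^-52), so correct rounding takes it down to 1.0:
            double d = Double.parseDouble(
                "1.0000000000000001110223024625156540423631668090820312");
            System.out.println(d == 1.0);  // true
        }
    }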

Since the error from conversion has a relative bound, the absolute error gets bigger as we scale this value up by powers of two, obviously.

Of course, if a number falls outside of the range of representable values (either by being too big or too small), then all precision is lost. Converting, say, 1e400 to double yields infinity; no trace of our actual input remains. Similarly, converting 1e-400 to double produces zero.
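
To connect this back to the question's "decimal fraction == binary fraction +/- precision": below is a Java sketch (the method equalsWithinHalfUlp is my own name, not a library call) of a comparison that accepts exactly the worst-case error of a correctly-rounded conversion, half an ulp of the double:

    import java.math.BigDecimal;

    public class HalfUlpCompare {
        // Does the decimal string match the double, up to the worst-case
        // error of a correctly-rounded decimal-to-binary conversion?
        static boolean equalsWithinHalfUlp(String decimal, double d) {
            BigDecimal exact  = new BigDecimal(decimal); // exact decimal value
            BigDecimal stored = new BigDecimal(d);       // exact value of the double
            // Half the gap between doubles around d; just below a power of two
            // this is slightly generous, which is fine for a tolerance.
            BigDecimal halfUlp =
                new BigDecimal(Math.ulp(d)).divide(BigDecimal.valueOf(2));
            return exact.subtract(stored).abs().compareTo(halfUlp) <= 0;
        }

        public static void main(String[] args) {
            System.out.println(equalsWithinHalfUlp("0.1", 0.1)); // true
            System.out.println(equalsWithinHalfUlp("0.1", 0.3)); // false
        }
    }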

The bigger the number gets, the higher the precision loss can be (although the particular number you specify might happen to be exactly representable).

In Java you store not only very small numbers as float or double, but also very big ones, like 9*10^105.

And I want this precision to be as tight/precise as possible

You may choose BigDecimal, where you can specify how precise you would like to be, though of course you are still limited by RAM, by CPU time, and by the limits of the JVM.
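
For example (a sketch; the 50-digit precision is an arbitrary choice):

    import java.math.BigDecimal;
    import java.math.MathContext;
    import java.math.RoundingMode;

    public class BigDecimalPrecision {
        public static void main(String[] args) {
            // Request 50 significant digits for a value whose decimal
            // expansion does not terminate:
            MathContext mc = new MathContext(50, RoundingMode.HALF_EVEN);
            BigDecimal third = BigDecimal.ONE.divide(new BigDecimal(3), mc);
            System.out.println(third);
            // 0.33333333333333333333333333333333333333333333333333
        }
    }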

Are you only interested in absolute precision, or in relative precision?

Compare the difference in precision between:

a = 100000000000000.0000000000000001
b = 100000000000000.0000000000000002

c = 0.0000000000000001
d = 0.0000000000000002

The absolute precision difference is the same, but the relative precision difference is very different.
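
A sketch of the two kinds of check (the names nearAbsolute and nearRelative are illustrative, not from any library):

    public class ToleranceDemo {
        // Absolute tolerance: is |x - y| below a fixed threshold?
        static boolean nearAbsolute(double x, double y, double absTol) {
            return Math.abs(x - y) <= absTol;
        }

        // Relative tolerance: is |x - y| small compared to the magnitudes?
        static boolean nearRelative(double x, double y, double relTol) {
            return Math.abs(x - y) <= relTol * Math.max(Math.abs(x), Math.abs(y));
        }

        public static void main(String[] args) {
            // A difference of 1.0 is tiny next to 1e14 ...
            System.out.println(nearRelative(1.0e14, 1.0e14 + 1.0, 1e-9)); // true
            // ... but the same relative check rejects 1e-16 vs 2e-16:
            System.out.println(nearRelative(1.0e-16, 2.0e-16, 1e-9));     // false
            // An absolute threshold of 1e-9 happily accepts the tiny pair:
            System.out.println(nearAbsolute(1.0e-16, 2.0e-16, 1e-9));     // true
        }
    }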
