
Floating point computations and rounding

I think I read somewhere that CPUs "do some floating-point computations in 50 bits so that they can round down to 32 bits correctly". What I think this means is that the intermediate value in such a floating-point operation is computed in 50 bits, so that correct rounding to float size can be achieved.

What does this statement mean? Is it true? If so, can you point me to some resources which explain why they need to compute 18 extra bits? Why not 19 or 17? Or is it just plain false?

Edit: I found this link, which is quite helpful and exhaustive: http://www.gamasutra.com/view/news/167402/Indepth_Intermediate_floatingpoint_precision.php

thanks

I can't guarantee it by any means, but I'd guess what you ran into was really 53 bits rather than 50. The reason they'd use 53 bits is that it's the next standard size of floating-point type. In the IEEE 754 standard, the smallest basic binary format is 32 bits total. The next size up is 64 bits total, which has a 53-bit significand (aka mantissa). Since they already have hardware in place to deal specifically with that size, it's probably easiest (in most cases) to carry out the calculation at that size and then round to the smaller size.

It is common on modern computers that computing in double-precision (1 sign bit, 11 exponent bits, 52 explicit significand bits) is as fast as computing in single-precision (1 sign bit, 8 exponent bits, 23 explicit significand bits). Therefore, when you load float objects, calculate, and store float objects, the compiler may load the float values into double-precision registers, calculate in double-precision, and store single-precision results. This benefits you by providing extra precision at very little cost. The results may more often be “correctly rounded” (the result returned is the representable value nearest the mathematically exact result), but this is not guaranteed, because there are still rounding errors, which can interact in unexpected ways. The results may also often be more accurate (closer to the exact result than float calculations would provide), but that is not guaranteed either; in rare cases, a double-precision calculation can return a worse result than a single-precision calculation.

There are times when double-precision is more expensive than single-precision, notably when performing SIMD programming.

Commonly, high-level languages leave the compiler free to decide how to evaluate floating-point expressions, so a compiler may use single-precision or double-precision depending on the vendor's choices (or the quality of the compiler), the optimization and target switches you have passed to the compiler, other aspects of the code being compiled (e.g., the availability of machine registers to do the calculations in), and other factors that may be effectively random for practical purposes. So this is not behavior you can rely on.

Another meaning for what you heard might be that library routines for single-precision functions, such as sinf or logf, may be written in double-precision so that it is easier for them to get the desired results than if they had to be written entirely in single-precision. That is common. However, such library routines are carefully written by experts who analyze the errors that may occur during the calculations, so it is not simply a matter of assuming that more bits give better results.

This has to do with epsilon values. For example take the classic 0.1 + 0.2 problem: http://0.30000000000000004.com/

In most languages, 0.1 + 0.2 != 0.3. That is because while 0.1 and 0.2 are terminating decimals in base 10, in base 2, 0.1 looks like 0.0001100110011... and 0.2 looks like 0.001100110011..., so when you add the two values together you get a repeating binary number that only equals 0.3 at infinite precision, similar to how 0.333333... + 0.333333... only reaches 2/3 as you take more and more digits.

In terms of why 18 extra bits vs 19 extra bits, that's a more complex discussion. See http://en.wikipedia.org/wiki/Machine_epsilon for more details.
