
Floating Point Addition / Multiplication / Division

I was doing some homework problems from my textbook and had a few questions on floating point rounding / precision for certain arithmetic operations.

If I have doubles that were cast from ints like so:

int x = random();
double dx = (double) x; 

And let's say the variables y , z , dy , and dz follow the same format.

Then would operations like:

(dx + dy) + dz == dx + (dy + dz)
(dx * dy) * dz == dx * (dy * dz)

be associative? I know that with fractional values they would not be, because some precision is lost to rounding depending on which operands are added / multiplied first. However, since these are cast from ints, I feel like precision would not be a problem and that these can be associative?

And lastly, the textbook I'm using does not explain FP division at all so I was wondering if this statement was true, or at least just how floating point division works in general:

dx / dx == dz / dz

I looked this up online and read in a few places that an operation like 3/3 can yield .999...9, but there wasn't enough information to explain how that happens or whether it varies with other division operations.

Assuming int is at most 32 bits wide and double follows IEEE-754, a double can represent every integer of magnitude up to 2^53 exactly.
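A minimal sketch of that 2^53 limit, assuming IEEE-754 doubles and a platform that provides int64_t. The helper name survives_double_round_trip is made up for illustration:

```c
#include <stdint.h>

/* Returns 1 if n is unchanged after converting to double and back.
 * Illustrative helper; only meaningful for |n| well below 2^63,
 * because converting an out-of-range double back to int64_t is undefined. */
int survives_double_round_trip(int64_t n)
{
    double d = (double)n;      /* rounds to nearest if n is not representable */
    return (int64_t)d == n;
}
```

Calling it with 9007199254740992 (2^53) should return 1, while 9007199254740993 (2^53 + 1) should return 0, since that odd value has no exact double. Any 32-bit int passes.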


In the case of addition:

(dx + dy) + dz == dx + (dy + dz)

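Each 32-bit operand has magnitude at most 2^31, so every intermediate sum and the final sum stay below 2^33, far under 2^53. Both sides of == therefore hold the exact integer result, so the addition is associative.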
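As a rough check of the addition case, here is a sketch assuming 32-bit int and IEEE-754 double arithmetic without extended-precision intermediates. Both function names are hypothetical:

```c
/* Compares the two groupings of a three-term double sum. */
int double_add_is_associative(double a, double b, double c)
{
    return (a + b) + c == a + (b + c);
}

/* Same check for doubles that were converted from ints, as in the question. */
int int_origin_add_is_associative(int x, int y, int z)
{
    return double_add_is_associative((double)x, (double)y, (double)z);
}
```

With int inputs the check should hold even at the extremes of the int range, whereas fractional inputs such as 0.1, 0.2, 0.3 make the two groupings differ.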


While in the case of multiplication:

(dx * dy) * dz == dx * (dy * dz)

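The product of two 32-bit values can be as large as 2^62, so an intermediate result may exceed 2^53 and get rounded. The two groupings can round differently, so they are not guaranteed to be equal.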
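A concrete counterexample can be sketched as follows, assuming 32-bit int, IEEE-754 doubles with round-to-nearest-even, and no extended-precision intermediates (for example SSE2 on x86-64). The function name and the specific inputs are chosen for illustration:

```c
/* Compares the two groupings of a three-term product of int-origin doubles. */
int int_origin_mul_is_associative(int x, int y, int z)
{
    double dx = (double)x, dy = (double)y, dz = (double)z;
    return (dx * dy) * dz == dx * (dy * dz);
}
```

For x = 134217729 (2^27 + 1), y = 134217741 (2^27 + 13), z = 3, the product x * y needs 55 bits and is rounded before the multiplication by 3, while y * z is exact, so the two sides end up one rounding step apart and the function should return 0. Small inputs such as 1000, 2000, 3000 stay exact and return 1.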

You should understand that floating point numbers are typically represented internally as a sign bit, a fixed-point mantissa (52 bits plus an implied leading one for IEEE 64-bit doubles), and a binary exponent (11 bits for IEEE doubles). You can think of the exponent as setting the "quantum", the size of the smallest representable step, for a given value.

Addition is associative as long as every sum fits into the mantissa without the quantum going above 2^0 == 1. If random() produces 32-bit integers, a sum such as (dx + dy) + dz fits, so the addition is associative.

In the case of multiplication, it's easy to see that the product of two 32-bit numbers may need well over 53 bits, so the quantum may have to go above 1 for the mantissa to hold the magnitude of the result. The low-order bits are then rounded away, and associativity can fail.

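For division, in the particular case of dx / dx, IEEE-754 division is correctly rounded, so the result is exactly 1.0 for any nonzero finite dx; the compiler may even replace the expression with a constant 1.0 (perhaps after a zero check). The exception is zero: 0.0 / 0.0 is NaN, and NaN compares unequal to everything, including itself, so dx / dx == dz / dz is false if either x or z is 0.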
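The division case can be probed with a small sketch, assuming IEEE-754 semantics and a compiler that is not run in a "fast math" mode (which may assume NaN never occurs). The function name is hypothetical:

```c
/* Evaluates the question's statement dx / dx == dz / dz for int-origin doubles. */
int self_quotients_equal(int x, int z)
{
    double dx = (double)x, dz = (double)z;
    /* Nonzero operands give exactly 1.0; a zero operand gives NaN. */
    return dx / dx == dz / dz;
}
```

Any pair of nonzero ints should return 1. If either argument is 0 the result should be 0, including the case where both are 0, because NaN == NaN is false.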

