Floating point operation in C

We know that in C, a single-precision float has a range of roughly 1.xxxx * 10^-38 to 3.xxxx * 10^38.

On my lecture slides there is this operation:

(10^10 + 10^30) + (-10^30) ?= 10^10 + (10^30 + -10^30)
10^30 - 10^30 ?= 10^10 + 0

I'm wondering why 10^10 + 10^30 = 10^30 in this case.
My thinking was that, since the range of FP goes down to 10^-38 and up to 10^38, there shouldn't be an overflow, so 10^10 + 10^30 shouldn't end up being 10^30.

As noted in the comments on your question, the part of the number that stores the digits is finite. It is called the significand.

Consider the following simple 14-bit format:

[sign bit] [ 5 bit exponent] [ 8 bit significand]

Let the bias be 16, i.e. a stored exponent of 16 stands for an actual exponent of 0 (so we get a good range of positive and negative powers), and assume there is no implied bit.

Now take two numbers whose magnitudes differ by more than a factor of 2^8, such as 2048 and 0.5.

In our format:

2048 = 2^11 = [0][11011][1000 0000]

0.5 = 2^-1 = [0][01111][1000 0000]

When we add these numbers, we shift the smaller one so that both have the same exponent. A decimal analogy is:

5 x 10 ^ 3 + 5 x 10 ^ -2 => 5 x 10^3 + 0.00005 x 10 ^ 3

In binary, 0.5 becomes 0.000000000001 x 2 ^ 11 after the shift, and the 8-bit significand can't hold 12 places:

2 ^ 11 + 0.000000000001 x 2 ^ 11 = 1.000000000001 x 2 ^ 11

So the result rounds back to 2 ^ 11.

The essence is the notion of significant digits. An IEEE 754 float has roughly 7 significant decimal digits. If we use hypothetical decimal floating point numbers with 7 significant digits, the calculation goes like this:

10^10 + 10^30 == 1.000 000 * 10^10 + 1.000 000 * 10^30
              == (0.000 000 000 000 000 000 01 + 1.000 000) * 10^30 (match the exponent part)
              => (0.000 000 + 1.000 000) * 10^30 (round the left operand)
              ==  1.000 000 * 10^30
              == 10^30

Note, however, that the matching and rounding operations are done as a single step, i.e. the machine never actually handles 0.000 000 000 000 000 000 01 * 10^30, which has too many significant digits.

By the way, if you experiment with floating-point arithmetic in C, you may find the %a format specifier (introduced in C99) useful; it prints the value in hexadecimal floating-point notation. But note that printf always implicitly converts float arguments to double.

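#include <stdio.h>

int main(void) {
    /* 1e10f is 10^10 and 1e30f is 10^30 (10e10 would be 10^11). */
    float x = 1e10f, y = 1e30f;
    float sum = x + y;  /* store in a float so the sum is rounded to float */
    /* printf promotes each float argument to double before printing it. */
    printf("(%a + %a) == %a == %a\n", x, y, sum, y);
    return 0;
}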
