
Binary64 floating-point addition rounding error and behavior difference between 32-bit and 64-bit builds

I noticed a rounding error when I tried to add the two following floating-point numbers on an Intel Core i7 / i5:

2.500244140625E+00 + 4503599627370496.00 <=> 0x1.4008p+1 + 0x1.0p+52

The addition is performed on two double precision constants by the faddl assembly instruction (when I compile with a 32-bit compiler).

The result I obtain is:

4.503599627370498E+15 = 0x1.0000000000002p+52

Instead of:

4.503599627370499E+15 = 0x1.0000000000003p+52

(as I expected, and as confirmed by http://weitz.de/ieee/.)

Demonstration:

0x1.0p+52 = 0x10000000000000.00p+0

0x1.4008p+1 = 0x2.801p+0

0x10000000000000.00p+0 + 0x2.801p+0 = 0x10000000000002.801p+0 (exactly)

0x10000000000002.801p+0 = 0x1.0000000000002801p+52 (exactly)

0x10000000000002.801p+0 = 0x1.0000000000003p+52 (after rounding)

I double-checked and verified in debugging mode that my FPU is in "round to nearest" mode.

Something which is even stranger: when I compile my code with a 64-bit compiler, the addsd instruction is used instead, and there is no rounding error.

Can anyone give me a reference or an explanation for the precision difference of 'double' addition on the same FPU but using different instruction sets?
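For reference, here is a minimal reproduction sketch (a hypothetical test file, not my exact code) that prints the sum in hex-float form; built as 32-bit it typically goes through the x87 unit (faddl), built as 64-bit through SSE2 (addsd):

#include <stdio.h>

int main(void)
{
    /* volatile keeps the compiler from folding the sum at compile time */
    volatile double a = 0x1.4008p+1;  /* 2.500244140625     */
    volatile double b = 0x1.0p+52;    /* 4503599627370496.0 */
    double sum = a + b;
    printf("%a\n%.17g\n", sum, sum);  /* hex-float and decimal forms */
    return 0;
}

Building the same file with, e.g., gcc -m32 repro.c versus gcc -m64 repro.c is enough to compare the two code paths; with -m32 the last hex digit typically comes out as 2, with -m64 as 3, matching the observation above.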

The FPU registers are 80 bits wide; whenever a single or double precision number is loaded with fld and its variants, it is converted into double extended precision by default [1].
Thus fadd usually works with 80-bit numbers.

The SSE registers are format-agnostic, and the SSE extensions don't support the double extended precision format.
For example, addpd works with double precision numbers.
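As an aside (my own sketch, not part of the original answer), the evaluation width a build actually uses can be checked through FLT_EVAL_METHOD from <float.h>: with GCC it is typically 2 when intermediate results are kept in the 80-bit x87 format and 0 when plain SSE double precision is used.

#include <float.h>
#include <stdio.h>

int main(void)
{
    /* 0: evaluate each operation in its own type (SSE)
       2: evaluate float/double operations in long double (x87) */
    printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
    return 0;
}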


The default rounding mode is round to nearest (even): the usual round to nearest, but toward the even option in case of a tie (e.g. 4.5 => 4).

To implement the IEEE 754 requirement to perform arithmetic as if with infinite precision, the hardware needs two guard bits and a sticky bit [2].
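To make the rule concrete, here is a small illustration of my own (the helper name is hypothetical) of round-to-nearest-even applied to a significand that carries its two guard bits and the sticky bit in its three low bits:

#include <stdint.h>

/* sig holds the kept significand bits shifted left by 3, with
   guard1, guard0 and the sticky bit in the three low bits.     */
static uint64_t round_nearest_even(uint64_t sig)
{
    uint64_t kept  = sig >> 3;     /* the bits we keep     */
    unsigned extra = sig & 0x7u;   /* guard1 guard0 sticky */

    if (extra > 0x4u)                        /* more than halfway: round up */
        kept += 1;
    else if (extra == 0x4u && (kept & 1))    /* exact tie: round to even    */
        kept += 1;
    return kept;
}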


double

I'll write a double precision number as

<sign> <unbiased exponent in decimal> <implicit integer part> <52-bit mantissa> | <guard bits> <sticky bit>

The two numbers

2.500244140625
4503599627370496

are

+  1 1 0100000000 0010000000 0000000000 0000000000 0000000000 00
+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 00

The first one is shifted

+ 52 0 0000000000 0000000000 0000000000 0000000000 0000000000 10 |10 1   
+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 00 |00 0

The sum is done

+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 10 |10 1

Rounding to nearest (even) gives

+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 11

because 0 |10 1 is closer to 1 |00 0 than to 0 |00 0.
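This single-rounding result can be checked by inspecting the bit pattern of the sum (a sketch that assumes the addition is rounded once to binary64, as the SSE unit does):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    volatile double a = 0x1.4008p+1, b = 0x1.0p+52;
    double sum = a + b;

    uint64_t bits;
    memcpy(&bits, &sum, sizeof bits);   /* reinterpret the IEEE 754 encoding */
    /* with a single binary64 rounding the low hex digit should be 3 */
    printf("%a -> 0x%016llx\n", sum, (unsigned long long)bits);
    return 0;
}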

double extended

The two numbers are

+  1 1 0100000000 0010000000 0000000000 0000000000 0000000000 0000000000 000
+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 0000000000 000

The first is shifted

+ 52 0 0000000000 0000000000 0000000000 0000000000 0000000000 1010000000 000 | 10 0
+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 0000000000 000 | 00 0

The sum is done

+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 1010000000 000 | 10 0

Rounding to nearest (even):

+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 1010000000 000

as 0 | 10 0 is a tie, broken toward the nearest even.

When this number is then converted from double extended precision to double precision (due to an fstp QWORD []), the rounding is repeated, this time using bits 53 and 54 of the double extended mantissa as the guard bits and the OR of the remaining lower bits as the sticky bit:

+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 1010000000 000

+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 10|100

+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 10

because 0 | 10 0 is again a tie, broken toward the nearest even.
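This double rounding can be reproduced directly in C (a sketch of mine, assuming an x86-64 GCC build where plain double arithmetic uses SSE and long double is the 80-bit x87 double extended format): do the addition in long double first, then store the result to a double, and compare with a sum rounded only once to binary64.

#include <stdio.h>

int main(void)
{
    volatile double a = 0x1.4008p+1, b = 0x1.0p+52;

    /* one rounding: the exact sum rounded once to binary64 (SSE-style) */
    double once = a + b;

    /* two roundings: exact sum -> double extended -> binary64 (x87-style) */
    volatile long double ext = (long double)a + b;
    double twice = (double)ext;

    printf("once  = %a\ntwice = %a\n", once, twice);
    return 0;
}

Under those assumptions, once should end in ...0003 and twice in ...0002, matching the two derivations above.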


[1] See Chapter 8.5.1.2 of the Intel Manual, Volume 1.
[2] The guard bits are extra precision bits retained after one of the numbers is shifted to make the exponents match. The sticky bit is the OR of the bits less significant than the least guard bit. See the "on Rounding" section of this page and Goldberg for a formal approach.

Thanks to all the comments received on my question, I understood what happened and was able to solve the issue.

I will try to summarize it here.

First, the incorrect rounding was confirmed. As mentioned by @MarkDickinson, it can be due to "double rounding", but I do not know if that can be confirmed. Indeed, it could also be due to other phenomena such as the ones described in the publication given by Pascal Cuoq.

It seems that the IA-32 FPU does not comply perfectly with the IEEE 754 standard when it comes to rounding certain numbers.

By default, GCC (32-bit version) generates code that uses the x87 FPU to compute additions on binary64 numbers.

But, on my computer (Intel Core i7), the SSE unit is also able to perform those computations. This unit is used by default by GCC (64-bit version).

Using the two following options on the 32-bit GCC command line solves my problem:

-msse2 -mfpmath=sse
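For example (with a hypothetical source file name):

gcc -m32 -msse2 -mfpmath=sse test.c -o test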

(Thank you, EOF.)

First and foremost, you are looking at base-10 numbers. If you want to talk about floating point and rounding, it needs to be a base-2 discussion.

Second, single and double precision have mantissas of different lengths, so obviously for the same number the place where you round varies. In decimal, 1.2345678 could be rounded to 1.23 or to 1.2346 depending on how many digits we allow; one rounds down and the other rounds up, going by a round-half-up rule.

Since you are working in base 10 at some point here, you are possibly also mixing in compile-time conversions, run-time operations, and run-time conversions.

Take this example:

#include <stdio.h>
int main(void) {
    float x = 1.234567;  /* ASCII -> double -> float at compile time */
    x = x * 2.34;        /* x promoted to double for the multiply, stored back as float */
    printf("%f\n", x);   /* run-time conversion back to decimal text */
}

There are compile-time conversions: first and foremost ASCII to double, then double to float (to be completely accurate to the language, since I didn't put F suffixes at the end of the constants). Then there is the run-time multiply, and then a run-time conversion back to ASCII. The run-time C library might not be the same one used at compile time; do they honor the same rounding settings, and so on? It is pretty easy to find numbers where you simply declare x = 1.234...something, the next line of code is a printf, and what printf shows is not what you fed it, with no floating-point math involved other than the run-time conversion from float to ASCII.
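For instance (my own sketch, not old_timer's code), with no arithmetic at all the printed value can already differ from the typed constant:

#include <stdio.h>

int main(void)
{
    float x = 1.23456789;   /* the decimal constant is not exactly representable as a float */
    printf("%.9f\n", x);    /* likely prints 1.234567881 rather than 1.234567890 */
    return 0;
}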

So before you can ask this question, we need to see the binary versions of your numbers; the answer should almost automatically fall out from that without further help, but if you still need help then post them and we can look at it. Having a decimal-based discussion adds compiler and library issues and makes it harder to isolate the problem, if there is a problem.
