Binary64 floating-point addition: rounding error and behavior difference between 32-bit and 64-bit builds

Question

I noticed a rounding error when adding the following two floating-point numbers on an Intel Core i7 / i5:

2.500244140625E+00 + 4503599627370496.00 <=> 0x1.4008p+1 + 0x1.0p+52

The addition is performed on two double-precision constants by the faddl assembly instruction (when I compile with a 32-bit compiler).

The result I obtain is:

4.503599627370498E+15 = 0x1.0000000000002p+52

instead of:

4.503599627370499E+15 = 0x1.0000000000003p+52

(which is what I expected, and what http://weitz.de/ieee/ confirms).

Demonstration:

0x1.0p+52 = 0x10000000000000.00p+0

0x1.4008p+1 = 0x2.801p+0

0x10000000000000.00p+0 + 0x2.801p+0 = 0x10000000000002.801p+0 (exactly)

0x10000000000002.801p+0 = 0x1.0000000000002801p+52 (exactly)

0x10000000000002.801p+0 = 0x1.0000000000003p+52 (after rounding)

I double-checked and verified in the debugger that my FPU is in "round to nearest" mode.

What is even stranger is that when I compile my code with a 64-bit compiler, so that the addsd instruction is used, there is no rounding error.

Can anyone give me a reference or an explanation for precision differences in 'double' addition on the same FPU when different instruction sets are used?

Answer 1

The FPU registers are 80 bits wide; whenever a single or double precision number is loaded with fld and variants, it is converted to double extended precision by default ¹.
Thus fadd usually works with 80-bit numbers.

The SSE registers are format agnostic and the SSE extensions don't support double extended precision.
For example, addpd works with double precision numbers.

The default rounding mode is round to nearest (even), which means the usual round to nearest, but toward the even neighbor in case of a tie (e.g. 4.5 => 4).

To implement the IEEE 754 requirement to perform arithmetic as if with infinite precision, the hardware needs two guard bits and a sticky bit ².

double

I'll write a double precision number as

<sign> <unbiased exponent in decimal> <implicit integer part> <52-bit mantissa> | <guard bits> <sticky bit>

The two numbers

2.500244140625
4503599627370496

are

+  1 1 0100000000 0010000000 0000000000 0000000000 0000000000 00
+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 00

The first one is shifted right by 51 bits so that its exponent matches the second

+ 52 0 0000000000 0000000000 0000000000 0000000000 0000000000 10 |10 1   
+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 00 |00 0

The sum is computed

+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 10 |10 1

Rounding to nearest (even) gives

+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 11

because 0 |10 1 is closer to 1 |00 0 than to 0 |00 0.
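The alignment and rounding steps of the double case can be replayed with plain integer arithmetic. This is only a sketch of the idea, not how any particular FPU is wired: the names `align_operand` and `round_nearest_even` are made up for illustration, and it assumes the larger operand contributes no guard or sticky bits (true here, since 2^52 is not shifted).

```c
#include <assert.h>
#include <stdint.h>

/* Result of aligning the smaller operand (hypothetical helper type). */
struct aligned {
    uint64_t kept;    /* significand bits that survive the right shift */
    unsigned guard;   /* the two bits just below the kept part (0..3) */
    unsigned sticky;  /* 1 if any bit below the guard bits was set */
};

/* Shift `sig` right by `shift` bits, remembering what falls off. */
struct aligned align_operand(uint64_t sig, unsigned shift)
{
    assert(shift >= 2 && shift < 64);
    struct aligned r;
    r.kept   = sig >> shift;
    r.guard  = (unsigned)((sig >> (shift - 2)) & 3u);
    r.sticky = (sig & ((UINT64_C(1) << (shift - 2)) - 1)) != 0;
    return r;
}

/* Round to nearest, ties to even, using two guard bits and a sticky bit. */
uint64_t round_nearest_even(uint64_t sum, unsigned guard, unsigned sticky)
{
    int above_half = guard == 3u || (guard == 2u && sticky);
    int tie        = guard == 2u && !sticky;
    if (above_half || (tie && (sum & 1u)))
        return sum + 1;   /* round up */
    return sum;           /* round down, or tie with an even sum */
}
```

Feeding it the 53-bit significand of 2.500244140625 (0x14008000000000) with a shift of 51 gives kept = 2, guard = 10 (binary), sticky = 1, and rounding 2^52 + 2 with those bits yields 2^52 + 3. The same two functions can also be applied to a 64-bit significand to explore the double extended case.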

double extended

The two numbers are

+  1 1 0100000000 0010000000 0000000000 0000000000 0000000000 0000000000 000
+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 0000000000 000

The first is shifted right by 51 bits

+ 52 0 0000000000 0000000000 0000000000 0000000000 0000000000 1010000000 000 | 10 0
+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 0000000000 000 | 00 0

The sum is computed

+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 1010000000 000 | 10 0

Rounding to nearest (even):

+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 1010000000 000

as 0 | 10 0 is a tie, which is broken toward the nearest even.

When this number is then converted from double extended precision to double precision (due to a fstp QWORD [] ), the rounding is repeated, using bits 52, 53 and 54 of the double extended mantissa as guard and sticky bits:

+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 1010000000 000

+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 10|100

+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 10

because 0|100 is again a tie, broken toward the nearest even.
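The double rounding can usually be reproduced from C without touching assembly. The sketch below assumes that `long double` is the 80-bit x87 format (64-bit significand), as it typically is with GCC on x86 Linux; on platforms where `long double` has a different width the two functions are expected to agree. The function names are hypothetical.

```c
#include <float.h>

/* Plain binary64 addition; the volatile store forces a rounding to double. */
double add_direct(double a, double b)
{
    volatile double sum = a + b;
    return sum;
}

/* Assumption: emulates the x87 path. The sum is first rounded to
 * long double, then rounded a second time when converted to double. */
double add_via_long_double(double a, double b)
{
    volatile long double wide = (long double)a + (long double)b;
    return (double)wide;
}

/* Nonzero when long double has a 64-bit significand (x87 extended). */
int long_double_is_x87_extended(void)
{
    return LDBL_MANT_DIG == 64;
}
```

With a = 0x1.4008p+1 and b = 0x1.0p+52, I would expect `add_direct` to return 4503599627370499 on an SSE build and `add_via_long_double` to return 4503599627370498 wherever `long_double_is_x87_extended()` is true. Printing both with the `%a` format makes the last mantissa digit (3 versus 2) easy to compare.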

¹ See Chapter 8.5.1.2 of the Intel Manual - Volume 1.
² The guard bits are extra precision bits retained after one of the numbers is shifted to make the exponents match. The sticky bit is the OR of the bits less significant than the least significant guard bit. See the "on Rounding" section of this page and Goldberg for a formal approach.

Answer 2

Thanks to all the comments on my question, I understood what happened and was able to solve the issue.

I will try to summarize it here.

First, the incorrect rounding was confirmed. As mentioned by @MarkDickinson, it can be due to a "double rounding", but I do not know whether that can be confirmed. Indeed, it can also be due to other phenomena, such as those described in the publication given by Pascal Cuoq.

It seems that the IA-32 FPU does not comply perfectly with the IEEE 754 standard when it comes to rounding certain numbers.

By default, GCC (32-bit version) generates code that uses the FPU to compute additions on Binary64 numbers.

But on my computer (Intel Core i7), the SSE unit is also able to perform those computations. This unit is used by default by GCC (64-bit version).

Using the following two options on the GCC32 command line solves my problem:

-msse2 -mfpmath=sse

(Thank you, EOF.)
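To check which floating-point path a given build actually uses, something like the following sketch may help. It relies on the standard `FLT_EVAL_METHOD` macro from `<float.h>` and on `__SSE2_MATH__`, which, as far as I know, GCC and Clang predefine when SSE2 is used for floating-point math; treat the latter as an assumption for other compilers. The function names are made up for illustration.

```c
#include <float.h>

/* Reports which unit the compiler targets for double arithmetic.
 * Assumption: __SSE2_MATH__ is predefined by GCC/Clang when
 * -mfpmath=sse (with SSE2) is in effect. */
const char *fp_math_unit(void)
{
#if defined(__SSE2_MATH__)
    return "sse2";
#elif defined(__i386__) || defined(__x86_64__)
    return "x87";
#else
    return "other";
#endif
}

/* 0: operations are evaluated in the type's own precision;
 * 2: operations are evaluated in long double (typical for x87 math). */
int eval_method(void)
{
    return (int)FLT_EVAL_METHOD;
}
```

With a 32-bit GCC and no extra options I would expect "x87" and 2; after adding -msse2 -mfpmath=sse, "sse2" and 0.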

Answer 3

First and foremost, you are looking at base 10 numbers. If you want to talk about floating point, rounding and such, it needs to be a base 2 discussion.

Second, single and double have mantissas of different lengths, so for the same number the place where you round varies. In decimal, 1.2345678 could be rounded to 1.23 or to 1.2346 depending on how many digits we allow: one rounds down, the other rounds up, going with a round-up rule.

Since you are in base 10 at some point here, you are also possibly mixing in compile-time conversions, runtime operations, and runtime conversions.

If I take

float x=1.234567;
x=x*2.34;
printf("%f\n",x);

there are compile-time conversions: first and foremost ASCII to double, then double to float, to be completely accurate to the language (I didn't put F's at the end of the constants). Then there is the runtime multiply, and then a runtime conversion to ASCII. The runtime C library might not be the same one the compiler used at compile time; do they honor the same rounding settings, etc.? It is pretty easy to find numbers where you simply declare x=1.234...something, the next line of code is a printf, and the printf output is not what you fed it, with no floating-point math involved other than the runtime conversion for printing.

So before you can ask this question we need to see the binary versions of your numbers. The answer to your question should almost automatically fall out of that without further help, but if you still need help then post them and we can look at it. Having a decimal-based discussion adds compiler and library issues and makes it harder to isolate the problem, if there is a problem.
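One way to look at the binary form directly is to copy the bits of the double into an integer and split out the fields. This is a minimal sketch assuming `double` is IEEE 754 binary64 (true on x86); the helper names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Raw 64-bit pattern of a double (assumes IEEE 754 binary64). */
uint64_t double_bits(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);   /* well-defined type punning */
    return bits;
}

/* Exponent with the 1023 bias removed (normal numbers only). */
int unbiased_exponent(double x)
{
    return (int)((double_bits(x) >> 52) & 0x7FF) - 1023;
}

/* The 52 explicitly stored mantissa bits. */
uint64_t fraction_field(double x)
{
    return double_bits(x) & ((UINT64_C(1) << 52) - 1);
}
```

For the numbers in the question, `double_bits(2.500244140625)` is 0x4004008000000000 and `unbiased_exponent(4503599627370496.0)` is 52. The two candidate results differ only in the fraction field: 2 for 4503599627370498 and 3 for 4503599627370499. Printing with the `%a` format of printf shows the same information in hexadecimal floating-point notation.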
