
Why is adding floating-point numbers from biggest to smallest less accurate than adding from smallest to biggest?

My Java textbook states that adding from biggest to smallest is less accurate than adding from smallest to biggest when dealing with floating-point numbers. However, it doesn't go on to clearly explain why this is the case.

Floating point has a limited number of digits of precision (roughly 6 decimal digits for float, 15 for double). The calculation

1.0e20d + 1 

gives the result 1.0e20 because there is not enough precision to represent the number

100,000,000,000,000,000,001
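A minimal sketch of this absorption effect in Java. The class and method names are hypothetical; `Math.ulp` is the standard library call that reports the gap between a double and the next representable value, which is a more exact way of stating the "not enough digits" rule of thumb.

```java
public class AbsorptionDemo {
    /** Returns true when adding `small` to `big` leaves `big` unchanged. */
    static boolean isAbsorbed(double big, double small) {
        return big + small == big;
    }

    public static void main(String[] args) {
        // The 1 is smaller than half the gap between neighbouring doubles
        // near 1.0e20, so the sum rounds back to 1.0e20.
        System.out.println(isAbsorbed(1.0e20, 1.0));

        // Spacing between 1.0e20 and the next larger double.
        System.out.println(Math.ulp(1.0e20));
    }
}
```

Anything smaller than half of that spacing disappears when added to 1.0e20 on its own.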

If you start with the largest number, then any numbers more than n orders of magnitude smaller (where n is 6 or 15 depending on type) will not contribute to the sum at all. Start with the smallest and you might sum several smaller numbers into one that will affect the final total.

Where it would make a difference is, for example:

1.0e20 + 1.0e4 + 6.0e4 + 3.0e4

Assuming it's exactly 15 decimal digits of precision (it's not, see the linked article below, but 15 is good enough for the example), if you start with the larger number, none of the others will make a difference because they're too small. If you start with the smaller ones, they add up to 1.0e5, which IS large enough to affect the final total.
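Here is a rough Java sketch of the same idea with real doubles. The small terms are changed to 3000 (an assumption made for this illustration): the spacing between doubles near 1.0e20 is 16384, so 1.0e4 is already more than half a step and would round up rather than vanish. Each 3000 is below half a step and is lost on its own, while their sum of 9000 is not. The class and method names are hypothetical.

```java
public class OrderDemo {
    /** Adds the values strictly left to right, rounding after every step. */
    static double sumInOrder(double... values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        double largestFirst = sumInOrder(1.0e20, 3000.0, 3000.0, 3000.0);
        double smallestFirst = sumInOrder(3000.0, 3000.0, 3000.0, 1.0e20);

        // Each 3000 is rounded away when added to 1.0e20 directly.
        System.out.println(largestFirst == 1.0e20);
        // 9000 is more than half the spacing, so the total moves up one step.
        System.out.println(smallestFirst - largestFirst);
    }
}
```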

Please read "What Every Computer Scientist Should Know About Floating-Point Arithmetic".

An excellent explanation is available in section 4.2 of "Accuracy and Stability of Numerical Algorithms" by Nick Higham. Below is my casual interpretation of this:

The key property of floating point is that when the result of an individual operation cannot be exactly represented, it is rounded to the nearest representable value. This has many consequences, one of which is that addition (and multiplication) is no longer associative.
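The loss of associativity is easy to observe. A small sketch in Java, with hypothetical class and method names, using three ordinary decimal literals:

```java
public class AssociativityDemo {
    /** Computes (a + b) + c, rounding after each addition. */
    static double leftFirst(double a, double b, double c) {
        return (a + b) + c;
    }

    /** Computes a + (b + c), rounding after each addition. */
    static double rightFirst(double a, double b, double c) {
        return a + (b + c);
    }

    public static void main(String[] args) {
        double left = leftFirst(0.1, 0.2, 0.3);
        double right = rightFirst(0.1, 0.2, 0.3);
        // The two groupings round differently, so the results are not equal.
        System.out.println(left + " vs " + right + " equal? " + (left == right));
    }
}
```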

The other main thing to note is that the error (the difference between the true value and the rounded value) is relative. If we use square brackets ( [] ) to denote this rounding operation, then we have the following property for any number x:

|[x] - x| <= ϵ |[x]| / 2

where ϵ is the machine epsilon.

So suppose that we want to sum up [x1, x2, x3, x4]. The obvious way to do it is via

s2 = x1 + x2
s3 = s2 + x3 = x1 + x2 + x3
s4 = s3 + x4 = x1 + x2 + x3 + x4

As noted above, we can't do this exactly, so we're actually doing:

t2 = [x1 + x2]
t3 = [t2 + x3] = [[x1 + x2] + x3]
t4 = [t3 + x4] = [[[x1 + x2] + x3] +x4]

So how big is the resulting error |t4 - s4|? Well, we know that

|t2 - s2| = |[x1+x2] - (x1+x2)| <= ϵ/2 |t2|

Now, by the triangle inequality, we can write

|t3 - s3| =  |[t2+x3] - (t2+x3) + (t2+x3) - (s2+x3)| 
          <= |[t2+x3] - (t2+x3)| + |t2 - s2|
          <= ϵ/2 (|t3| + |t2|)

And again:

|t4 - s4| =  |[t3+x4] - (t3+x4) + (t3+x4) - (s3+x4)| 
          <= |[t3+x4] - (t3+x4)| + |t3 - s3|
          <= ϵ/2 (|t4| + |t3| + |t2|)
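The bound can be checked numerically. The following is a sketch, not a proof: it assumes ϵ is taken as `Math.ulp(1.0)` (2^-52 for double) and uses `BigDecimal`, which converts a double exactly and adds without rounding, to obtain the true sum. The class and method names are hypothetical.

```java
import java.math.BigDecimal;

public class SummationErrorBound {
    /**
     * Returns { |t_n - s_n|, (eps / 2) * (|t_2| + ... + |t_n|) } for
     * left-to-right summation of x, both computed exactly.
     */
    static BigDecimal[] errorAndBound(double[] x) {
        BigDecimal halfEps = new BigDecimal(Math.ulp(1.0) / 2); // eps / 2, exact
        BigDecimal exact = new BigDecimal(x[0]);   // s_i, the true partial sum
        BigDecimal sumAbsT = BigDecimal.ZERO;      // |t_2| + ... + |t_i|
        double t = x[0];                           // t_i, the rounded partial sum

        for (int i = 1; i < x.length; i++) {
            t += x[i];
            exact = exact.add(new BigDecimal(x[i]));
            sumAbsT = sumAbsT.add(new BigDecimal(t).abs());
        }

        BigDecimal error = new BigDecimal(t).subtract(exact).abs();
        BigDecimal bound = sumAbsT.multiply(halfEps);
        return new BigDecimal[] { error, bound };
    }

    public static void main(String[] args) {
        BigDecimal[] r = errorAndBound(new double[] { 0.1, 0.2, 0.3, 0.4 });
        System.out.println("error = " + r[0]);
        System.out.println("bound = " + r[1]);
        System.out.println("within bound: " + (r[0].compareTo(r[1]) <= 0));
    }
}
```

Here "true sum" means the exact sum of the double values passed in, not of the decimal literals they were written as.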

This leads to Higham's general advice:

In designing or choosing a summation method to achieve high accuracy, the aim should be to minimize the absolute values of the intermediate sums ti.

So if you're doing sequential summation (like we did above), then you want to start with the smallest elements, as that will give you the smallest intermediate sums.
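One way this could look in Java, as a sketch with hypothetical names: sort a copy of the input by increasing magnitude, then sum sequentially. Sorting by absolute value is an assumption that suits same-sign data; with mixed signs the ordering that minimizes the intermediate sums may be different.

```java
import java.util.Arrays;

public class AscendingSum {
    /** Sums the values in the order given. */
    static double naiveSum(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum;
    }

    /** Sorts a copy by increasing magnitude, then sums sequentially. */
    static double sumSmallestFirst(double[] values) {
        double[] sorted = Arrays.stream(values)
                .boxed()
                .sorted((a, b) -> Double.compare(Math.abs(a), Math.abs(b)))
                .mapToDouble(Double::doubleValue)
                .toArray();
        return naiveSum(sorted);
    }

    public static void main(String[] args) {
        // One large value followed by ten values too small to register
        // when added to it one at a time.
        double[] data = new double[11];
        Arrays.fill(data, 1.0e-16);
        data[0] = 1.0;

        System.out.println("as given:       " + naiveSum(data));
        System.out.println("smallest first: " + sumSmallestFirst(data));
    }
}
```

The sort costs O(n log n), so this is only worthwhile when the accuracy gain matters more than the extra time.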

But that is not the only option! There is also pairwise summation, where you add up pairs in a tree form (e.g. [[x1 + x2] + [x3 + x4]]), though this requires allocating a work array. You can also utilise SIMD vectorisation, by storing the intermediate sum in a vector, which can give both speed and accuracy improvements.
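A minimal sketch of pairwise summation in Java, with hypothetical names. This variant is recursive, so it uses the call stack rather than an explicit work array; production implementations typically switch to a plain loop below some block size, which is omitted here.

```java
public class PairwiseSum {
    /** Sums the whole array by recursively splitting it in half. */
    static double pairwiseSum(double[] values) {
        if (values.length == 0) {
            return 0.0;
        }
        return pairwiseSum(values, 0, values.length);
    }

    /** Sums values[lo..hi), where hi is exclusive and hi > lo. */
    private static double pairwiseSum(double[] values, int lo, int hi) {
        if (hi - lo == 1) {
            return values[lo];
        }
        int mid = lo + (hi - lo) / 2;
        // Each half is summed independently, then the two results are added.
        return pairwiseSum(values, lo, mid) + pairwiseSum(values, mid, hi);
    }

    public static void main(String[] args) {
        double[] data = { 1.0, 1.0e-16, 1.0e-16, 1.0e-16 };
        // Left to right, every 1.0e-16 is absorbed by the running 1.0.
        // Pairwise, the last two are combined first and survive the final add.
        System.out.println(pairwiseSum(data));
    }
}
```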
