
Can “floor division” of floating-point numbers (e.g. in Python) cause inaccuracy?

Original post 2018-05-12 15:49:26 8 1 python/ floating-point/ language-agnostic/ rounding/ floating-accuracy

Guido van Rossum has written a blog post explaining why, in Python, integer division (for example, a // b) is "floor division": the quotient is rounded towards negative infinity. Correspondingly, the sign of a % b matches the sign of b.

This differs from C, where the quotient is rounded towards zero and the result of a % b has the sign of a.

Python also uses floor division, and the corresponding "sign-of-modulo matches sign-of-divisor" rule, for floating-point numbers. The blog post claims that this can be inaccurate in certain cases (where C's "sign-of-modulo matches sign-of-dividend" would be accurate). Is this true? Are there any concrete examples?
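For reference, the two sign conventions can be observed directly in Python. This is only a minimal sketch: math.fmod and math.trunc stand in for C's truncating behaviour, and the operand values and variable names are arbitrary illustrations, not taken from the blog post.

```python
import math

# Python's // rounds the quotient toward negative infinity, and % takes
# the sign of the divisor.
floor_q, floor_r = -7 // 2, -7 % 2
print(floor_q, floor_r)      # -4 1

# C-style: the quotient is truncated toward zero, and the remainder takes
# the sign of the dividend. math.fmod follows this convention.
trunc_q, trunc_r = math.trunc(-7 / 2), math.fmod(-7, 2)
print(trunc_q, trunc_r)      # -3 -1.0

# Floating-point operands follow the same two conventions.
float_floor = (-7.5 // 2.0, -7.5 % 2.0)
float_trunc = math.fmod(-7.5, 2.0)
print(float_floor, float_trunc)   # (-4.0, 0.5) -1.5
```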

1 Answer

Introduction

The following proof is longer than I want it to be, but this question has gone several days without being answered, and it deserves an answer.

Before going into the proof, let me address this intuitively. If we define modulo to return a result for x % y that is in [−y/2, +y/2] (for positive y), then the result is always either x itself or x reduced by adding (positive or negative) multiples of y. If the result is x, it is representable since x is given in a representable form. If the result is reduced, then it is necessarily a multiple of the position value of the low digit in y, and its greatest digit position is no greater than the greatest digit position in y, and hence it fits in the floating-point format and is representable.

On the other hand, if we define modulo to return a result for x % y that is in [0, y), then a small negative x must be increased by adding y. When x is small, it may have digits in lower positions than y, and, when it does, the result of adding y must have a non-zero digit in the lowest position that x does, but it must also have a non-zero digit in a higher position than the small x does (because y is adding a digit in a higher position). Therefore, the result needs more digits than fit in the floating-point format, and the result is not representable.

A simple example is −2⁻⁶⁰ % 1. The mathematical result is 1 − 2⁻⁶⁰, but this cannot be represented with just 53 bits in the significand; it needs bits with position values from 2⁻¹ to 2⁻⁶⁰, which requires 60 bits.
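This example can be tried in Python. The sketch below assumes Python floats are IEEE 754 binary64 (53-bit significand), and the variable names are illustrative. It compares the exact rational result with what the floored % operator and the C-style math.fmod return.

```python
import math
from fractions import Fraction

x = -(2.0 ** -60)

# Exact rational arithmetic: the mathematical result of x % 1 in [0, 1).
exact = Fraction(x) % 1
print(exact == 1 - Fraction(1, 2**60))   # True

# Python's floored modulo has to round 1 - 2**-60 to a float, and the
# nearest float is 1.0, which is not even inside [0, 1).
python_mod = x % 1.0
print(python_mod)                        # 1.0

# The sign-of-dividend remainder leaves x unchanged, so it is exact.
c_style = math.fmod(x, 1.0)
print(c_style == x)                      # True
```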

Symmetric Modulo Is Exact

First, let's see that symmetric modulo, defined so that x % y is in [−y/2, +y/2] for positive y, always has a representable result. I will also assume x is positive, but the arguments for negative x and/or negative y are symmetric, and the results for x = 0 are trivial.

x % y is defined to be r such that r = x − q•y for some integer q, and typically we define some constraints on r or q so that r is uniquely determined (or at least usually uniquely determined, with some flexibility when the result is at an endpoint of some interval). Since q is an integer, if both x and y are integer multiples of some number g (which might or might not be an integer), then r is also an integer multiple of g.

In a floating-point format, a number is represented using a sign, a base b (which is an integer greater than 1), a fixed number p of base-b digits, and an exponent e. The number represented is ± digits × b^e. Let's write the individual digits as d_{p−1} d_{p−2} d_{p−3} … d₂ d₁ d₀.
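To make the "± digits × b^e" form concrete, here is a small sketch that splits a Python float into an integer digit string and an exponent. It assumes b = 2 and p = 53 (IEEE 754 binary64); the helper name decompose and the constant P are hypothetical, and the sign is folded into the returned integer.

```python
import math
from fractions import Fraction

P = 53  # assumed number of base-2 digits in the significand (binary64)

def decompose(value):
    """Return (digits, e) such that value == digits * 2**e and |digits| < 2**P."""
    m, exp = math.frexp(value)        # value == m * 2**exp with 0.5 <= |m| < 1
    digits = int(math.ldexp(m, P))    # scaling by 2**P is exact and yields an integer
    return digits, exp - P

digits, e = decompose(0.1)
# Check the decomposition with exact rational arithmetic.
print(Fraction(digits) * Fraction(2) ** e == Fraction(0.1))
```

With this form, "x is a multiple of b^e" simply means that x can be written with that exponent and an integer digit string.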

Consider the inputs x and y. Using x_i to denote the base-b digits used in representing x, and e_x for the exponent used in representing x, and similarly for y, we have x = x_{p−1} … x₀ × b^{e_x} and y = y_{p−1} … y₀ × b^{e_y}.

Observe that both x and y are multiples of the lesser of b^{e_x} and b^{e_y}, and so r must be too.

If b^{e_y} ≤ b^{e_x}, then r is a multiple of b^{e_y}. Also, |r| is necessarily less than y. This implies we can represent r as ± r_{p−1} … r₀ × b^{e_y}: r is small enough that these digits with the exponent e_y are large enough to represent its value, and, because it is a multiple of b^{e_y}, it does not need any digits with a smaller exponent. Thus, r is representable in the floating-point format.

Now consider b^{e_x} < b^{e_y}. Also suppose that y is normalized, by which we mean that its leading digit, y_{p−1}, is not zero. (If it is zero, find a normalized representation of y by decreasing its exponent to shift a non-zero digit into the leading position. Then the above paragraph applies. If y has no non-zero digits, it is zero, and x % y is not defined.) Then x < y. In this case, r is either x or x − y, because one of these two is in [−y/2, +y/2]. If r is x, then it is representable since x is representable. If r is x − y, then x ≥ ½y, and |r| ≤ x. Since r is a multiple of b^{e_x} and |r| ≤ x, we must be able to represent r as ± r_{p−1} … r₀ × b^{e_x}.
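The claim can be spot-checked empirically. The sketch below assumes Python 3.7 or later, where math.remainder provides an IEEE 754-style symmetric remainder, and compares it against exact rational arithmetic on random inputs. The helper names, the seed, and the sample ranges are arbitrary choices, and a random test is of course not a substitute for the proof.

```python
import math
import random
from fractions import Fraction

def exact_symmetric_mod(x, y):
    """Exact x - q*y, where q is x/y rounded to the nearest integer (ties to even)."""
    fx, fy = Fraction(x), Fraction(y)
    q = round(fx / fy)            # round() on a Fraction rounds halves to even
    return fx - q * fy

def remainder_is_exact(x, y):
    """True if math.remainder(x, y) equals the exact symmetric remainder."""
    return Fraction(math.remainder(x, y)) == exact_symmetric_mod(x, y)

rng = random.Random(0)
samples = [(rng.uniform(-1e6, 1e6), rng.uniform(1e-3, 1e3)) for _ in range(1000)]
all_exact = all(remainder_is_exact(x, y) for x, y in samples)
print(all_exact)
```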

Asymmetric Modulo May Be Inexact

The above proof tells us that symmetric modulo is exact because the result is always either the unchanged x or x reduced in magnitude sufficiently that all the required digits fit in the floating-point format. And this tells us how to break modulo defined such that x % y is in [0, y): Select an x that must be increased in magnitude.

We have y = y_{p−1} … y₀ × b^{e_y}. Let y be normalized, as described above. For x, select any value that is negative, has e_x < e_y, and is not a multiple of b^{e_y} (meaning that at least one of its digits from x_{e_y−1−e_x} to x₀ is not zero). The mathematical result of x % y is then x + y. In some cases where the leading digit of y is 1 and a borrow from it occurs, the result may be representable. Otherwise, the greatest digit position the result needs is the same as y's greatest digit position, and it needs digits below b^{e_y}, and therefore it is not representable.
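As an illustration of this construction, the following sketch tests whether the exact floored result fits in a Python float. It assumes binary64 floats, takes y = 1.0 (whose leading digit is 1, so the borrow case can occur), and uses x = −2⁻ᵏ for a few values of k; the helper names are hypothetical.

```python
from fractions import Fraction

def is_representable(value):
    """True if the exact rational `value` survives a round trip through float."""
    return Fraction(float(value)) == value

def floor_mod_exact(x, y):
    """Exact mathematical x % y in [0, y) for floats x and positive y."""
    return Fraction(x) % Fraction(y)

results = {}
for k in (2, 53, 54, 60):
    x = -(2.0 ** -k)
    results[k] = is_representable(floor_mod_exact(x, 1.0))
    print(k, results[k], x % 1.0)
```

For k up to 53 the borrow from the leading 1 of y shortens the result enough to fit in 53 bits; for k = 54 and k = 60 the exact result 1 − 2⁻ᵏ needs more than 53 bits, so Python's x % 1.0 has to round.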