What is the correct algorithm to perform double-float division?

I'm following the algorithms provided by this paper by Andrew Thall, which describes algorithms for performing math using the df64 data type, a pair of 32-bit floating point numbers used to emulate the precision of a 64-bit floating point number. However, there appear to be some inconsistencies (mistakes?) in how the Division and Square Root functions are written.

This is how the Division function is written in the paper:

float2 df64_div(float2 B, float2 A) {
    float xn = 1.0f / A.x;
    float yn = B.x * xn;
    float diff = (df64_diff(B, df64_mult(A, yn))).x;
    float2 prod = twoProd(xn, diffTerm);

    return df64_add(yn, prodTerm);
}

The language used to write this code appears to be Cg, for reference, although you should be able to interpret this code in C++ if you treat float2 as though it's merely an alias for struct float2 { float x, y; };, with some extra syntax to support arithmetic operations directly on the type.
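
For example, a minimal C++ stand-in (my own assumption for the port, covering only the members and construction used in this question) could be:

// Minimal C++ stand-in for Cg's float2, covering only what this question uses.
struct float2 {
    float x, y;
    float2() : x(0.0f), y(0.0f) {}
    float2(float x_, float y_ = 0.0f) : x(x_), y(y_) {}
};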

For reference, these are the headers of the functions being used in this code:

float2 df64_add(float2 a, float2 b);
float2 df64_mult(float2 a, float2 b);
float2 df64_diff(/*Not provided...*/);
float2 twoProd(float a, float b);

So a couple of problems immediately stand out:

  • diffTerm and prodTerm are never defined. There are two variables, diff and prod, which are defined, but it's not certain that these are the terms that were intended in this code.
  • No declaration of df64_diff is provided. Presumably this is meant to support subtraction; but again, this is not clear.
  • df64_mult is a function that does not accept a 32-bit float as an argument; it only supports two pairs of 32-bit floats as arguments. It is not clear how the paper expects this function call to compile.
  • The same goes for df64_add, which also only accepts pairs of 32-bit floats as arguments, but is here invoked with the first argument being only a single 32-bit float.

I'm making an educated guess that this is a correct implementation of this code, but because even a correct implementation of this function has unavoidable errors in the computation, I can't tell if it's correct, even if it gives values that "seem" correct:

float2 df64_div(float2 B, float2 A) {
    float xn = 1.0f / A.x;
    float yn = B.x * xn;
    float diff = (df64_diff(B, df64_mult(A, float2(yn, 0)))).x;
    float2 prod = twoProd(xn, diff);

    return df64_add(float2(yn, 0), prod);
}

float2 df64_diff(float2 a, float2 b) {
    return df64_add(a, float2(-b.x, -b.y));
}

So my question is this: is the written implementation of this algorithm as seen in the paper accurate (because it depends on behavior of the Cg language that I'm not aware of), or isn't it? And regardless, is my interpretation of that code a correct implementation of the division algorithm described in the paper?

Note: my target language is C++, so although the differences between the languages (for this kind of algorithm) are minor, my code is written in C++ and I'm looking for correctness for the C++ language.

Reviewing the pseudocode algorithm as written in the paper appears to support the C++ implementation of this algorithm, although my unfamiliarity with Cg means I can't prove that that implementation is correct for Cg.

[Image: the division algorithm as described in the paper]

So breaking down these steps into plain English:

  1. The function takes two parameters, each of which is a [pseudo-]double precision floating point value, where the second parameter is not equal to 0
  2. The variable xn is assigned the arithmetic reciprocal of the higher order component of the [pseudo-]double divisor, calculated using single precision floating point math
  3. The variable yn is assigned the product of the higher order component of the [pseudo-]double dividend and xn, calculated using single precision floating point math
  4. The product of the [pseudo-]double Divisor and yn is calculated
    • This is the first tricky part, because the paper doesn't describe an algorithm for [pseudo-]double × single multiplication. The Cg code clearly maps to this step 1-to-1, but the Cg rules for promoting a scalar value to a vector value are unknown.
    • What we can say, however, is that we do have a function for multiplying a double by a double, and a single can be promoted to a double by padding its lower order component with 0, so we can do that.
  5. The difference between the Dividend and the product calculated in step 4 is calculated, and only the higher order component is kept as a single-precision floating point value
    • What makes this tricky is that the paper doesn't describe an algorithm for subtraction. However, it does describe an algorithm for converting an [IEEE754-]double into a [pseudo-]double, and an observation we can make is that negative [IEEE754-]doubles, when converted, have negative values for both their higher order and lower order components. So logically, a [pseudo-]double can be negated by simply negating both of its components. And adding a negated number is mathematically equivalent to subtracting it, so we can build a subtraction algorithm using this knowledge.
  6. The product of xn and the result of step 5 is computed, preserving the extended precision that would otherwise be lost in a single × single multiplication.
    • The twoProd function exists for exactly this purpose (a sketch of one common formulation of twoProd follows this list).
  7. The sum of step 6 and yn is calculated
    • Again, we can use the [pseudo-]double addition algorithm if we simply promote yn to a [pseudo-]double by padding the lower order component with 0
  8. The result of step 7 is the returned value
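
For reference, twoProd itself is not reproduced in this question. On hardware that exposes a fused multiply-add, one common formulation is sketched below; this is not taken from the paper, and it assumes the float2 stand-in shown earlier:

#include <cmath>   // for fmaf

// Exact product of two floats: the returned pair (hi, lo) satisfies
// hi + lo == a * b exactly, barring overflow/underflow.
float2 twoProd(float a, float b) {
    float2 r;
    r.x = a * b;              // rounded product (high part)
    r.y = fmaf(a, b, -r.x);   // rounding error of that product (low part)
    return r;
}

Whether this matches the paper's version depends on whether the target hardware has FMA; the only property the algorithm relies on is that the returned pair represents the product exactly.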

So understanding this algorithm we can map each of these steps directly to the C++ algorithm I wrote:

//(1) Takes two [pseudo-]doubles, returns a [pseudo-]double
float2 df64_div(float2 B, float2 A) {
    //(2) single float divided by single float
    float xn = 1.0f / A.x;
    // (3) single float multiplied by single float
    float yn = B.x * xn;
    //                        (4) double x double multiplication
    //                                       (4a) yn promoted to [pseudo-]double
    //            (5) subtraction                           (5a) only higher order component kept
    float diff = (df64_diff(B, df64_mult(A, float2(yn, 0)))).x;
    // (6) single x single multiplication with extra precision preserved using twoProd
    float2 prod = twoProd(xn, diff);
    // (7) adding higher-order division to lower order division
    //              (7a) yn promoted to [pseudo-]double
    // (8) value is returned
    return df64_add(float2(yn, 0), prod);
}

float2 df64_diff(float2 a, float2 b) {
    //                 (5a) negating both components is a logical negation of the whole number
    return df64_add(a, float2(-b.x, -b.y));
}

From this, we can conclude that this is a correct implementation of the algorithm described in this paper, upheld by some testing I've done to validate that performing these operations in this manner yields results that appear to be correct.

Xirema's answer provides a faithful rendering of Thall's high-radix long-hand division algorithm into C++. Based on fairly extensive testing against a higher-precision reference, I found its maximum relative error to be on the order of 2^-45, provided there are no underflows in intermediate computation.

On platforms that provide a fused multiply-add operation (FMA), the following Newton-Raphson-based division algorithm due to Nagai et al. is likely to be more efficient and achieves identical accuracy in my testing, that is, a maximum relative error of 2^-45.

/*
  T. Nagai, H. Yoshida, H. Kuroda, Y. Kanada, "Fast Quadruple Precision 
  Arithmetic Library on Parallel Computer SR11000/J2." In: Proceedings 
  of the 8th International Conference on Computational Science, ICCS '08, 
  Part I, pp. 446-455.
*/
float2 div_df64 (float2 a, float2 b)
{
    float2 t, c;
    float r, s;
    r = 1.0f / b.x;
    t.x = a.x * r;
    s = fmaf (-b.x, t.x, a.x);
    t.x = fmaf (r, s, t.x);
    t.y = fmaf (-b.x, t.x, a.x);
    t.y = a.y + t.y;
    t.y = fmaf (-b.y, t.x, t.y);
    s = r * t.y;
    t.y = fmaf (-b.x, s, t.y);
    t.y = fmaf (r, t.y, s);
    c.x = t.x + t.y;
    c.y = (t.x - c.x) + t.y;
    return c;
}
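
As a rough sanity check (my own sketch, not from the Nagai et al. paper), div_df64 can be compared against plain double division. This assumes the float2 stand-in and the div_df64 listing above are compiled in the same translation unit; the double_to_df64 and df64_to_double helpers are hypothetical names introduced here:

#include <cmath>
#include <cstdio>

// Hypothetical helper: split a double into a high/low pair of floats.
static float2 double_to_df64(double d) {
    float hi = (float)d;          // d rounded to single precision
    float lo = (float)(d - hi);   // remaining error, which also fits in a float
    return float2(hi, lo);
}

// Hypothetical helper: collapse the pair back into a double (loses any
// precision beyond double, which is fine for a rough check).
static double df64_to_double(float2 a) {
    return (double)a.x + (double)a.y;
}

int main() {
    double a = 355.0, b = 113.0;
    float2 q = div_df64(double_to_df64(a), double_to_df64(b));
    double ref = a / b;
    double err = fabs(df64_to_double(q) - ref) / fabs(ref);
    printf("df64 quotient: %.17g  double quotient: %.17g  rel. error: %.3g\n",
           df64_to_double(q), ref, err);
    return 0;
}

Because the reference here is itself only double precision, this only confirms agreement to roughly double precision; verifying the 2^-45 bound properly requires a higher-precision reference (e.g. long double or an arbitrary-precision library), as in the testing mentioned above.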
