What is the correct algorithm to perform double-float division?
I'm following the algorithms provided by this paper by Andrew Thall, which describes algorithms for performing math using the df64 data type: a pair of 32-bit floating-point numbers used to emulate the precision of a 64-bit floating-point number. However, there appear to be some inconsistencies (mistakes?) in how the Division and Square Root functions are written.

This is how the Division function is written in the paper:
float2 df64_div(float2 B, float2 A) {
    float xn = 1.0f / A.x;
    float yn = B.x * xn;
    float diff = (df64_diff(B, df64_mult(A, yn))).x;
    float2 prod = twoProd(xn, diffTerm);
    return df64_add(yn, prodTerm);
}
The language used to write this code appears to be Cg, for reference, although you should be able to interpret it as C++ if you treat float2 as merely an alias for struct float2 { float x, y; };, with some extra syntax to support arithmetic operations directly on the type.

For reference, these are the headers of the functions used in this code:
float2 df64_add(float2 a, float2 b);
float2 df64_mult(float2 a, float2 b);
float2 df64_diff(/*Not provided...*/);
float2 twoProd(float a, float b);
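For context, twoProd is the standard error-free product transformation. The paper's implementation isn't quoted here, so the following is only a sketch of how it is commonly implemented on hardware with a fused multiply-add (my assumption, not code from the paper):

```cpp
#include <cassert>
#include <cmath>

struct float2 { float x, y; };

// Error-free product: p.x + p.y == a * b exactly (barring underflow),
// because fmaf(a, b, -p.x) computes the rounding error of a * b with a
// single rounding. Assumed implementation, not taken from the paper.
float2 twoProd(float a, float b) {
    float2 p;
    p.x = a * b;
    p.y = fmaf(a, b, -p.x);
    return p;
}
```

Without an FMA, the same result is traditionally obtained with Dekker's splitting technique, which is what Thall's paper targets on older GPUs.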
So a couple of problems immediately stand out:

- diffTerm and prodTerm are never defined. Two variables, diff and prod, are defined, but it's not certain that these are the terms intended in this code.
- df64_diff is never provided. Presumably it is meant to support subtraction.
- df64_mult is a function that does not accept a 32-bit float as an argument; it only supports pairs of 32-bit floats as arguments.
- The same goes for df64_add, which also only accepts pairs of 32-bit floats as arguments, but here it is invoked with the first argument being a single 32-bit float.

The following is my educated guess at a correct implementation of this code, but because even a correct implementation of this function has unavoidable error in the computation, I can't tell whether it's correct, even if it gives values that "seem" correct:
float2 df64_div(float2 B, float2 A) {
    float xn = 1.0f / A.x;
    float yn = B.x * xn;
    float diff = (df64_diff(B, df64_mult(A, float2(yn, 0)))).x;
    float2 prod = twoProd(xn, diff);
    return df64_add(float2(yn, 0), prod);
}

float2 df64_diff(float2 a, float2 b) {
    return df64_add(a, float2(-b.x, -b.y));
}
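Since only the signature of df64_add is given, here is a sketch of the standard two-sum based addition it presumably corresponds to. This is my reconstruction from the usual double-float literature, not Thall's exact code:

```cpp
#include <cassert>
#include <cmath>

struct float2 { float x, y; };

// Knuth's TwoSum: s + e == a + b exactly, with no precondition on |a|, |b|.
float2 twoSum(float a, float b) {
    float s = a + b;
    float v = s - a;
    float e = (a - (s - v)) + (b - v);
    return {s, e};
}

// Sketch of df64_add: exact sum of the high parts, low parts folded in,
// then a quickTwoSum renormalization (which assumes |s.x| >= |s.y|).
float2 df64_add(float2 a, float2 b) {
    float2 s = twoSum(a.x, b.x);
    s.y += a.y + b.y;
    float hi = s.x + s.y;
    float lo = s.y - (hi - s.x);
    return {hi, lo};
}
```

Note this is the "sloppy" variant with one renormalization pass; it matters for the error bound but not for the structure of df64_div above.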
So my question is this: is the written implementation of this algorithm as seen in the paper accurate (perhaps because it depends on behavior of the Cg language that I'm not aware of), or isn't it? And regardless, is my interpretation of that code a correct implementation of the division algorithm described in the paper?

Note: my target language is C++, so although the differences between the languages (for this kind of algorithm) are minor, my code is written in C++ and I'm looking for correctness in C++.
Reviewing the pseudocode algorithm as written in the paper appears to support the C++ implementation of this algorithm, although my unfamiliarity with Cg means I can't prove that the paper's version is correct for Cg.

Breaking the steps down into plain English (numbered in the comments below): the only step requiring care is the single-float multiplication, where the extra precision of the product must be preserved; the twoProd function exists for exactly this purpose. With that understanding, we can map each step directly onto the C++ algorithm I wrote:
//(1) Takes two [pseudo-]doubles, returns a [pseudo-]double
float2 df64_div(float2 B, float2 A) {
    //(2) single float divided by single float
    float xn = 1.0f / A.x;
    //(3) single float multiplied by single float
    float yn = B.x * xn;
    //(4) double x double multiplication
    //(4a) yn promoted to [pseudo-]double
    //(5) subtraction; (5a) only the higher-order component is kept
    float diff = (df64_diff(B, df64_mult(A, float2(yn, 0)))).x;
    //(6) single x single multiplication with extra precision preserved using twoProd
    float2 prod = twoProd(xn, diff);
    //(7) adding the higher-order quotient to the lower-order correction
    //(7a) yn promoted to [pseudo-]double
    //(8) value is returned
    return df64_add(float2(yn, 0), prod);
}

float2 df64_diff(float2 a, float2 b) {
    //(5a) negating both components is a logical negation of the whole number
    return df64_add(a, float2(-b.x, -b.y));
}
From this, we can conclude that this is a correct implementation of the algorithm described in the paper, upheld by some testing I've done to validate that performing these operations in this manner yields results that appear to be correct.
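The kind of validation I mean can be sketched as follows. The helper names df64_from_double and df64_to_double are my own, used only for testing; the idea is simply to round-trip reference values through the pair representation and compare against double arithmetic:

```cpp
#include <cassert>
#include <cmath>

struct float2 { float x, y; };

// Split a double into a high/low float pair. The subtraction d - hi is
// exact in double because hi is the nearest float to d.
float2 df64_from_double(double d) {
    float hi = (float)d;
    float lo = (float)(d - (double)hi);
    return {hi, lo};
}

// Widen a float2 back to double for comparison against a reference.
double df64_to_double(float2 v) {
    return (double)v.x + (double)v.y;
}
```

This round trip is accurate to roughly 2^-49 relative error, well below the precision of the df64 operations themselves, so it's tight enough to detect genuine errors in df64_div.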
Xirema's answer provides a faithful rendering of Thall's high-radix long-hand division algorithm into C++. Based on fairly extensive testing against a higher-precision reference, I found its maximum relative error to be on the order of 2^-45, provided there are no underflows in intermediate computation.

On platforms that provide a fused multiply-add operation (FMA), the following Newton-Raphson-based division algorithm due to Nagai et al. is likely to be more efficient, and it achieved identical accuracy in my testing, that is, a maximum relative error of 2^-45.
/*
T. Nagai, H. Yoshida, H. Kuroda, Y. Kanada, "Fast Quadruple Precision
Arithmetic Library on Parallel Computer SR11000/J2." In: Proceedings
of the 8th International Conference on Computational Science, ICCS '08,
Part I, pp. 446-455.
*/
float2 div_df64 (float2 a, float2 b)
{
    float2 t, c;
    float r, s;
    r = 1.0f / b.x;
    t.x = a.x * r;
    s = fmaf (-b.x, t.x, a.x);
    t.x = fmaf (r, s, t.x);
    t.y = fmaf (-b.x, t.x, a.x);
    t.y = a.y + t.y;
    t.y = fmaf (-b.y, t.x, t.y);
    s = r * t.y;
    t.y = fmaf (-b.x, s, t.y);
    t.y = fmaf (r, t.y, s);
    c.x = t.x + t.y;
    c.y = (t.x - c.x) + t.y;
    return c;
}
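As a quick sanity check, the routine can be exercised like this. This is a sketch assuming IEEE-754 floats with a correctly rounded fmaf; the to_double helper is mine, not part of the cited library:

```cpp
#include <cassert>
#include <cmath>

struct float2 { float x, y; };

// Nagai et al. division, as given above.
float2 div_df64 (float2 a, float2 b)
{
    float2 t, c;
    float r, s;
    r = 1.0f / b.x;
    t.x = a.x * r;
    s = fmaf (-b.x, t.x, a.x);
    t.x = fmaf (r, s, t.x);
    t.y = fmaf (-b.x, t.x, a.x);
    t.y = a.y + t.y;
    t.y = fmaf (-b.y, t.x, t.y);
    s = r * t.y;
    t.y = fmaf (-b.x, s, t.y);
    t.y = fmaf (r, t.y, s);
    c.x = t.x + t.y;
    c.y = (t.x - c.x) + t.y;
    return c;
}

// Test helper (mine): widen a float2 back to double for comparison.
double to_double(float2 v) { return (double)v.x + (double)v.y; }
```

Dividing 1 by 3 this way should agree with double division to well within 2^-40 relative error, consistent with the 2^-45 bound observed in testing.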