Efficient floating-point division with constant integer divisors

A recent question, whether compilers are allowed to replace floating-point division with floating-point multiplication, inspired me to ask this question.

Under the stringent requirement that the results after code transformation be bit-wise identical to those of the actual division operation, it is easy to see that for binary IEEE-754 arithmetic this is possible for divisors that are a power of two. As long as the reciprocal of the divisor is representable, multiplying by the reciprocal of the divisor delivers results identical to the division. For example, multiplication by 0.5 can replace division by 2.0.
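As a rough illustration of the power-of-two case, the C sketch below checks whether a divisor qualifies for the plain reciprocal replacement. The helper name reciprocal_is_exact is hypothetical (it is not part of any library), and the sketch assumes float is IEEE-754 binary32.

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical helper: returns true when 1.0f / y is exactly representable,
   i.e. y is a finite, nonzero power of two whose reciprocal does not
   overflow. In that case x * (1.0f / y) and x / y are both the correctly
   rounded value of the same real number. */
bool reciprocal_is_exact(float y)
{
    int exponent;
    if (y == 0.0f || !isfinite(y)) {
        return false;
    }
    /* frexpf returns a significand in [0.5, 1); powers of two give 0.5 */
    if (fabsf(frexpf(y, &exponent)) != 0.5f) {
        return false;
    }
    /* the smallest subnormal powers of two have a reciprocal that overflows */
    return isfinite(1.0f / y);
}
```

A compiler pass could call such a predicate on the constant divisor and fall back to the real division whenever it returns false.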

One then wonders for what other divisors such replacements work, assuming we allow any short instruction sequence that replaces division but runs significantly faster, while delivering bit-identical results. In particular, allow fused multiply-add operations in addition to plain multiplication. In comments I pointed to the following relevant paper:

Nicolas Brisebarre, Jean-Michel Muller, and Saurabh Kumar Raina. Accelerating correctly rounded floating-point division when the divisor is known in advance. IEEE Transactions on Computers, Vol. 53, No. 8, August 2004, pp. 1069-1072.

The technique advocated by the authors of the paper precomputes the reciprocal of the divisor y as a normalized head-tail pair z_h:z_l as follows: z_h = 1 / y, z_l = fma (-y, z_h, 1) / y. Later, the division q = x / y is computed as q = fma (z_h, x, z_l * x). The paper derives various conditions that divisor y must satisfy for this algorithm to work. As one readily observes, this algorithm has problems with infinities and zero when the signs of head and tail differ. More importantly, it will fail to deliver correct results for dividends x that are very small in magnitude, because computation of the quotient tail, z_l * x, suffers from underflow.

The paper also makes a passing reference to an alternative FMA-based division algorithm, pioneered by Peter Markstein when he was at IBM. The relevant reference is:

P. W. Markstein. Computation of elementary functions on the IBM RISC System/6000 processor. IBM Journal of Research & Development, Vol. 34, No. 1, January 1990, pp. 111-119.

In Markstein's algorithm, one first computes a reciprocal rc, from which an initial quotient q = x * rc is formed. Then, the remainder of the division is computed accurately with an FMA as r = fma (-y, q, x), and an improved, more accurate quotient is finally computed as q = fma (r, rc, q).

This algorithm also has issues for x that are zeroes or infinities (easily worked around with appropriate conditional execution), but exhaustive testing using IEEE-754 single-precision float data shows that it delivers the correct quotient across all possible dividends x for many divisors y, among these many small integers. This C code implements it:

/* precompute reciprocal */
rc = 1.0f / y;

/* compute quotient q=x/y */
q = x * rc;
if ((x != 0) && (!isinf(x))) {
    r = fmaf (-y, q, x);
    q = fmaf (r, rc, q);
}

On most processor architectures, this should translate into a branchless sequence of instructions, using either predication, conditional moves, or select-type instructions. To give a concrete example: For division by 3.0f, the nvcc compiler of CUDA 7.5 generates the following machine code for a Kepler-class GPU:

    LDG.E R5, [R2];                        // load x
    FSETP.NEU.AND P0, PT, |R5|, +INF , PT; // pred0 = fabsf(x) != INF
    FMUL32I R2, R5, 0.3333333432674408;    // q = x * (1.0f/3.0f)
    FSETP.NEU.AND P0, PT, R5, RZ, P0;      // pred0 = (x != 0.0f) && (fabsf(x) != INF)
    FMA R5, R2, -3, R5;                    // r = fmaf (q, -3.0f, x);
    MOV R4, R2                             // q
@P0 FFMA R4, R5, c[0x2][0x0], R2;          // if (pred0) q = fmaf (r, (1.0f/3.0f), q)
    ST.E [R6], R4;                         // store q

For my experiments, I wrote the tiny C test program shown below that steps through integer divisors in increasing order and for each of them exhaustively tests the above code sequence against the proper division. It prints a list of the divisors that passed this exhaustive test. Partial output looks as follows:

PASS: 1, 2, 3, 4, 5, 7, 8, 9, 11, 13, 15, 16, 17, 19, 21, 23, 25, 27, 29, 31, 32, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 64, 65, 67, 69,

To incorporate the replacement algorithm into a compiler as an optimization, a whitelist of divisors to which the above code transformation can safely be applied is impractical. The output of the program so far (at a rate of about one result per minute) suggests that the fast code works correctly across all possible encodings of x for those divisors y that are odd integers or are powers of two. Anecdotal evidence, not a proof, of course.

What set of mathematical conditions can determine a priori whether the transformation of division into the above code sequence is safe? Answers can assume that all the floating-point operations are performed in the default rounding mode of "round to nearest or even".

#include <stdlib.h>
#include <stdio.h>
#include <math.h>

int main (void)
{
    float r, q, x, y, rc;
    volatile union {
        float f;
        unsigned int i;
    } arg, res, ref;
    int err;

    y = 1.0f;
    printf ("PASS: ");
    while (1) {
        /* precompute reciprocal */
        rc = 1.0f / y;

        arg.i = 0x80000000;
        err = 0;
        do {
            /* do the division, fast */
            x = arg.f;
            q = x * rc;
            if ((x != 0) && (!isinf(x))) {
                r = fmaf (-y, q, x);
                q = fmaf (r, rc, q);
            }
            res.f = q;
            /* compute the reference, slowly */
            ref.f = x / y;

            if (res.i != ref.i) {
                err = 1;
                break;
            }
            arg.i--;
        } while (arg.i != 0x80000000);

        if (!err) printf ("%g, ", y);
        y += 1.0f;
    }
    return EXIT_SUCCESS;
}

This question asks for a way to identify the values of the constant Y that make it safe to transform x / Y into a cheaper computation using FMA for all possible values of x. Another approach is to use static analysis to determine an over-approximation of the values x can take, so that the generally unsound transformation can be applied in the knowledge that the values for which the transformed code differs from the original division do not occur.

Using representations of sets of floating-point values that are well adapted to the problems of floating-point computations, even a forwards analysis starting from the beginning of the function can produce useful information. For instance:

float f(float z) {
  float x = 1.0f + z;
  float r = x / Y;
  return r;
}

Assuming the default round-to-nearest mode(*), in the above function x can only be NaN (if the input is NaN), +0.0f, or a number of magnitude 2^-24 or larger, but not -0.0f or anything closer to zero than 2^-24. This justifies the transformation into one of the two forms shown in the question for many values of the constant Y.

(*) an assumption without which many optimizations are impossible, and which C compilers already make unless the program explicitly uses #pragma STDC FENV_ACCESS ON
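The claim about the possible values of x = 1.0f + z can be spot-checked with a few lines of C. The two function names below are invented for this sketch, and it assumes binary32 float, round-to-nearest, and no excess-precision evaluation.

```c
#include <math.h>
#include <stdbool.h>

/* Mirrors the first statement of f above: the value that gets divided. */
float dividend_of_f(float z)
{
    return 1.0f + z;
}

/* True when x lies in the set the analysis predicts for 1.0f + z:
   NaN, +0.0f, or magnitude >= 2^-24 (infinities included). */
bool in_predicted_set(float x)
{
    if (isnan(x)) {
        return true;
    }
    if (x == 0.0f) {
        return !signbit(x);          /* only +0.0f is allowed */
    }
    return fabsf(x) >= 0x1p-24f;
}
```

The two neighbours of -1.0f give the extreme cases: z = nextafterf(-1.0f, 0.0f) produces exactly 2^-24, and z = -1.0f itself produces +0.0f rather than -0.0f.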


A forwards static analysis that predicts the information for x above can be based on a representation of the sets of floating-point values an expression can take as a tuple of:

  • a representation for the sets of possible NaN values (since behaviors of NaN are underspecified, one choice is to use only a boolean, with true meaning some NaNs can be present, and false indicating no NaN is present),
  • four boolean flags indicating respectively the presence of +inf, -inf, +0.0, -0.0,
  • an inclusive interval of negative finite floating-point values, and
  • an inclusive interval of positive finite floating-point values.
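One possible C encoding of the tuple listed above is sketched here. All type, field, and function names are hypothetical, and the has_neg/has_pos flags are an addition of this sketch so that an empty interval can be expressed.

```c
#include <float.h>
#include <math.h>
#include <stdbool.h>

/* Illustrative abstract value: an over-approximation of the floats an
   expression may take. */
typedef struct {
    bool may_be_nan;                     /* some NaN may be present */
    bool pos_inf, neg_inf;               /* +inf / -inf may be present */
    bool pos_zero, neg_zero;             /* +0.0 / -0.0 may be present */
    bool has_neg; float neg_lo, neg_hi;  /* inclusive, negative finite */
    bool has_pos; float pos_lo, pos_hi;  /* inclusive, positive finite */
} float_set;

/* Membership test: may the concrete value v occur? */
bool float_set_contains(const float_set *s, float v)
{
    if (isnan(v)) return s->may_be_nan;
    if (isinf(v)) return v > 0 ? s->pos_inf : s->neg_inf;
    if (v == 0.0f) return signbit(v) ? s->neg_zero : s->pos_zero;
    if (v < 0.0f) return s->has_neg && s->neg_lo <= v && v <= s->neg_hi;
    return s->has_pos && s->pos_lo <= v && v <= s->pos_hi;
}

/* The set described earlier for x = 1.0f + z with z unconstrained. */
float_set float_set_one_plus_any(void)
{
    float_set s = {
        .may_be_nan = true,
        .pos_inf = true, .neg_inf = true,
        .pos_zero = true, .neg_zero = false,
        .has_neg = true, .neg_lo = -FLT_MAX, .neg_hi = -0x1p-24f,
        .has_pos = true, .pos_lo = 0x1p-24f, .pos_hi = FLT_MAX,
    };
    return s;
}
```

With this encoding, the questions an optimizer needs to ask ("can x be -0.0f?", "can x be smaller than some threshold?") become simple field lookups.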

In order to follow this approach, all the floating-point operations that can occur in a C program must be understood by the static analyzer. To illustrate, the addition between sets of values U and V, used to handle + in the analyzed code, can be implemented as:

  • If NaN is present in one of the operands, or if the operands can be infinities of opposite signs, NaN is present in the result.
  • If 0 cannot be a result of the addition of a value of U and a value of V, use standard interval arithmetic. The upper bound of the result is obtained for the round-to-nearest addition of the largest value in U and the largest value in V, so these bounds should be computed with round-to-nearest.
  • If 0 can be a result of the addition of a positive value of U and a negative value of V, then let M be the smallest positive value in U such that -M is present in V.
    • if succ(M) is present in U, then this pair of values contributes succ(M) - M to the positive values of the result.
    • if -succ(M) is present in V, then this pair of values contributes the negative value M - succ(M) to the negative values of the result.
    • if pred(M) is present in U, then this pair of values contributes the negative value pred(M) - M to the negative values of the result.
    • if -pred(M) is present in V, then this pair of values contributes the value M - pred(M) to the positive values of the result.
  • Do the same work if 0 can be the result of the addition of a negative value of U and a positive value of V.

Acknowledgement: the above borrows ideas from “Improving the Floating Point Addition and Subtraction Constraints”, Bruno Marre & Claude Michel


Example: compilation of the function f below:

float f(float z, float t) {
  float x = 1.0f + z;
  if (x + t == 0.0f) {
    float r = x / 6.0f;
    return r;
  }
  return 0.0f;
}

The approach in the question refuses to transform the division in function f into an alternate form, because 6 is not one of the values for which the division can be unconditionally transformed. Instead, what I am suggesting is to apply a simple value analysis starting from the beginning of the function which, in this case, determines that x is a finite float that is either +0.0f or at least 2^-24 in magnitude, and to use this information to apply Brisebarre et al's transformation, confident in the knowledge that x * C2 does not underflow.

To be explicit, I suggest using an algorithm such as the one below to decide whether or not to transform the division into something simpler:

  1. Is Y one of the values that can be transformed using Brisebarre et al's method according to their algorithm?
  2. Do C1 and C2 from their method have the same sign, or is it possible to exclude the possibility that the dividend is infinite?
  3. Do C1 and C2 from their method have the same sign, or can x take only one of the two representations of 0? In the case where C1 and C2 have different signs and x can only be one representation of zero, remember to fiddle(**) with the signs of the FMA-based computation to make it produce the correct zero when x is zero.
  4. Can the magnitude of the dividend be guaranteed to be large enough to exclude the possibility that x * C2 underflows?

If the answer to the four questions is “yes”, then the division can be transformed into a multiplication and an FMA in the context of the function being compiled. The static analysis described above serves to answer questions 2, 3, and 4.

(**) “fiddling with the signs” means using -FMA(-C1, x, (-C2)*x) in place of FMA(C1, x, C2*x) when this is necessary to make the result come out correctly when x can only be one of the two signed zeroes.

Let me restart for the third time. We are trying to accelerate

    q = x / y

where y is an integer constant, and q, x, and y are all IEEE 754-2008 binary32 floating-point values. Below, fmaf(a,b,c) indicates a fused multiply-add a * b + c using binary32 values.

The naive algorithm is via a precalculated reciprocal,

    C = 1.0f / y

so that at runtime a (much faster) multiplication suffices:

    q = x * C

The Brisebarre-Muller-Raina acceleration uses two precalculated constants,

    zh = 1.0f / y
    zl = -fmaf(zh, y, -1.0f) / y

so that at runtime, one multiplication and one fused multiply-add suffice:

    q = fmaf(x, zh, x * zl)

The Markstein algorithm combines the naive approach with two fused multiply-adds that yield the correct result if the naive approach yields a result within 1 unit in the least significant place, by precalculating

    C1 = 1.0f / y
    C2 = -y

so that the division can be approximated using

    t1 = x * C1
    t2 = fmaf(C2, t1, x)
    q  = fmaf(C1, t2, t1)

The naive approach works for all power-of-two y, but otherwise it is pretty bad. For example, for divisors 7, 14, 15, 28, and 30, it yields an incorrect result for more than half of all possible x.

The Brisebarre-Muller-Raina approach similarly fails for almost all non-power-of-two y, but far fewer x yield an incorrect result (less than half a percent of all possible x, varying with y).

The Brisebarre-Muller-Raina article shows that the maximum error in the naive approach is ±1.5 ULPs.

The Markstein approach yields correct results for power-of-two y, and also for odd integer y. (I have not found a failing odd integer divisor for the Markstein approach.)


For the Markstein approach, I have analysed divisors 1 - 19700 (raw data here).

Plotting the number of failure cases (divisor on the horizontal axis, the number of values of x where the Markstein approach fails for said divisor on the vertical axis), we can see a simple pattern occur:

Markstein failure cases: http://www.nominal-animal.net/answers/markstein.png

Note that these plots have both horizontal and vertical axes logarithmic. There are no dots for odd divisors, as the approach yields correct results for all odd divisors I've tested.

If we change the x axis to the bit reverse (binary digits in reverse order, i.e. 0b11101101 → 0b10110111, data) of the divisors, we have a very clear pattern:

Markstein failure cases, bit reverse divisor: http://www.nominal-animal.net/answers/markstein-failures.png

If we draw a straight line through the center of the point sets, we get the curve 4194304/x. (Remember, the plot considers only half the possible floats, so when considering all possible floats, double it.) The curves 8388608/x and 2097152/x bracket the entire error pattern completely.

Thus, if we use rev(y) to compute the bit reverse of divisor y, then 8388608/rev(y) is a good first order approximation of the number of cases (out of all possible float) where the Markstein approach yields an incorrect result for an even, non-power-of-two divisor y. (Or, 16777216/rev(y) for the upper limit.)
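A possible C rendering of rev(y) and the first-order estimate is sketched below. It assumes that the reversal covers only the significant binary digits (no leading zeros), which matches the 0b11101101 → 0b10110111 example but is otherwise a guess; the use of integer division and the name markstein_failures_first_order are likewise choices of this sketch.

```c
#include <stdint.h>

/* Reverses the significant binary digits of v (leading zeros are not
   part of the reversal), e.g. 0xED (0b11101101) -> 0xB7 (0b10110111).
   Trailing zero bits of v therefore vanish from the result. */
uint32_t rev(uint32_t v)
{
    uint32_t r = 0;
    while (v != 0) {
        r = (r << 1) | (v & 1u);
        v >>= 1;
    }
    return r;
}

/* First-order estimate 8388608/rev(y) for an even, non-power-of-two
   integer divisor y. The caller must pass a nonzero y. */
uint32_t markstein_failures_first_order(uint32_t y)
{
    return 8388608u / rev(y);
}
```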

Added 2016-02-28: I found an approximation for the number of error cases using the Markstein approach, given any integer (binary32) divisor. Here it is as pseudocode:

function markstein_failure_estimate(divisor):
    if (divisor is zero)
        return no estimate
    if (divisor is not an integer)
        return no estimate

    if (divisor is negative)
        negate divisor

    # Consider, for avoiding underflow cases,
    if (divisor is very large, say 1e+30 or larger)
        return no estimate - do as division

    while (divisor > 16777216)
        divisor = divisor / 2

    if (divisor is a power of two)
        return 0

    if (divisor is odd)
        return 0

    while (divisor is not odd)
        divisor = divisor / 2

    # Use return (1 + 8388608 / divisor) / 2
    # if only nonnegative finite float dividends are counted!
    return 1 + 8388608 / divisor

This yields a correct error estimate to within ±1 on the Markstein failure cases I have tested (but I have not yet adequately tested divisors larger than 8388608). The final division should be such that it reports no false zeroes, but I cannot guarantee it (yet). It does not take into account very large divisors (say 0x1p100, or 1e+30, and larger in magnitude) which have underflow issues; I would definitely exclude such divisors from acceleration anyway.

In preliminary testing, the estimate seems uncannily accurate. I did not draw a plot comparing the estimates and the actual errors for divisors 1 to 20000, because the points all coincide exactly in the plots. (Within this range, the estimate is exact, or one too large.) Essentially, the estimates reproduce the first plot in this answer exactly.


The pattern of failures for the Markstein approach is regular, and very interesting. The approach works for all power-of-two divisors, and all odd integer divisors.

For divisors greater than 16777216, I consistently see the same errors as for the divisor obtained by dividing by the smallest power of two that yields a value less than 16777216. For example, 0x1.3cdfa4p+23 and 0x1.3cdfa4p+41, 0x1.d8874p+23 and 0x1.d8874p+32, 0x1.cf84f8p+23 and 0x1.cf84f8p+34, 0x1.e4a7fp+23 and 0x1.e4a7fp+37. (Within each pair, the mantissa is the same, and only the power of two varies.)

Assuming my test bench is not in error, this means that the Markstein approach also works for divisors larger than 16777216 in magnitude (but smaller than, say, 1e+30), provided that dividing the divisor by the smallest power of two that yields a quotient of less than 16777216 in magnitude gives an odd quotient.

The result of a floating point division is:

  • a sign flag
  • a significand
  • an exponent
  • a set of flags (overflow, underflow, inexact, etc. - see <fenv.h>)

Getting the first 3 pieces correct (but the set of flags incorrect) is not enough. Without further knowledge (e.g. which parts of which pieces of the result actually matter, the possible values of the dividend, etc.) I would assume that replacing division by a constant with multiplication by a constant (and/or a convoluted FMA mess) is almost never safe.

In addition, for modern CPUs I also wouldn't assume that replacing a division with 2 FMAs is always an improvement. For example, if the bottleneck is instruction fetch/decode, then this "optimisation" would make performance worse. For another example, if subsequent instructions don't depend on the result (the CPU can do many other instructions in parallel while waiting for the result) the FMA version may introduce multiple dependency stalls and make performance worse. For a third example, if all registers are being used then the FMA version (which requires additional "live" variables) may increase "spilling" and make performance worse.

Note that (in many but not all cases) division or multiplication by a constant power of 2 can be done with addition alone (specifically, adding a shift count to the exponent).
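The exponent trick mentioned in the last paragraph can be sketched in C as follows. The function name is hypothetical and the code assumes float is IEEE-754 binary32; the early returns mark the "not all cases" part, where the shortcut has to be abandoned. Note that this path raises no floating-point exception flags.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Divides x by 2^k (a negative k multiplies) by adjusting the biased
   exponent field of a binary32 value. Returns false, leaving *out
   untouched, when the shortcut does not apply: x is zero, subnormal,
   infinite or NaN, or the result would leave the normal range. */
bool div_pow2_by_exponent(float x, int k, float *out)
{
    uint32_t bits;
    if (k < -300 || k > 300) {
        return false;                 /* far outside any binary32 exponent */
    }
    memcpy(&bits, &x, sizeof bits);
    int e = (int)((bits >> 23) & 0xFFu);
    if (e == 0 || e == 0xFF) {
        return false;                 /* zero, subnormal, infinity, NaN */
    }
    int ne = e - k;
    if (ne <= 0 || ne >= 0xFF) {
        return false;                 /* result would underflow or overflow */
    }
    bits = (bits & 0x807FFFFFu) | ((uint32_t)ne << 23);
    memcpy(out, &bits, sizeof *out);
    return true;
}
```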

I love @Pascal's answer but in optimization it's often better to have a simple and well-understood subset of transformations rather than a perfect solution.

All current and common historical floating point formats had one thing in common: a binary mantissa.

Therefore, all fractions were rational numbers of the form:

x / 2^n

This is in contrast to the constants in the program (and all possible base-10 fractions) which are rational numbers of the form:

x / (2^n * 5^m)

So, one optimization would simply test the input and reciprocal for m == 0, since those numbers are represented exactly in the FP format and operations with them should produce numbers that are accurate within the format.

So, for example, within the (decimal 2-digit) range of .01 to 0.99, dividing or multiplying by the following numbers would be optimized:

.25 .50 .75

And everything else would not. (I think, do test it first, lol.)
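The m == 0 test can be sketched in C on a fraction given as numerator and denominator. The function names are made up for this example, and the check ignores the finite width of the mantissa, so it only answers whether the value is a binary fraction at all.

```c
#include <stdbool.h>

/* Greatest common divisor by Euclid's algorithm. */
static unsigned long gcd_ul(unsigned long a, unsigned long b)
{
    while (b != 0) {
        unsigned long t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* True when num/den, in lowest terms, has a power-of-two denominator,
   i.e. the fraction has the form x / 2^n with m == 0. */
bool is_dyadic(unsigned long num, unsigned long den)
{
    if (den == 0) {
        return false;
    }
    den /= gcd_ul(num, den);          /* reduce to lowest terms */
    while ((den & 1ul) == 0) {
        den >>= 1;                    /* strip factors of two */
    }
    return den == 1;
}
```

Among the two-digit decimals n/100, only 25/100, 50/100, and 75/100 pass. Testing a constant and its reciprocal, as the paragraph on m == 0 suggests, means calling the function on both num/den and den/num.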
