Algorithm for pow(float, float)

I need an efficient algorithm to perform the math::power function between two floating-point numbers. Do you know how to do this? (I need an algorithm that does not use the function itself.)

Since IEEE-754 binary floating-point numbers are fractions, computing a^b is technically an algebraic operation. However, the common approach to implementing powf(float a, float b) is as e^(b * log a), i.e. using transcendental functions.
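
A direct, naive rendering of this identity is shown below as a sketch (the helper name is mine, not from a library); it is valid only for a > 0 and can be off by many ulps, because the rounding error of logf() gets magnified by expf():

#include <math.h>   // for expf, logf

/* Naive sketch of the exp/log identity; for illustration only. */
float naive_powf (float a, float b)
{
    return expf (b * logf (a));  /* valid only for a > 0 */
}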

There are a few caveats, however. log a is undefined for a < 0, while powf() allows computation with some negative a. Exponentiation, in the form of expf(), suffers from error magnification, as I explained in my answer to this question. This requires us to compute log a with higher than single precision for an accurate powf() result. There are various techniques to achieve this; a simple way is to use limited amounts of double-float computation, references for which I provided in my answer to this question. The essence of double-float is that each floating-point operand is represented as a pair of float values called the "head" and the "tail", which satisfy the relation |tail| ≤ ½ * ulp(|head|) when properly normalized.
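
For illustration, such a head/tail pair can be produced from an ordinary addition with the classic Fast2Sum step (Dekker); the helper name below is mine, not part of the code in this answer, and it assumes |a| >= |b|:

/* Fast2Sum: given |a| >= |b|, split a + b exactly into a head (the
   rounded sum) and a tail (the rounding error), so head + tail == a + b
   and |tail| <= 0.5 * ulp(|head|). */
void fast_two_sum (float a, float b, float *head, float *tail)
{
    float s = a + b;   /* rounded sum = head            */
    float z = s - a;   /* the part of b absorbed into s */
    *head = s;
    *tail = b - z;     /* what was lost in rounding     */
}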

The code below shows an exemplary implementation of this approach. It assumes that the IEEE 754-2008 operation FMA (fused multiply-add) is available, which is exposed in C as the standard math functions fma() and fmaf(). It does not provide for handling of errno or floating-point exceptions, but it does provide for the correct handling of all 18 special cases enumerated by the ISO C standard. Tests have been performed with denormal support enabled; the code may or may not work properly within a non-IEEE-754 flush-to-zero (FTZ) environment.
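
One reason the FMA requirement matters: a single fmaf() recovers the exact rounding error of a float multiplication, which is the basic building block of the double-float arithmetic used below. A minimal sketch, with an illustrative helper name of my own:

#include <math.h>   // for fmaf

/* Split the product a * b into a double-float hi:lo using one FMA:
   hi is the rounded product, lo its exact rounding error, so that
   hi + lo == a * b (barring overflow or underflow of the product). */
void two_prod_fma (float a, float b, float *hi, float *lo)
{
    *hi = a * b;
    *lo = fmaf (a, b, -*hi);
}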

The exponentiation part employs a simple argument reduction based directly on the semi-logarithmic encoding of floating-point numbers, then applies a polynomial minimax approximation on the primary approximation interval. The logarithm computation is based on log(x) = 2 atanh((x-1) / (x+1)), combined with selective use of double-float computation, to achieve a relative accuracy of 8.3e-10. The computation of b * log a is performed as a double-float operation. The accuracy of the final exponentiation is improved by linear interpolation, by observing that e^x is its own derivative, and that therefore e^(x+y) ≅ e^x + y ⋅ e^x when |y| ≪ |x|.
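
The atanh-based identity can be sanity-checked on its own with the standard library before building the double-float version; a rough sketch for demonstration only (no argument reduction, so accuracy degrades away from x ≈ 1; the function name is illustrative):

#include <math.h>   // for atanhf

/* log(x) = 2 * atanh((x - 1) / (x + 1)) for x > 0. Single-precision
   demonstration of the identity used by my_logf_ext() below. */
float log_via_atanh (float x)
{
    return 2.0f * atanhf ((x - 1.0f) / (x + 1.0f));
}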

Double-float computation becomes a bit iffy near overflow boundaries; there are two instances of this in the code. When the head portion of the input to exp is just causing the result to overflow to infinity, the tail portion may be negative, such that the result of powf() is actually finite. One way to address this is to decrease the value of the "head" by one ulp in such a case; an alternative is to compute the head via addition in round-to-zero mode where readily available, since this will ensure like signs for head and tail. The other caveat is that if the exponentiation does overflow, we cannot interpolate the result, as doing so would create a NaN.
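
Where a round-to-zero add is not available as an intrinsic (the __fadd_rz() used under FAST_FADD_RZ below is a platform-specific intrinsic, e.g. in CUDA), a portable but slower equivalent can be sketched with <fenv.h>; this helper is my illustration, not part of the original code:

#include <fenv.h>   // for fegetround, fesetround, FE_TOWARDZERO

/* Add two floats with the result rounded toward zero, so that head and
   tail of a subsequent double-float normalization have like signs.
   Requires a platform where FE_TOWARDZERO is supported; strict ISO C
   also wants "#pragma STDC FENV_ACCESS ON" in effect. */
float fadd_rz_portable (float a, float b)
{
    int old_mode = fegetround ();
    fesetround (FE_TOWARDZERO);
    float s = a + b;
    fesetround (old_mode);
    return s;
}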

It should be noted that the accuracy of the logarithm computation used here is not sufficient to ensure a faithfully-rounded powf() implementation, but it provides a reasonably small error (the maximum error I have found in extensive testing is less than 2 ulps) and it allows the code to be kept reasonably simple for the purpose of demonstrating relevant design principles.

#include <stdint.h> // for uint32_t
#include <string.h> // for memcpy
#include <math.h>   // for fmaf, fabsf, floorf, copysignf, frexpf, ldexpf, isnan, isinf, nextafterf

#define PORTABLE (1) // 0=bit-manipulation of 'float', 1= math library functions

uint32_t float_as_uint32 (float a) 
{ 
    uint32_t r; 
    memcpy (&r, &a, sizeof r); 
    return r; 
}

float uint32_as_float (uint32_t a) 
{ 
    float r; 
    memcpy (&r, &a, sizeof r); 
    return r; 
}

/* Compute log(a) with extended precision, returned as a double-float value 
   loghi:loglo. Maximum relative error: 8.5626e-10.
*/
void my_logf_ext (float a, float *loghi, float *loglo)
{
    const float LOG2_HI =  6.93147182e-1f; //  0x1.62e430p-1
    const float LOG2_LO = -1.90465421e-9f; // -0x1.05c610p-29
    const float SQRT_HALF = 0.70710678f;
    float m, r, i, s, t, p, qhi, qlo;
    int e;

    /* Reduce argument to m in [sqrt(0.5), sqrt(2.0)] */
#if PORTABLE
    m = frexpf (a, &e);
    if (m < SQRT_HALF) {
        m = m + m;
        e = e - 1;
    }
    i = (float)e;
#else // PORTABLE
    const float POW_TWO_M23 = 1.19209290e-7f; // 0x1.0p-23
    const float POW_TWO_P23 = 8388608.0f; // 0x1.0p+23
    const float FP32_MIN_NORM = 1.175494351e-38f; // 0x1.0p-126
    i = 0.0f;
    /* fix up denormal inputs */
    if (a < FP32_MIN_NORM){
        a = a * POW_TWO_P23;
        i = -23.0f;
    }
    e = (float_as_uint32 (a) - float_as_uint32 (SQRT_HALF)) & 0xff800000;
    m = uint32_as_float (float_as_uint32 (a) - e);
    i = fmaf ((float)e, POW_TWO_M23, i);
#endif // PORTABLE
    /* Compute q = (m-1)/(m+1) as a double-float qhi:qlo */
    p = m + 1.0f;
    m = m - 1.0f;
    r = 1.0f / p;
    qhi = r * m;
    qlo = r * fmaf (qhi, -m, fmaf (qhi, -2.0f, m));
    /* Approximate atanh(q), q in [sqrt(0.5)-1, sqrt(2)-1] */ 
    s = qhi * qhi;
    r =             0.1293334961f;  // 0x1.08c000p-3
    r = fmaf (r, s, 0.1419928074f); // 0x1.22cd9cp-3
    r = fmaf (r, s, 0.2000148296f); // 0x1.99a162p-3
    r = fmaf (r, s, 0.3333332539f); // 0x1.555550p-2
    t = fmaf (qhi, qlo + qlo, fmaf (qhi, qhi, -s)); // s:t = (qhi:qlo)**2
    p = s * qhi;
    t = fmaf (s, qlo, fmaf (t, qhi, fmaf (s, qhi, -p))); // p:t = (qhi:qlo)**3
    s = fmaf (r, p, fmaf (r, t, qlo));
    r = 2 * qhi;
    /* log(a) = 2 * atanh(q) + i * log(2) */
    t = fmaf ( LOG2_HI, i, r);
    p = fmaf (-LOG2_HI, i, t);
    s = fmaf ( LOG2_LO, i, fmaf (2.f, s, r - p));
    *loghi = p = t + s;    // normalize double-float result
    *loglo = (t - p) + s;
}

/* Compute exponential base e. No checking for underflow and overflow. Maximum
   ulp error = 0.86565 
*/
float my_expf_unchecked (float a)
{
    float f, j, r;
    int i;

    // exp(a) = 2**i * exp(f); i = rintf (a / log(2))
    j = fmaf (1.442695f, a, 12582912.f) - 12582912.f; // 0x1.715476p0, 0x1.8p23
    f = fmaf (j, -6.93145752e-1f, a); // -0x1.62e400p-1  // log_2_hi 
    f = fmaf (j, -1.42860677e-6f, f); // -0x1.7f7d1cp-20 // log_2_lo 
    i = (int)j;
    // approximate r = exp(f) on interval [-log(2)/2, +log(2)/2]
    r =             1.37805939e-3f;  // 0x1.694000p-10
    r = fmaf (r, f, 8.37312452e-3f); // 0x1.125edcp-7
    r = fmaf (r, f, 4.16695364e-2f); // 0x1.555b5ap-5
    r = fmaf (r, f, 1.66664720e-1f); // 0x1.555450p-3
    r = fmaf (r, f, 4.99999851e-1f); // 0x1.fffff6p-2
    r = fmaf (r, f, 1.00000000e+0f); // 0x1.000000p+0
    r = fmaf (r, f, 1.00000000e+0f); // 0x1.000000p+0
    // exp(a) = 2**i * r
#if PORTABLE
    r = ldexpf (r, i);
#else // PORTABLE
    float s, t;
    uint32_t ia = (i > 0) ? 0u : 0x83000000u;
    s = uint32_as_float (0x7f000000u + ia);
    t = uint32_as_float (((uint32_t)i << 23) - ia);
    r = r * s;
    r = r * t;
#endif // PORTABLE
    return r;
}

/* a**b = exp (b * log (a)), where a > 0, and log(a) is computed with extended 
   precision as a double-float. Maximum error found across 2**42 test cases:
   1.97302 ulp @ (0.71162397, -256.672424).
*/
float my_powf_core (float a, float b)
{
    const float MAX_IEEE754_FLT = uint32_as_float (0x7f7fffff);
    const float EXP_OVFL_BOUND = 88.7228394f; // 0x1.62e430p+6f;
    const float EXP_OVFL_UNFL_F = 104.0f;
    const float MY_INF_F = uint32_as_float (0x7f800000);
    float lhi, llo, thi, tlo, phi, plo, r;

    /* compute lhi:llo = log(a) */
    my_logf_ext (a, &lhi, &llo);
    /* compute phi:plo = b * log(a) */
    thi = lhi * b;
    if (fabsf (thi) > EXP_OVFL_UNFL_F) { // definitely overflow / underflow
        r = (thi < 0.0f) ? 0.0f : MY_INF_F;
    } else {
        tlo = fmaf (lhi, b, -thi);
        tlo = fmaf (llo, b, +tlo);
        /* normalize intermediate result thi:tlo, giving final result phi:plo */
#if FAST_FADD_RZ
        phi = __fadd_rz (thi, tlo);// avoid premature ovfl in exp() computation
#else // FAST_FADD_RZ
        phi = thi + tlo;
        if (phi == EXP_OVFL_BOUND){// avoid premature ovfl in exp() computation
#if PORTABLE
            phi = nextafterf (phi, 0.0f);
#else // PORTABLE
            phi = uint32_as_float (float_as_uint32 (phi) - 1);
#endif // PORTABLE
        }
#endif // FAST_FADD_RZ
        plo = (thi - phi) + tlo;
        /* exp'(x) = exp(x); exp(x+y) = exp(x) + exp(x) * y, for |y| << |x| */
        r = my_expf_unchecked (phi);
        /* prevent generation of NaN during interpolation due to r = INF */
        if (fabsf (r) <= MAX_IEEE754_FLT) {
            r = fmaf (plo, r, r);
        }
    }
    return r;
}

float my_powf (float a, float b)
{
    const float MY_INF_F = uint32_as_float (0x7f800000);
    const float MY_NAN_F = uint32_as_float (0xffc00000);
    int expo_odd_int;
    float r;

    /* special case handling per ISO C specification */
    expo_odd_int = fmaf (-2.0f, floorf (0.5f * b), b) == 1.0f;
    if ((a == 1.0f) || (b == 0.0f)) {
        r = 1.0f;
    } else if (isnan (a) || isnan (b)) {
        r = a + b;  // convert SNaN to QNaN or trigger exception
    } else if (isinf (b)) {
        r = ((fabsf (a) < 1.0f) != (b < 0.0f)) ? 0.0f :  MY_INF_F;
        if (a == -1.0f) r = 1.0f;
    } else if (isinf (a)) {
        r = (b < 0.0f) ? 0.0f : MY_INF_F;
        if ((a < 0.0f) && expo_odd_int) r = -r;
    } else if (a == 0.0f) {
        r = (expo_odd_int) ? (a + a) : 0.0f;
        if (b < 0.0f) r = copysignf (MY_INF_F, r);
    } else if ((a < 0.0f) && (b != floorf (b))) {
        r = MY_NAN_F;
    } else {
        r = my_powf_core (fabsf (a), b);
        if ((a < 0.0f) && expo_odd_int) {
            r = -r;
        }
    }
    return r;
}
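
For completeness, a tiny test harness (hypothetical, not part of the answer) that exercises my_powf() against the standard powf() on a few sample inputs:

#include <stdio.h>  // for printf

int main (void)
{
    const float test[][2] = {
        { 2.0f, 10.0f }, { 0.71162397f, -256.672424f },
        { -3.0f, 3.0f }, { 0.0f, -2.0f }, { 16.0f, 0.25f }
    };
    for (size_t i = 0; i < sizeof test / sizeof test[0]; i++) {
        float a = test[i][0], b = test[i][1];
        printf ("my_powf(%g, %g) = %g   powf = %g\n",
                a, b, my_powf (a, b), powf (a, b));
    }
    return 0;
}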

The general algorithm tends to be computing the float power as the combination of the integer power and the remaining root. The integer power is fairly straightforward; the root can be computed using either the Newton-Raphson method or a Taylor series. IIRC Numerical Recipes in C has some text on this. There are other (potentially better) methods for doing this too, but this would make a reasonable starting point for what is a surprisingly complex problem to implement. Note also that some implementations use lookup tables and a number of tricks to reduce the computation required.
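
As a rough illustration of that decomposition (sketch only, with hypothetical helper names): the integer part can be handled by exponentiation by squaring, and the fractional part by a Newton-Raphson n-th root iteration.

#include <math.h>   // for frexpf, ldexpf

/* Integer power by repeated squaring: O(log n) multiplications.
   Sketch only; no overflow/underflow or special-case handling. */
float ipowf (float base, int n)
{
    unsigned int e = (n < 0) ? -(unsigned int)n : (unsigned int)n;
    float result = 1.0f;
    while (e > 0) {
        if (e & 1u) result *= base;   /* fold in the current bit */
        base *= base;                 /* square for the next bit */
        e >>= 1;
    }
    return (n < 0) ? (1.0f / result) : result;
}

/* Newton-Raphson iteration for the n-th root of x (x > 0, n > 0):
   y <- ((n - 1) * y + x / y^(n-1)) / n. Sketch only. */
float nth_rootf (float x, int n)
{
    int ex;
    frexpf (x, &ex);                   /* x = m * 2^ex, 0.5 <= m < 1 */
    float y = ldexpf (1.0f, ex / n);   /* rough magnitude estimate   */
    for (int k = 0; k < 100; k++) {
        float yn = ((n - 1) * y + x / ipowf (y, n - 1)) / n;
        if (yn == y) break;            /* converged to float precision */
        y = yn;
    }
    return y;
}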
