
Single precision argument reduction for trigonometric functions in C

I have implemented some approximations for trigonometric functions (sin, cos, arctan) computed in single precision (32-bit floating point) in C. They are accurate to about +/- 2 ulp.

My target device does not support any <cmath> or <math.h> methods. It does not provide an FMA, but it has a MAC ALU. The ALU and LU compute in 32-bit format.

My arctan approximation is actually a modified version of N. Juffa's approximation, which approximates arctan over the full range. The sine and cosine functions are accurate to 2 ulp within the range [-pi, pi].

I am now aiming to support a larger input range (as large as possible, ideally [FLT_MIN, FLT_MAX]) for sine and cosine, which leads me to argument reduction.

I'm currently reading different papers, such as "ARGUMENT REDUCTION FOR HUGE ARGUMENTS: Good to the Last Bit" by K. C. Ng, and the paper about this new argument reduction algorithm, but I wasn't able to derive an implementation from them.

I also want to mention two Stack Overflow questions that deal with related problems: there is an approach using MATLAB and C++ which is based on the first paper I linked. It actually uses MATLAB and cmath methods, and it limits the input to [0, 20000]. The other one was already mentioned in the comments; it is an approach to implementing sin and cos in C, using various C libraries which are not available to me. Since both posts are already several years old, there might be some new findings.

It seems that the algorithm mostly used in this case is to store 2/pi accurately up to the needed number of bits, to be able to compute the modulo calculation accurately while simultaneously avoiding cancellation. My device does not provide a large DMEM, which means large look-up tables with hundreds of bits are not possible. This procedure is actually described on page 70 of this reference, which by the way provides a lot of useful information about floating-point math.
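To state the underlying identity explicitly (my summary of the idea, not a quote from the papers): the quadrant q and the reduced argument r are

q = round(x * 2/pi) mod 4
r = (pi/2) * (x * 2/pi - round(x * 2/pi)),  with r in [-pi/4, +pi/4]

For an input x = m * 2^e with a 24-bit integer significand m, only a narrow window of the bits of 2/pi actually matters: bits well above the window contribute integer multiples of 4 quadrants, which cancel under the mod 4, and bits well below it fall outside the precision of the final rounding. This is why a short table of 32-bit words of 2/pi, indexed by the exponent, can replace a huge LUT.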

So my question is: is there another efficient way to reduce the arguments for sine and cosine to single precision while avoiding large LUTs? The papers mentioned above actually focus on double precision and use up to 1000 digits, which is not suitable for my use case.

I actually haven't found any implementation in C, nor an implementation aimed at single-precision calculation. I would be grateful for any sort of hints/links/examples...

The following code is based on a previous answer in which I demonstrated how to perform a fairly accurate argument reduction for trigonometric functions, using the Cody-Waite method of split constants for arguments small in magnitude, and the Payne-Hanek method for arguments large in magnitude. For details on the Payne-Hanek algorithm see there; for details on the Cody-Waite algorithm see this previous answer of mine.

Here I have made the adjustments necessary to accommodate the restrictions of the asker's platform: no 64-bit types are supported, fused multiply-add is not supported, and helper functions from math.h are not available. I am assuming that float maps to the IEEE-754 binary32 format, and that there is a way to re-interpret such a 32-bit float as a 32-bit unsigned integer and vice versa. I have implemented this re-interpretation via the standard portable idiom, that is, by using memcpy(), but other methods may be chosen as appropriate for the unspecified target platform, such as inline assembly, machine-specific intrinsics, or volatile unions.
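As an illustration of the union-based alternative just mentioned (my sketch, not part of the answer's code; reading a union member other than the one last written is defined behavior in C99 and later, unlike in C++, and some compilers may additionally want the union qualified volatile):

#include <stdint.h>

uint32_t float_as_uint32_alt (float a)
{
    union { float f; uint32_t u; } cvt;  /* overlay float and uint32 */
    cvt.f = a;
    return cvt.u;
}

float uint32_as_float_alt (uint32_t a)
{
    union { uint32_t u; float f; } cvt;
    cvt.u = a;
    return cvt.f;
}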

Since this code is basically a port of my previous code to a more restrictive environment, it perhaps lacks the elegance of a de novo design specifically targeted at that environment. I have basically replaced the frexp() helper function from math.h with some bit twiddling, emulated 64-bit integer computation with pairs of 32-bit integers, replaced the double-precision computation with 32-bit fixed-point computation (which worked much better than I had anticipated), and replaced all FMAs with the unfused equivalent.

Re-working the Cody-Waite portion of the argument reduction took quite a bit of work. Clearly, without FMA available, we need to ensure a sufficient number of trailing zero bits in the constituent parts of the constant pi/2 (except the least significant one) to make sure the products are exact. I spent several hours experimentally puzzling out a particular split that delivers accurate results but also pushes the switchover point to the Payne-Hanek method as high as possible.
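To illustrate that constraint, here is a small standalone sketch (my addition, with a hypothetical helper name) that counts the trailing zero bits in the significands of the CW_STAGES == 3 constants used below; they turn out to have 7, 6, and 1 trailing zero bits, respectively. Roughly speaking, the significant bits of the integer quotient j plus the significant bits of each constituent (except the last) must fit into binary32's 24-bit significand for the products j * c to be exact or very nearly so.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* count trailing zero bits in the 24-bit significand of a binary32 value */
int significand_trailing_zeros (float c)
{
    uint32_t u;
    memcpy (&u, &c, sizeof u);
    uint32_t m = (u & 0x007fffff) | 0x00800000; /* make leading 1 explicit */
    int n = 0;
    while ((m & 1) == 0) {
        m >>= 1;
        n++;
    }
    return n;
}

int main (void)
{
    /* the three-stage split of pi/2 used below when USE_FMA = 0 */
    printf ("pio2_high: %d\n", significand_trailing_zeros (0x1.921f00p+00f));
    printf ("pio2_mid:  %d\n", significand_trailing_zeros (0x1.6a8880p-17f));
    printf ("pio2_low:  %d\n", significand_trailing_zeros (0x1.68c234p-39f));
    return 0;
}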

When USE_FMA = 1 is specified, the output of the test app, when built against a high-quality math library, should look similar to this:

Testing sinf ...  PASSED. max ulp err = 1.493253  diffsum = 337633490
Testing cosf ...  PASSED. max ulp err = 1.495098  diffsum = 342020968

With USE_FMA = 0, the accuracy changes slightly for the worse:

Testing sinf ...  PASSED. max ulp err = 1.498012  diffsum = 359702532
Testing cosf ...  PASSED. max ulp err = 1.504061  diffsum = 364682650

The diffsum output is a rough indicator of overall accuracy; here it shows that about 90% of all inputs result in a correctly rounded single-precision response. To put the number in perspective: there are 2^32 (about 4.29e9) test inputs, and since almost all observed differences are a single ulp, a diffsum of about 3.4e8 means that roughly 8% of the results differ from the correctly rounded result.

Note that it is important to compile the code with the strictest floating-point settings and the highest degree of adherence to IEEE-754 that the compiler offers. For the Intel compiler that I used to develop and test this code, that can be achieved by compiling with /fp:strict. Also, the quality of the math library used for reference is crucial for an accurate assessment of the ulp error of this single-precision code. The Intel compiler comes with a math library that provides double-precision elementary math functions with just slightly over 0.5 ulp error in the HA (high accuracy) variant. Use of a multi-precision reference library may be preferable, but would have slowed me down too much here.
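For example, a build might look like this (assuming the Windows driver of the classic Intel compiler and a hypothetical file name; adjust for your toolchain):

icl /fp:strict /O2 trig_red_test.c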

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>   // for memcpy()
#include <math.h>     // for test purposes, and when PORTABLE=1 or USE_FMA=1

#define USE_FMA   (0) // use fmaf() calls for arithmetic
#define PORTABLE  (0) // allow helper functions from math.h
#define HAVE_U64  (0) // 64-bit integer type available
#define CW_STAGES (3) // number of stages in Cody-Waite reduction when USE_FMA=0

#if USE_FMA
#define SIN_RED_SWITCHOVER  (117435.992f)
#define COS_RED_SWITCHOVER  (71476.0625f)
#define MAX_DIFF            (1)
#else // USE_FMA
#if CW_STAGES == 2
#define SIN_RED_SWITCHOVER  (3.921875f)
#define COS_RED_SWITCHOVER  (3.921875f)
#elif CW_STAGES == 3
#define SIN_RED_SWITCHOVER  (201.15625f)
#define COS_RED_SWITCHOVER  (142.90625f)
#endif // CW_STAGES
#define MAX_DIFF            (2)
#endif // USE_FMA

/* re-interpret the bit pattern of an IEEE-754 float as a uint32 */
uint32_t float_as_uint32 (float a)
{
    uint32_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

/* re-interpret the bit pattern of a uint32 as an IEEE-754 float */
float uint32_as_float (uint32_t a)
{
    float r;
    memcpy (&r, &a, sizeof r);
    return r;
}

/* Compute the upper 32 bits of the product of two unsigned 32-bit integers */
#if HAVE_U64
uint32_t umul32_hi (uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}
#else // HAVE_U64
/* Henry S. Warren, "Hacker's Delight, 2nd ed.", Addison-Wesley 2012. Fig. 8-2 */
uint32_t umul32_hi (uint32_t a, uint32_t b)
{
    uint16_t a_lo = (uint16_t)a;
    uint16_t a_hi = a >> 16;
    uint16_t b_lo = (uint16_t)b;
    uint16_t b_hi = b >> 16;
    uint32_t p0 = (uint32_t)a_lo * b_lo;
    uint32_t p1 = (uint32_t)a_lo * b_hi;
    uint32_t p2 = (uint32_t)a_hi * b_lo;
    uint32_t p3 = (uint32_t)a_hi * b_hi;
    uint32_t t = (p0 >> 16) + p1;
    return (t >> 16) + (((uint32_t)(uint16_t)t + p2) >> 16) + p3;
}
#endif // HAVE_U64

/* 190 bits of 2/PI for Payne-Hanek style argument reduction. */
const uint32_t two_over_pi_f [] = 
{
    0x28be60db,
    0x9391054a,
    0x7f09d5f4,
    0x7d4d3770,
    0x36d8a566,
    0x4f10e410
};

/* Reduce a trig function argument using the slow Payne-Hanek method */
float trig_red_slowpath_f (float a, int *quadrant)
{
    uint32_t ia, hi, mid, lo, tmp, i, l, h, plo, phi;
    int32_t e, q;
    float r;

#if PORTABLE
    ia = (uint32_t)(fabsf (frexpf (a, &e)) * 0x1.0p32f); // 4.29496730e+9
#else // PORTABLE
    ia = ((float_as_uint32 (a) & 0x007fffff) << 8) | 0x80000000;
    e = ((float_as_uint32 (a) >> 23) & 0xff) - 126;
#endif // PORTABLE
    
    /* compute product x * 2/pi in 2.62 fixed-point format */
    i = (uint32_t)e >> 5;
    e = (uint32_t)e & 31;

    hi  = i ? two_over_pi_f [i-1] : 0;
    mid = two_over_pi_f [i+0];
    lo  = two_over_pi_f [i+1];
    tmp = two_over_pi_f [i+2];
 
    if (e) {
        hi  = (hi  << e) | (mid >> (32 - e));
        mid = (mid << e) | (lo  >> (32 - e));
        lo  = (lo  << e) | (tmp >> (32 - e));
    }

    /* compute 64-bit product phi:plo */
    phi = 0;
    l = ia * lo;
    h = umul32_hi (ia, lo);
    plo = phi + l;
    phi = h + (plo < l);
    l = ia * mid;
    h = umul32_hi (ia, mid);
    plo = phi + l;
    phi = h + (plo < l);
    l = ia * hi;
    phi = phi + l;

    /* split fixed-point result into integer and fraction portions */
    q = phi >> 30;               // integral portion = quadrant<1:0>
    phi = phi & 0x3fffffff;      // fraction
    if (phi & 0x20000000) {      // fraction >= 0.5
        phi = phi - 0x40000000;  // fraction - 1.0
        q = q + 1;
    }

    /* compute remainder of x / (pi/2) */
#if USE_FMA
    float phif, plof, chif, clof, thif, tlof;
    phif = 0x1.0p27f * (float)(int32_t)(phi & 0xffffffe0);
    plof = (float)((plo >> 5) | (phi << (32-5)));
    thif = phif + plof;
    plof = (phif - thif) + plof;
    phif = thif;
    chif =  0x1.921fb6p-57f; // (1.5707963267948966 * 0x1.0p-57)_hi
    clof = -0x1.777a5cp-82f; // (1.5707963267948966 * 0x1.0p-57)_lo
    thif = phif * chif;
    tlof = fmaf (phif, chif, -thif);
    tlof = fmaf (phif, clof, tlof);
    tlof = fmaf (plof, chif, tlof);
    r = thif + tlof;
#else // USE_FMA
    /* record sign of fraction */
    uint32_t s = phi & 0x80000000;
    
    /* take absolute value of fraction */
    if ((int32_t)phi < 0) {
        phi = ~phi;
        plo = 0 - plo;
        phi += (plo == 0);
    }
    
    /* normalize fraction */
    e = 0;
    while ((int32_t)phi > 0) {
        phi = (phi << 1) | (plo >> 31);
        plo = plo << 1;
        e--;
    }
    
    /* multiply 32 high-order bits of fraction with pi/2 */
    phi = umul32_hi (phi, 0xc90fdaa2); // (uint32_t)rint(PI/2 * 2**31)
    
    /* normalize product */
    if ((int32_t)phi > 0) {
        phi = phi << 1;
        e--;
    }

    /* round and convert to floating point */
    uint32_t ri = s + ((e + 128) << 23) + (phi >> 8) + ((phi & 0xff) > 0x7e);
    r = uint32_as_float (ri);
#endif // USE_FMA
    if (a < 0.0f) {
        r = -r;
        q = -q;
    }

    *quadrant = q;
    return r;
}

/* Argument reduction for trigonometric functions that reduces the argument
   to the interval [-PI/4, +PI/4] and also returns the quadrant. It returns 
   -0.0f for an input of -0.0f 
*/
float trig_red_f (float a, float switch_over, int *q)
{    
    float j, r;

    if (fabsf (a) > switch_over) {
        /* Payne-Hanek style reduction. M. Payne and R. Hanek, "Radian reduction
           for trigonometric functions". SIGNUM Newsletter, 18:19-24, 1983
        */
        r = trig_red_slowpath_f (a, q);
    } else {
        /* Cody-Waite style reduction. W. J. Cody and W. Waite, "Software Manual
           for the Elementary Functions", Prentice-Hall 1980
        */
#if USE_FMA
        j = fmaf (a, 0x1.45f306p-1f, 0x1.8p+23f) - 0x1.8p+23f; // 6.36619747e-1, 1.25829120e+7
        r = fmaf (j, -0x1.921fb0p+00f, a); // -1.57079601e+00 // pio2_high
        r = fmaf (j, -0x1.5110b4p-22f, r); // -3.13916473e-07 // pio2_mid
        r = fmaf (j, -0x1.846988p-48f, r); // -5.39030253e-15 // pio2_low
#else // USE_FMA
        j = (a * 0x1.45f306p-1f + 0x1.8p+23f) - 0x1.8p+23f; // 6.36619747e-1, 1.25829120e+7
#if CW_STAGES == 2
        r = a - j * 0x1.921fb4p+0f;  // pio2_high
        r = r - j * 0x1.4442d2p-24f; // pio2_low
#elif CW_STAGES == 3
        r = a - j * 0x1.921f00p+00f; // 1.57078552e+00 // pio2_high
        r = r - j * 0x1.6a8880p-17f; // 1.08043314e-05 // pio2_mid
        r = r - j * 0x1.68c234p-39f; // 2.56334407e-12 // pio2_low
#endif // CW_STAGES
#endif // USE_FMA
        *q = (int)j;
    }
    return r;
}

/* Approximate sine on [-PI/4,+PI/4]. Maximum ulp error with USE_FMA = 1: 0.64196
   Returns -0.0f for an argument of -0.0f
   Polynomial approximation based on T. Myklebust, "Computing accurate 
   Horner form approximations to special functions in finite precision
   arithmetic", http://arxiv.org/abs/1508.03211, retrieved on 8/29/2016
*/
float sinf_poly (float a, float s)
{
    float r, t;
#if USE_FMA
    r =              0x1.80a000p-19f;  //  2.86567956e-6
    r = fmaf (r, s, -0x1.a0690cp-13f); // -1.98559923e-4
    r = fmaf (r, s,  0x1.111182p-07f); //  8.33338592e-3
    r = fmaf (r, s, -0x1.555556p-03f); // -1.66666672e-1
    t = fmaf (a, s, 0.0f); // ensure -0 is passed through
    r = fmaf (r, t, a);
#else // USE_FMA
    r =         0x1.80a000p-19f; //  2.86567956e-6
    r = r * s - 0x1.a0690cp-13f; // -1.98559923e-4
    r = r * s + 0x1.111182p-07f; //  8.33338592e-3
    r = r * s - 0x1.555556p-03f; // -1.66666672e-1
    t = a * s + 0.0f; // ensure -0 is passed through
    r = r * t + a;
#endif // USE_FMA
    return r;
}

/* Approximate cosine on [-PI/4,+PI/4]. Maximum ulp error with USE_FMA = 1: 0.87444 */
float cosf_poly (float s)
{
    float r;
#if USE_FMA
    r =              0x1.9a8000p-16f;  //  2.44677067e-5
    r = fmaf (r, s, -0x1.6c0efap-10f); // -1.38877297e-3
    r = fmaf (r, s,  0x1.555550p-05f); //  4.16666567e-2
    r = fmaf (r, s, -0x1.000000p-01f); // -5.00000000e-1
    r = fmaf (r, s,  0x1.000000p+00f); //  1.00000000e+0
#else // USE_FMA
    r =         0x1.9a8000p-16f; //  2.44677067e-5
    r = r * s - 0x1.6c0efap-10f; // -1.38877297e-3
    r = r * s + 0x1.555550p-05f; //  4.16666567e-2
    r = r * s - 0x1.000000p-01f; // -5.00000000e-1
    r = r * s + 0x1.000000p+00f; //  1.00000000e+0
#endif // USE_FMA
    return r;
}

/* Map sine or cosine value based on quadrant */
float sinf_cosf_core (float a, int i)
{
    float r, s;

    s = a * a;
    r = (i & 1) ? cosf_poly (s) : sinf_poly (a, s);
    if (i & 2) {
        r = 0.0f - r; // don't change "sign" of NaNs
    }
    return r;
}

/* maximum ulp error with USE_FMA = 1: 1.493253 */
float my_sinf (float a)
{
    float r;
    int i;

    a = a * 0.0f + a; // inf -> NaN
    r = trig_red_f (a, SIN_RED_SWITCHOVER, &i);
    r = sinf_cosf_core (r, i);
    return r;
}

/* maximum ulp error with USE_FMA = 1: 1.495098 */
float my_cosf (float a)
{
    float r;
    int i;

    a = a * 0.0f + a; // inf -> NaN
    r = trig_red_f (a, COS_RED_SWITCHOVER, &i);
    r = sinf_cosf_core (r, i + 1);
    return r;
}

/* re-interpret bit pattern of an IEEE-754 double as a uint64 */
uint64_t double_as_uint64 (double a)
{
    uint64_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

double floatUlpErr (float res, double ref)
{
    uint64_t i, j, err, refi;
    int expoRef;
    
    /* ulp error cannot be computed if either operand is NaN, infinity, zero */
    if (isnan (res) || isnan (ref) || isinf (res) || isinf (ref) ||
        (res == 0.0f) || (ref == 0.0f)) {
        return 0.0;
    }
    /* Convert the float result to an "extended float". This is like a float
       with 56 instead of 24 effective mantissa bits.
    */
    i = ((uint64_t)float_as_uint32(res)) << 32;
    /* Convert the double reference to an "extended float". If the reference is
       >= 2^129, we need to clamp to the maximum "extended float". If reference
is < 2^-126, we need to denormalize because of the float type's limited
       exponent range.
    */
    refi = double_as_uint64(ref);
    expoRef = (int)(((refi >> 52) & 0x7ff) - 1023);
    if (expoRef >= 129) {
        j = 0x7fffffffffffffffULL;
    } else if (expoRef < -126) {
        j = ((refi << 11) | 0x8000000000000000ULL) >> 8;
        j = j >> (-(expoRef + 126));
    } else {
        j = ((refi << 11) & 0x7fffffffffffffffULL) >> 8;
        j = j | ((uint64_t)(expoRef + 127) << 55);
    }
    j = j | (refi & 0x8000000000000000ULL);
    err = (i < j) ? (j - i) : (i - j);
    return err / 4294967296.0;
}

int main (void) 
{
    float arg, res, reff;
    uint32_t argi, resi, refi;
    int64_t diff, diffsum;
    double ref, ulp, maxulp;

    printf ("Testing sinf ...  ");
    diffsum = 0;
    maxulp = 0;
    argi = 0;
    do {
        arg = uint32_as_float (argi);
        res = my_sinf (arg);
        ref = sin ((double)arg);
        reff = (float)ref;
        resi = float_as_uint32 (res);
        refi = float_as_uint32 (reff);
        ulp = floatUlpErr (res, ref);
        if (ulp > maxulp) {
            maxulp = ulp;
        }
        diff = (resi > refi) ? (resi - refi) : (refi - resi);
        if (diff > MAX_DIFF) {
            printf ("\nerror @ %08x (% 15.8e): res=%08x (% 15.8e)  ref=%08x (%15.8e)\n", argi, arg, resi, res, refi, reff);
            return EXIT_FAILURE;
        }
        diffsum = diffsum + diff;
        argi++;
    } while (argi);
    printf ("PASSED. max ulp err = %.6f  diffsum = %lld\n", maxulp, diffsum);

    printf ("Testing cosf ...  ");
    diffsum = 0;
    maxulp = 0;
    argi = 0;
    do {
        arg = uint32_as_float (argi);
        res = my_cosf (arg);
        ref = cos ((double)arg);
        reff = (float)ref;
        resi = float_as_uint32 (res);
        refi = float_as_uint32 (reff);
        ulp = floatUlpErr (res, ref);
        if (ulp > maxulp) {
            maxulp = ulp;
        }
        diff = (resi > refi) ? (resi - refi) : (refi - resi);
        if (diff > MAX_DIFF) {
            printf ("\nerror @ %08x (% 15.8e): res=%08x (% 15.8e)  ref=%08x (%15.8e)\n", argi, arg, resi, res, refi, reff);
            return EXIT_FAILURE;
        }
        diffsum = diffsum + diff;
        argi++;
    } while (argi);
    printf ("PASSED. max ulp err = %.6f  diffsum = %lld\n", maxulp, diffsum);
    return EXIT_SUCCESS;
}

There's a thread on the Mathematics forum where user J. M. ain't a mathematician introduced an improved Taylor/Padé idea to approximate the cos and sin functions in the range [-pi, pi]. Here's the sine version translated to C++. This approximation is not as fast as the library std::sin() function, but it might be worth checking whether an SSE/AVX/FMA implementation helps enough with speed.

I have not tested the ULP error against the library sin() or cos() functions, but judging by the Julia Function Accuracy Test tool it looks like an excellent approximation method (add the code below to the runtest.jl module, which belongs to the Julia test suite); a plain C sketch of the same scheme follows the Julia listing.

function test_sine(x::AbstractFloat)
    f = 0.5
    z = x * 0.5
    k = 0
    while abs(z) > f
        z *= 0.5
        k = k + 1
    end
    z2 = z^2
    r = z * (1 + (z2 / 105 - 1) * ((z / 3)^2)) /
            (1 + (z2 / 7 - 4) * ((z / 3)^2))
    while k > 0
        r = (2 * r) / (1 - r * r)
        k = k - 1
    end
    return (2 * r) / (1 + r * r)
end

function test_cosine(x::AbstractFloat)
    f = 0.5
    z = x * 0.5
    k = 0
    while abs(z) > f
        z *= 0.5
        k = k + 1
    end
    z2 = z^2
    r = z * (1 + (z2 / 105 - 1) * ((z / 3)^2)) /
            (1 + (z2 / 7 - 4) * ((z / 3)^2))
    while k > 0
        r = (2 * r) / (1 - r * r)
        k = k - 1
    end
    return (1 - r * r) / (1 + r * r)
end

  
pii = 3.141592653589793238462643383279502884

MAX_SIN(n::Val{pii}, ::Type{Float16}) = 3.1415926535897932f0
MAX_SIN(n::Val{pii}, ::Type{Float32}) = 3.1415926535897932f0
#MAX_SIN(n::Val{pii}, ::Type{Float64}) = 3.141592653589793238462643383279502884
MIN_SIN(n::Val{pii}, ::Type{Float16}) = -3.1415926535897932f0
MIN_SIN(n::Val{pii}, ::Type{Float32}) = -3.1415926535897932f0
#MIN_SIN(n::Val{pii}, ::Type{Float64}) = -3.141592653589793238462643383279502884

for (func, base) in (sin=>Val(pii), test_sine=>Val(pii), cos=>Val(pii), test_cosine=>Val(pii))    
    for T in (Float16, Float32)
        xx = range(MIN_SIN(base,T),  MAX_SIN(base,T), length = 10^6);
        test_acc(func, xx)
    end
end
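For reference, here is a minimal C sketch of the same scheme (my translation of test_sine() above, not the C++ version mentioned in the thread, and with a hypothetical function name). The rational function is a [5/4] Padé-style approximant of tan(z); the argument is halved until |z| <= 0.5, the halvings are undone with the tangent double-angle formula, and sin(x) = 2*t/(1 + t^2) with t = tan(x/2) recovers the sine.

#include <math.h>  /* fabsf(); replace with a sign-bit mask if unavailable */

float pade_sinf (float x)
{
    float z = 0.5f * x;
    int k = 0;
    /* halve the argument until it is small enough for the approximant */
    while (fabsf (z) > 0.5f) {
        z *= 0.5f;
        k++;
    }
    /* Pade-style approximation of tan(z) for |z| <= 0.5 */
    float z2 = z * z;
    float w  = (z / 3.0f) * (z / 3.0f);
    float r  = z * (1.0f + (z2 / 105.0f - 1.0f) * w)
                 / (1.0f + (z2 / 7.0f - 4.0f) * w);
    /* undo the halvings: tan(2a) = 2*tan(a) / (1 - tan(a)^2) */
    while (k > 0) {
        r = (2.0f * r) / (1.0f - r * r);
        k--;
    }
    /* sin(x) = 2*tan(x/2) / (1 + tan(x/2)^2) */
    return (2.0f * r) / (1.0f + r * r);
}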

Results for the approximations and for sin() and cos() in the range [-pi, pi]:

Tol debug failed 0.0% of the time.
sin
ULP max 0.5008857846260071 at x = 2.203355
ULP mean 0.24990503381476237
Test Summary: | Pass  Total
Float32 sin   |    1      1
Tol debug failed 0.0% of the time.
test_sine
ULP max 0.001272978144697845 at x = 2.899093
ULP mean 1.179825295005716e-8
Test Summary:     | Pass  Total
Float32 test_sine |    1      1
Tol debug failed 0.0% of the time.
cos
ULP max 0.5008531212806702 at x = 0.45568538
ULP mean 0.2499933592458589
Test Summary: | Pass  Total
Float32 cos   |    1      1
Tol debug failed 0.0% of the time.
test_cosine
ULP max 0.0011584102176129818 at x = 1.4495481
ULP mean 1.6793535615395134e-8
Test Summary:       | Pass  Total
Float32 test_cosine |    1      1
