Fast implementation of binary exponentiation in OpenCL

Question

I've been trying to design a fast binary exponentiation implementation in OpenCL. My current implementation is very similar to the one in this book about pi.

// Returns 16^n mod ak
inline double expm (long n, double ak)
{
    double r = 16.0;
    long nt;

    if (ak == 1) return 0.;
    if (n == 0) return 1;
    if (n == 1) return fmod(16.0, ak);

    for (nt=1; nt <= n; nt <<=1);

    nt >>= 2;

    do
    {
        r = fmod(r*r, ak);
        if ((n & nt) != 0)
            r = fmod(16.0*r, ak);
        nt >>= 1;
    } while (nt != 0);
    return r;
}

Is there room for improvement? Right now my program is spending the vast majority of its time in this function.

Answer 1

My first thought is to vectorize it, for a potential speedup of ~1.6x. This uses 5 multiplies per loop iteration compared to 2 in the original, but needs only about a quarter as many iterations for sufficiently large n (so 5 multiplies per four bits of n instead of 8). Converting all the doubles to longs and swapping the fmods for %s may provide some further speedup, depending on the exact GPU used.

inline double expm(long n, double ak) {

    double4 r = (double4)(1.0, 1.0, 1.0, 1.0);
    //Lane i keeps bit i of every 4-bit group of n.
    long4 ns = n & (long4)(0x1111111111111111, 0x2222222222222222, 0x4444444444444444,
            0x8888888888888888);
    long nt;

    if(ak == 1) return 0.;

    for(nt=15; nt<n; nt<<=4); //This can probably be vectorized somehow as well.

    do {
        double4 tmp = r*r;
        tmp = tmp*tmp;
        tmp = tmp*tmp;
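        r = fmod(tmp*tmp, (double4)(ak)); //Raise it to the 16th power,
                                       //same as multiplying the exponent
                                       //(of the result) by 16, same as
                                       //bitshifting the exponent to the left 4 bits.
                                       //Note: r^16 is formed before any reduction, so it is
                                       //guaranteed exact in double only while it stays below 2^53.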

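        //Multiply in 16^1, 16^2, 16^4, 16^8 for the lanes whose bit is set in this group.
        r = select(fmod(r*(double4)(16.0, 256.0, 65536.0, 4294967296.0), (double4)(ak)),
                   r, (ns & nt) - 1);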
        nt >>= 4;
    } while(nt != 0); //Process n four bits at a time.

    return fmod(r.x*r.y*r.z*r.w, ak); //And then combine all of them.
}

Edit: I'm pretty sure it works now.
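
The suggestion of swapping the doubles for integers and fmod for % can be sketched in plain C as follows. This is only an illustration, not the answer's code: the name expm_int is hypothetical, and it assumes the modulus is below 2^32 so that every intermediate product fits in 64 bits. It keeps the same left-to-right bit scan as the function in the question.

```c
#include <stdint.h>

/* Hypothetical helper (not from the original post): returns 16^n mod m
   using 64-bit integer arithmetic instead of double and fmod.
   Assumption: 0 < m < 2^32, so r*r and r*16 never overflow a uint64_t. */
uint64_t expm_int(uint64_t n, uint64_t m)
{
    uint64_t r = 1 % m;               /* 16^0 mod m; this is 0 when m == 1 */
    uint64_t nt;

    if (n == 0) return r;

    /* Locate the highest set bit of n. */
    nt = (uint64_t)1 << 63;
    while ((n & nt) == 0) nt >>= 1;

    /* Scan the bits of n from most to least significant. */
    for (; nt != 0; nt >>= 1) {
        r = (r * r) % m;              /* square: doubles the exponent so far */
        if (n & nt)
            r = (r * 16) % m;         /* bit set: add 1 to the exponent */
    }
    return r;
}
```

Whether this is actually faster than the fmod version depends on how the target device handles 64-bit integer division, so it is worth measuring rather than assuming.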

Answer 2

The loop that finds the highest set bit of n (effectively nt = log2(n)) can be replaced by
if (n & 1) ...; n >>= 1;
in the do-while loop.
Given that initially r = 16, the reductions fmod(r*r, ak) and fmod(16*r, ak) can easily be deferred so that the modulo is calculated only every Nth iteration or so. Loop unrolling?
Also, why fmod?
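
A minimal sketch of the `if (n & 1) ...; n >>= 1;` idea, written in plain C with integers rather than doubles. The name expm_rtl is hypothetical, and the sketch assumes the modulus is below 2^32 so that products fit in 64 bits. It consumes n from the least significant bit upward, so no separate loop is needed to find the highest set bit.

```c
#include <stdint.h>

/* Hypothetical helper (not from the original post): returns 16^n mod m
   by scanning n from its lowest bit upward.
   Assumption: 0 < m < 2^32, so the products below fit in a uint64_t. */
uint64_t expm_rtl(uint64_t n, uint64_t m)
{
    uint64_t result = 1 % m;          /* running product, 16^0 mod m */
    uint64_t base = 16 % m;           /* holds 16^(2^k) mod m at step k */

    while (n != 0) {
        if (n & 1)
            result = (result * base) % m;  /* this bit contributes 16^(2^k) */
        base = (base * base) % m;          /* advance to the next power of two */
        n >>= 1;
    }
    return result;
}
```

The trade-off is that the conditional step becomes a general modular multiply by base instead of a multiply by the constant 16, so the saving from dropping the bit-search loop has to be weighed against that.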

Answer 1: 2, accepted, 2014-01-21 13:03:37

Answer 2: 0, 2014-01-21 05:10:06
