
CUDA code optimization; Number of registers

I am pasting some code here for everyone to see.

__global__ void Integrate(double a, double b) {
    __shared__ double extrapol[16];
    __shared__ double result[32];
    __shared__ double h;
    __shared__ double err;

    __shared__ double x;
    __shared__ int n;

    if (threadIdx.x == 0) {
        h       = b - a;
        err     = 1.0;

        if (0.0 == a)
            extrapol[0] = 0.5 * h * myfunc(b);
        else
            extrapol[0] = 0.5 * h * (myfunc(a) + myfunc(b));

        n = 1;
    }

    for (int i = 1; i < 16; i++) {
        if (threadIdx.x == 0)
            x = a + h * 0.5;

        __syncthreads();

        if (err <= EPSILON)
            break;

        Trapezoid(result, x, h, n);
        if (threadIdx.x == 0) {
            result[0] = (extrapol[0] + h * result[0]) * 0.5;

            double power = 1.0;
            for (int k = 0; k < i; k++) {
               power *= 4.0;
               double sum  = (power * result[0] - extrapol[k]) / (power - 1.0);
               extrapol[k] = result[0];
               result[0] = sum;
            }

            err = fabs(result[0] - extrapol[i - 1]);
            extrapol[i] = result[0];
            n *= 2;
            h *= 0.5;
         }
    }
}

Essentially it is an adaptive numerical integrator (Romberg). The device functions used in this global function are:

__device__ void Trapezoid(double *sdata, double x, double h, int n) {
    int nIdx = threadIdx.x + blockIdx.x * blockDim.x;
    sdata[nIdx] = 0;

    while (nIdx < n) {
       sdata[threadIdx.x] += myfunc(x + (nIdx * h));
       nIdx += 32;
    }
    Sum(sdata, threadIdx.x);
}

Parallel reduction function:

 __device__ void Sum(volatile double *sdata, int tId) {
     if (tId < 16) {
         sdata[tId] += sdata[tId + 16];
         sdata[tId] += sdata[tId + 8];
         sdata[tId] += sdata[tId + 4];
         sdata[tId] += sdata[tId + 2];
         sdata[tId] += sdata[tId + 1];
     }
}

And finally the function I am trying to integrate (a simple mock-up function) is given as:

__device__ double myfunc(double x) {
     return 1 / x;
}

The code executes well and the expected integral is obtained. The kernel is launched in the following manner (for now):

Integrate <<< 1, 32 >>>(1, 2);

Question:

When I use the NVIDIA Visual Profiler to check the register usage of this function, it turns out to be 52 registers per thread. I don't understand why: most of the variables in this code are shared variables. Can you let me know how I can find out which parts of my code are using registers?

How can I reduce them? Is there any optimization I can do with this code?

Hardware

I am using a Fermi device, a GeForce GTX 470, compute capability 2.0.

Thanks,

Register usage is not directly related to the number of defined variables because, for example, registers are also used to store the intermediate results of calculations for which no variable is defined.
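As a contrived sketch of this point (the kernel name and pointer layout here are made up purely for illustration), even a function whose only named variable is shared still needs per-thread registers for loaded operands and intermediate products:

```cuda
__global__ void demo(double *out) {
    __shared__ double s;

    if (threadIdx.x == 0) {
        // The four loads from out[] and the intermediate multiply/add
        // results all occupy registers before the single store to s,
        // even though no local variable is declared for any of them.
        s = out[0] * out[1] + out[2] * out[3];
    }
    __syncthreads();

    // Reading s back also goes through a register before the global store.
    out[threadIdx.x] = s;
}
```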

One possibility for spotting the parts of the code that use the most registers is to hack the PTX file by manually annotating it with a syntax like

asm volatile ("// code at this line is doing this and this ..."); 
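To get a PTX file to inspect in the first place (the source file name `integrate.cu` below is an assumption; use your actual file), you can ask nvcc to stop at the PTX stage:

```shell
# Emit PTX for a compute-capability-2.0 target instead of a full binary.
nvcc -arch=sm_20 -ptx integrate.cu -o integrate.ptx
```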

You can use the ptxas program to analyze your PTX files and show the register and memory usage of each function. In your case you'd run ptxas --gpu-name sm_20 -v code.ptx.
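As for actually reducing the count, this is a general-purpose knob rather than anything specific to this kernel, but you can cap the registers per thread (the compiler spills the excess to local memory) either globally with nvcc's -maxrregcount=N option or per kernel with __launch_bounds__. A sketch applied to your kernel signature:

```cuda
// Tell the compiler at most 32 threads per block will be launched and
// at least 8 resident blocks per multiprocessor are desired; this
// pressures ptxas to keep per-thread register usage low, spilling to
// local memory if necessary. The value 8 is an arbitrary example.
__global__ void __launch_bounds__(32, 8) Integrate(double a, double b) {
    // ... kernel body as above ...
}
```

Note that spilling trades register pressure for local-memory traffic, so profile before and after to check it actually helps.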
