
Optimization tips for a CUDA code

I wrote a piece of code for computing the Self Quotient Image (SQI) in MATLAB, and now I want to rewrite part of it in parallel for a speedup. This part of the code is:

siz=15;
X=normalize8(X);
[a,b]=size(X);
filt = fspecial('gaussian',[siz siz],sigma);
padsize = floor(siz/2);
padX = padarray(X,[padsize, padsize],'symmetric','both');

t0 = tic; % -------------------------------------------------------------
Z=zeros(a,b);
for i=padsize+1:a+padsize
    for j=padsize+1:b+padsize
        region = padX(i-padsize:i+padsize, j-padsize:j+padsize);
        means= mean(region(:));
        M=return_step(region, means);
        filt1=filt.*M;

        summ=sum(sum(filt1));        

        filt1=(filt1/summ);
        Z(i-padsize,j-padsize)=(sum(sum(filt1.*region))/(siz*siz));
    end
end
toc(t0) % -------------------------------------------------------------

and the return_step function:

function M=return_step(X, means)

% Binary step mask: 1 where X is at or above the window mean, 0 elsewhere
[a,b]=size(X);
M=zeros(a,b);   % preallocate so M always matches the size of X
for i=1:a
    for j=1:b
        if X(i,j)>=means
            M(i,j)=1;
        end
    end
end

I wrote the kernel function below:

__global__ void returnstep(const double* x, double* m, double* filt, int leng, double mean, int i, int j, int width)
{
    // one thread per element of the siz x siz window
    int idx=threadIdx.y*blockDim.x+threadIdx.x;
    if(idx>=leng) return;

    // index of the matching pixel in the padded image
    int ridx= (j+threadIdx.y)*width+threadIdx.x+i;
    double xval= x[ridx];
    // Gaussian weight times pixel value where the pixel is at or above the window mean, zero elsewhere
    if (xval>=mean) m[idx]=filt[idx]*xval;
    else            m[idx]=0;
}

and then changed the MATLAB code as follows:

kernel= parallel.gpu.CUDAKernel('returnstep.ptx', 'returnstep.cu');
kernel.ThreadBlockSize= [double(siz) double(siz) 1];
GM = gpuArray(zeros(siz,siz));
GpadX = gpuArray(padX);
Gfilt = gpuArray(filt);

%% Process image
t0 = tic; % -------------------------------------------------------------
Z=zeros(a,b);
for i=padsize+1:a+padsize
    for j=padsize+1:b+padsize
        region = padX(i-padsize:i+padsize, j-padsize:j+padsize); % window is still extracted on the CPU
        means= mean(region(:));
        GM= feval(kernel, GpadX, GM, Gfilt, siz*siz, means, i-padsize-1, j-padsize-1, padXwidth);
        filt1=  gather(GM);

        summ=sum(sum(filt1));        

        filt1=(filt1/summ);
        Z(i-padsize,j-padsize)=(sum(sum(filt1))/(siz*siz));
    end
end
toc(t0) % -------------------------------------------------------------

My sequential code runs in 2.5 s for a 330x200 image, but the new parallel code's run time is 15 s. I don't know why, and I need some advice for improving it. I am new to CUDA programming.

> help gather
...
X = GATHER(A) when A is a GPUArray, X is an array in the local workspace
with the data transferred from the GPU device.
....

filt1 = gather(GM) copies GM from the GPU back to the CPU on every iteration, which is very inefficient. You should move the entire per-iteration computation, or preferably the whole loop nest, into the GPU kernel; otherwise you can forget about any speedup.
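As an illustration of the second option, here is a minimal sketch of a kernel in which each thread computes one output pixel, performing the window mean, the step mask and the normalised weighted sum entirely on the device, so that Z is gathered only once at the end. The kernel name sqiKernel, the compile-time filter size SIZ and the column-major indexing with leading dimension ldPadX are assumptions for illustration, not code from the question.

#define SIZ 15   // filter size, matching siz in the MATLAB code

__global__ void sqiKernel(const double* padX, const double* filt,
                          double* Z, int a, int b, int ldPadX)
{
    // one thread per output pixel Z(row, col), 0-based
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= a || col >= b) return;

    // mean of the SIZ x SIZ window in the padded image
    // (assumed column-major with leading dimension ldPadX)
    double mean = 0.0;
    for (int dj = 0; dj < SIZ; ++dj)
        for (int di = 0; di < SIZ; ++di)
            mean += padX[(col + dj) * ldPadX + (row + di)];
    mean /= (double)(SIZ * SIZ);

    // weighted sum with the step mask folded in, plus the normalisation term
    double summ = 0.0, acc = 0.0;
    for (int dj = 0; dj < SIZ; ++dj)
        for (int di = 0; di < SIZ; ++di) {
            double xval = padX[(col + dj) * ldPadX + (row + di)];
            if (xval >= mean) {
                double w = filt[dj * SIZ + di];
                summ += w;
                acc  += w * xval;
            }
        }

    // same per-pixel result as the sequential MATLAB loop body;
    // summ can only be zero in degenerate rounding cases, guard anyway
    Z[col * a + row] = (summ > 0.0) ? acc / summ / (double)(SIZ * SIZ) : 0.0;
}

From MATLAB this could be launched through parallel.gpu.CUDAKernel with ThreadBlockSize set to something like [16 16 1] and GridSize chosen to cover the a-by-b output, replacing the per-pixel feval/gather pair with a single kernel call followed by one gather(Z).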

My evaluation with a Sobel filter shows that the CPU outperforms the GPU on small images. I think your image is too small for a fair CPU-GPU comparison: the computation has to be large enough to hide the kernel-launch and communication overhead.
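A rough way to see that fixed cost is to time empty kernel launches with CUDA events, as in the sketch below. The launch count mirrors one call per pixel of a 330x200 image and the block size mirrors the 15x15 block of the question; the per-launch figure it prints (typically a few microseconds) does not even include the per-iteration gather, which is usually far more expensive.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void emptyKernel() {}

int main()
{
    const int launches = 66000;          // roughly one per pixel of a 330x200 image
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < launches; ++i)
        emptyKernel<<<1, 225>>>();        // same block size as the 15x15 kernel
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%d launches took %.1f ms (%.2f us per launch)\n",
           launches, ms, 1000.0f * ms / launches);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}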
