
Optimization tips for a CUDA code

I wrote a piece of code for computing the Self Quotient Image (SQI) in MATLAB, and now I want to rewrite part of it in parallel for a speedup. This part of the code is:

siz=15;
X=normalize8(X);
[a,b]=size(X);
filt = fspecial('gaussian',[siz siz],sigma);
padsize = floor(siz/2);
padX = padarray(X,[padsize, padsize],'symmetric','both');

t0 = tic; % -------------------------------------------------------------
Z=zeros(a,b);
for i=padsize+1:a+padsize
    for j=padsize+1:b+padsize
        region = padX(i-padsize:i+padsize, j-padsize:j+padsize);
        means= mean(region(:));
        M=return_step(region, means);
        filt1=filt.*M;

        summ=sum(sum(filt1));        

        filt1=(filt1/summ);
        Z(i-padsize,j-padsize)=(sum(sum(filt1.*region))/(siz*siz));
    end
end
toc(t0) % -------------------------------------------------------------

and the return_step function:

function M=return_step(X, means)

% Binary step mask: 1 where X is at or above the window mean, 0 elsewhere
[a,b]=size(X);
M=zeros(a,b);   % preallocate so M always matches the size of X
for i=1:a
    for j=1:b
        if X(i,j)>=means
            M(i,j)=1;
        end
    end
end

I wrote the kernel function below:

__global__ void returnstep(const double* x, double* m, double* filt, int leng, double mean, int i, int j, int width)
{
    // one thread per element of the siz x siz window
    int idx=threadIdx.y*blockDim.x+threadIdx.x;
    if(idx>=leng) return;

    // index of the matching pixel in the padded image
    int ridx= (j+threadIdx.y)*width+threadIdx.x+i;
    double xval= x[ridx];
    // Gaussian weight times pixel value where the pixel is at or above the window mean, zero elsewhere
    if (xval>=mean) m[idx]=filt[idx]*xval;
    else            m[idx]=0;
}

and then changed the MATLAB code as follows:

kernel= parallel.gpu.CUDAKernel('returnstep.ptx', 'returnstep.cu');
kernel.ThreadBlockSize= [double(siz) double(siz) 1];
GM = gpuArray(zeros(siz,siz));
GpadX = gpuArray(padX);
Gfilt = gpuArray(filt);

%% Process image
t0 = tic; % -------------------------------------------------------------
Z=zeros(a,b);
for i=padsize+1:a+padsize
    for j=padsize+1:b+padsize
        region = padX(i-padsize:i+padsize, j-padsize:j+padsize); % window is still extracted on the CPU
        means= mean(region(:));
        GM= feval(kernel, GpadX, GM, Gfilt, siz*siz, means, i-padsize-1, j-padsize-1, padXwidth);
        filt1=  gather(GM);

        summ=sum(sum(filt1));        

        filt1=(filt1/summ);
        Z(i-padsize,j-padsize)=(sum(sum(filt1))/(siz*siz));
    end
end
toc(t0) % -------------------------------------------------------------

My sequential code runs in 2.5 s for a 330x200 image, but the new parallel code's run time is 15 s. I don't know why, and I need some advice for improving it. I am new to CUDA programming.

> help gather
...
X = GATHER(A) when A is a GPUArray, X is an array in the local workspace
with the data transferred from the GPU device.
....

filt1 = gather(GM) copies GM from the GPU back to the CPU on every iteration, which is very inefficient. You should move the entire per-iteration computation, or preferably the whole loop nest, into the GPU kernel; otherwise you can forget about any speedup.
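As an illustration of the second option, here is a minimal sketch of a kernel in which each thread computes one output pixel, performing the window mean, the step mask and the normalised weighted sum entirely on the device, so that Z is gathered only once at the end. The kernel name sqiKernel, the compile-time filter size SIZ and the column-major indexing with leading dimension ldPadX are assumptions for illustration, not code from the question.

#define SIZ 15   // filter size, matching siz in the MATLAB code

__global__ void sqiKernel(const double* padX, const double* filt,
                          double* Z, int a, int b, int ldPadX)
{
    // one thread per output pixel Z(row, col), 0-based
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= a || col >= b) return;

    // mean of the SIZ x SIZ window in the padded image
    // (assumed column-major with leading dimension ldPadX)
    double mean = 0.0;
    for (int dj = 0; dj < SIZ; ++dj)
        for (int di = 0; di < SIZ; ++di)
            mean += padX[(col + dj) * ldPadX + (row + di)];
    mean /= (double)(SIZ * SIZ);

    // weighted sum with the step mask folded in, plus the normalisation term
    double summ = 0.0, acc = 0.0;
    for (int dj = 0; dj < SIZ; ++dj)
        for (int di = 0; di < SIZ; ++di) {
            double xval = padX[(col + dj) * ldPadX + (row + di)];
            if (xval >= mean) {
                double w = filt[dj * SIZ + di];
                summ += w;
                acc  += w * xval;
            }
        }

    // same per-pixel result as the sequential MATLAB loop body;
    // summ can only be zero in degenerate rounding cases, guard anyway
    Z[col * a + row] = (summ > 0.0) ? acc / summ / (double)(SIZ * SIZ) : 0.0;
}

From MATLAB this could be launched through parallel.gpu.CUDAKernel with ThreadBlockSize set to something like [16 16 1] and GridSize chosen to cover the a-by-b output, replacing the per-pixel feval/gather pair with a single kernel call followed by one gather(Z).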

My evaluation with a Sobel filter shows that the CPU outperforms the GPU on small images. I think your image is too small for a fair CPU-GPU comparison: the computation has to be large enough to hide the kernel-launch and communication overhead.
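A rough way to see that fixed cost is to time empty kernel launches with CUDA events, as in the sketch below. The launch count mirrors one call per pixel of a 330x200 image and the block size mirrors the 15x15 block of the question; the per-launch figure it prints (typically a few microseconds) does not even include the per-iteration gather, which is usually far more expensive.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void emptyKernel() {}

int main()
{
    const int launches = 66000;          // roughly one per pixel of a 330x200 image
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < launches; ++i)
        emptyKernel<<<1, 225>>>();        // same block size as the 15x15 kernel
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%d launches took %.1f ms (%.2f us per launch)\n",
           launches, ms, 1000.0f * ms / launches);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}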
