
comparing Matlab vs CUDA correlation and reduction on a 2D array

I am trying to compare cross-correlation computed with an FFT against the direct windowing method.

My Matlab code is:

isize = 20;
n = 7;
for i = 1:n   %% 7x7 correlation map
  for j = 1:n
    %% ffcorr1 is 20x20 (400 elements); ref is 26x26 (676 elements)
    xcout(i,j) = sum(sum(ffcorr1 .* ref(i:i+isize-1,j:j+isize-1)));
  end
end

similar CUDA kernel:

__global__ void xc_corr(double* in_im, double* ref_im, int pix3, int isize, int n, double* out1, double* temp1, double* sum_temp1)
{

    int p = blockIdx.x * blockDim.x + threadIdx.x;
    int q = 0;
    int i = 0;
    int j = 0;
    int summ = 0;

    for(i = 0; i < n; ++i)
    {
        for(j = 0; j < n; ++j)
        {
            summ  = 0; //force update
            for(p = 0; p < pix1; ++p)
            {
                for(q = 0; q < pix1; ++q)
                {
                    temp1[((i*n+j)*pix1*pix1)+p*pix1+q] = in_im[p*pix1+q] * ref_im[(p+i)*pix1+(q+j)];               
                    sum_temp1[((i*n+j)*pix1*pix1)+p*pix1+q] += temp1[((i*n+j)*pix1*pix1)+p*pix1+q];
                    out1[i*n+j] = sum_temp1[((i*n+j)*pix1*pix1)+p*pix1+q];
                }
            }       
        }
    }
}

I have launched this kernel as:

int blocksize = 64; //multiple of 32
int nblocks = (pix3+blocksize-1)/blocksize; //round up to cover all pix3 = 400 elements
xc_corr <<< nblocks,blocksize >>> (ffcorr1, ref_d, pix3, isize, npix, xcout, xc_partial);
cudaThreadSynchronize();

Somehow, when I diff the output files, I see that the CUDA kernel computes only the first 400 elements.

What is the correct way to write this kernel?

Also, what is the difference when i and j are declared as shown below in my kernel?

int i = blockIdx.x * blockDim.y + threadIdx.x * threadIdx.y;
int j = blockIdx.y * blockDim.x + threadIdx.x * threadIdx.y;
int blocksize = 64; //multiple of 32
int nblocks = (pix3+blocksize-1)/blocksize; //round up to cover all pix3 = 400 elements
xc_corr <<< nblocks,blocksize >>> (ffcorr1, ref_d, pix3, isize, npix, xcout, xc_partial);

means that you are launching 64 threads per block, with the number of threadblocks rounded up so that all pix3 elements are covered. If pix3 is indeed 400, then you are processing 400 elements: you launch 7 threadblocks of 64 threads each (448 threads total), and the last 48 threads do nothing.
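Note that the surplus threads only stay idle if the kernel masks them with a bounds check, and the usual structure for this kind of computation is one thread per output element. A minimal sketch based on the Matlab loop above (this is an illustration, not your code; it assumes in_im is isize x isize and ref_im is (isize+n-1) x (isize+n-1), both row-major):

__global__ void xc_corr_sketch(const double* in_im, const double* ref_im,
                               int isize, int n, double* out1)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n * n) return;     // mask the surplus threads
    int i = idx / n;              // row shift
    int j = idx % n;              // column shift
    int refw = isize + n - 1;     // row stride of the reference image
    double sum = 0.0;             // per-thread accumulator, no races
    for (int p = 0; p < isize; ++p)
        for (int q = 0; q < isize; ++q)
            sum += in_im[p * isize + q] * ref_im[(p + i) * refw + (q + j)];
    out1[idx] = sum;              // each thread owns exactly one output
}

Launched as, say, xc_corr_sketch<<<1, 64>>>(ffcorr1, ref_d, 20, 7, xcout), the first 49 threads each compute one entry of the 7x7 map and the other 15 exit immediately.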

I'm not too sure what's the problem here.

Also,

int i = blockIdx.x * blockDim.y + threadIdx.x * threadIdx.y;
int j = blockIdx.y * blockDim.x + threadIdx.x * threadIdx.y;

blocksize and nblocks are actually converted to dim3 vectors, so they each have (x,y,z) components. If you call a kernel with <<<7,64>>> (grid size first, then block size), that is translated to

dim3 nblocks(7,1,1);
dim3 blocksize(64,1,1);
kernel<<<nblocks, blocksize>>>();

so for each kernel launch, blockIdx and threadIdx each have three components, x, y, and z, corresponding to the 3D grid of blocks and threads you are in. In your case, since you only launch an x dimension, blockIdx.y and threadIdx.y are always going to be 0 no matter what. So essentially, they're useless.
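If you actually want a two-dimensional index, the usual approach is to launch a 2D grid and take i and j from the y and x components; a minimal sketch (names and sizes are illustrative, not from your code):

__global__ void index2d(double* out, int width, int height)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (i < height && j < width)
        out[i * width + j] = 0.0;  // placeholder per-element work
}

dim3 block(16, 16);
dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
index2d<<<grid, block>>>(d_out, width, height);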

Honestly, it seems like you should go reread the basics of CUDA in the user manual, because there are a lot of basics you seem to be missing. Explaining it all here wouldn't be economical, since it's written down in good documentation you can get here. And if you just want a faster FFT with CUDA, there are a number of libraries on Nvidia's CUDA Zone that you can download and install, which will do it for you if you don't care about learning CUDA.
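For example, a 2D double-precision forward transform with cuFFT is only a few calls; a minimal sketch with error checking omitted (the array size here is chosen to match the 26x26 reference image above):

#include <cufft.h>
#include <cuda_runtime.h>

int nx = 26, ny = 26;
cufftDoubleComplex* d_data;
cudaMalloc(&d_data, sizeof(cufftDoubleComplex) * nx * ny);
// ... fill d_data with the image (e.g. via cudaMemcpy) ...

cufftHandle plan;
cufftPlan2d(&plan, nx, ny, CUFFT_Z2Z);             // double complex-to-complex
cufftExecZ2Z(plan, d_data, d_data, CUFFT_FORWARD); // in-place forward FFT

cufftDestroy(plan);
cudaFree(d_data);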

Best of luck mate.

PS. you don't need to call cudaThreadSynchronize after each kernel ;)
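Kernels issued to the same stream execute in launch order, and a blocking cudaMemcpy back to the host synchronizes implicitly, so one sync point at the end is enough; a small sketch (kernel and buffer names are illustrative):

// Kernels on the same (default) stream run in launch order,
// so no explicit synchronization is needed between them.
kernelA<<<nblocks, blocksize>>>(d_buf);
kernelB<<<nblocks, blocksize>>>(d_buf);  // starts only after kernelA finishes

// A blocking copy back to the host waits for both kernels to complete:
cudaMemcpy(h_buf, d_buf, nbytes, cudaMemcpyDeviceToHost);

(Also note that cudaThreadSynchronize is deprecated in newer CUDA versions in favor of cudaDeviceSynchronize.)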
