How to copy a 2D array in CUDA?

I am new to CUDA and still trying to figure things out, so this question may be dumb, but I can't seem to figure out the problem, so bear with me.

I am trying to copy a 2D array to the GPU. The size of the array is N*N (a square array). I'm trying to copy it using cudaMallocPitch() & cudaMemcpy2D(). The problem is that I seem to be copying only the first row of the array and nothing else. I can't find what exactly I'm doing wrong.

My code:

void function(){
   double A[N][N];
       //code to fill out the array.
 
   double* d_A;
   size_t pitch;
   cudaMallocPitch(&d_A, &pitch, N * sizeof(double), N);
   cudaMemcpy2D(d_A, pitch, A, N * sizeof(double) , N * sizeof(double), N, cudaMemcpyHostToDevice);

   int threadnum = 1;
   int blocknum = 1; 
   
   kernal_print<<<blocknum, threadnum>>>(d_A, N); 
   
   //copying back to host & freeing up memory

}

__global__ void kernal_print(double* d_A, int N){
   int xIdx = threadIdx.x + blockDim.x * blockIdx.x; 
   int yIdx = threadIdx.y + blockDim.y * blockIdx.y;

   printf("\n");
   for(int i = 0; i < N*N; i++){
       printf("%f, ",d_A[i]);
   }
   printf("\n");
}

The code above will only print the first row of whatever matrix I have. So for example a 3x3 matrix that looks like this:

1 2 3
4 5 6
7 8 9

the code will print (1 2 3 0 0 0 0 0 0)

Any idea of what I am doing wrong? Thanks in advance!

This question may be useful for background.

Perhaps you don't know what a pitched allocation is. A pitched allocation looks like this:

X  X  X  P  P  P
X  X  X  P  P  P
X  X  X  P  P  P

The above could represent storage for a 3x3 array (elements represented by X) that is pitched, with a pitch of 6 elements (the pitch "elements" are represented by P).

You'll have no luck accessing such a storage arrangement if you don't follow the guidelines given in the reference manual for cudaMallocPitch. In-kernel access to such a pitched allocation should be done as follows:

T* pElement = (T*)((char*)BaseAddress + Row * pitch) + Column;

You'll note that the above formula depends on the pitch value that was returned by cudaMallocPitch. If you don't pass that value to your kernel, you won't have any luck with this.

Because you are not doing that, the proximal reason for your observation:

the code will print (1 2 3 0 0 0 0 0 0)

is because your indexing is reading just the first "row" of that pitched allocation, and the P elements are showing up as zero (although that's not guaranteed).

We can fix your code simply by implementing the suggestions given in the reference manual:

$ cat t2153.cu
#include <cstdio>
const size_t N = 3;
__global__ void kernal_print(double* d_A, size_t my_N, size_t pitch){
//   int xIdx = threadIdx.x + blockDim.x * blockIdx.x;
//   int yIdx = threadIdx.y + blockDim.y * blockIdx.y;

   printf("\n");
   for(int row = 0; row < my_N; row++)
     for (int col = 0; col < my_N; col++){
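       // apply the reference-manual formula: offset by row * pitch bytes from the base, then index by column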
       double* pElement = (double *)((char*)d_A + row * pitch) + col;
       printf("%f, ",*pElement);
     }
   printf("\n");
}

void function(){
   double A[N][N];
   for (size_t row = 0; row < N; row++)
     for (size_t col = 0; col < N; col++)
       A[row][col] = row*N+col+1;
   double* d_A;
   size_t pitch;
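   // cudaMallocPitch may pad each row; the actual row stride in bytes is returned in pitch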
   cudaMallocPitch(&d_A, &pitch, N * sizeof(double), N);
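   // destination pitch is the device pitch; source pitch and width are the packed host row width, N * sizeof(double)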
   cudaMemcpy2D(d_A, pitch, A, N * sizeof(double) , N * sizeof(double), N, cudaMemcpyHostToDevice);

   int threadnum = 1;
   int blocknum = 1;

   kernal_print<<<blocknum, threadnum>>>(d_A, N, pitch);
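   // wait for the kernel to finish so its printf output is flushed before returning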
   cudaDeviceSynchronize();
}

int main(){

  function();
}
$ nvcc -o t2153 t2153.cu
$ compute-sanitizer ./t2153
========= COMPUTE-SANITIZER

1.000000, 2.000000, 3.000000, 4.000000, 5.000000, 6.000000, 7.000000, 8.000000, 9.000000,
========= ERROR SUMMARY: 0 errors
$
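
As an aside, here is a minimal sketch, not part of the original answer, of how the same pitch-based formula can be used with one thread per element, along the lines of the xIdx/yIdx indexing that the original kernel started with. The kernel name scale_pitched, the factor parameter, and the 16x16 block shape are illustrative assumptions:

__global__ void scale_pitched(double* d_A, size_t my_N, size_t pitch, double factor){
   size_t col = threadIdx.x + (size_t)blockDim.x * blockIdx.x;
   size_t row = threadIdx.y + (size_t)blockDim.y * blockIdx.y;
   if (row < my_N && col < my_N){
      // offset by row * pitch bytes from the base address, then index by column
      double* pElement = (double*)((char*)d_A + row * pitch) + col;
      *pElement *= factor;
   }
}

// a possible launch from inside function() above, where d_A, N and pitch are already defined:
dim3 block(16, 16);
dim3 grid((N + block.x - 1)/block.x, (N + block.y - 1)/block.y);
scale_pitched<<<grid, block>>>(d_A, N, pitch, 2.0);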

A few comments:

  • The usage of the term 2D can have varied interpretations.
  • Using a pitched allocation is not necessary for 2D work, and it may also have no practical value (not making your code simpler or more performant).
  • For further discussion of the varied ways of doing "2D work", please read the answer I linked.
  • This sort of allocation: double A[N][N]; may give you trouble for large N, because it is a stack-based allocation. Instead, use a dynamic allocation (which may affect a number of the methods you use to handle it). There are various questions covering this, such as this one. (A non-pitched, dynamically allocated version is sketched after this list.)
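
For completeness, here is a minimal sketch of the non-pitched, dynamically allocated alternative mentioned in the last two bullets. It is not part of the original answer; the kernel name print_flat and the use of std::vector for the host array are illustrative assumptions:

#include <cstdio>
#include <vector>

__global__ void print_flat(const double* d_A, size_t my_N){
   printf("\n");
   for (size_t i = 0; i < my_N*my_N; i++)
      printf("%f, ", d_A[i]);
   printf("\n");
}

int main(){
   const size_t N = 3;
   std::vector<double> A(N*N);               // heap allocation, safe for large N
   for (size_t row = 0; row < N; row++)
      for (size_t col = 0; col < N; col++)
         A[row*N+col] = row*N+col+1;
   double* d_A;
   cudaMalloc(&d_A, N*N*sizeof(double));     // ordinary linear allocation, no pitch
   cudaMemcpy(d_A, A.data(), N*N*sizeof(double), cudaMemcpyHostToDevice);
   print_flat<<<1,1>>>(d_A, N);              // indexed as row*N+col inside the kernel
   cudaDeviceSynchronize();
   cudaFree(d_A);
}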
