
Sending a 3D array to a CUDA kernel

I took the code given as an answer for How can I add up two 2d (pitched) arrays using nested for loops?, tried to adapt it from 2D to 3D, and changed a few other parts slightly. It now looks as follows:

 __global__ void doSmth(int*** a) {
  for(int i=0; i<2; i++)
   for(int j=0; j<2; j++)
    for(int k=0; k<2; k++) 
     a[i][j][k]=i+j+k;
 }

 int main() {
  int*** h_c = (int***) malloc(2*sizeof(int**));
  for(int i=0; i<2; i++) {
   h_c[i] = (int**) malloc(2*sizeof(int*));
   for(int j=0; j<2; j++)
    GPUerrchk(cudaMalloc((void**)&h_c[i][j],2*sizeof(int)));
  }
  int*** d_c;
  GPUerrchk(cudaMalloc((void****)&d_c,2*sizeof(int**)));
  GPUerrchk(cudaMemcpy(d_c,h_c,2*sizeof(int**),cudaMemcpyHostToDevice));
  doSmth<<<1,1>>>(d_c);
  GPUerrchk(cudaPeekAtLastError());

  int res[2][2][2];
  for(int i=0; i<2; i++)
   for(int j=0; j<2; j++)
    GPUerrchk(cudaMemcpy(&res[i][j][0],
    h_c[i][j],2*sizeof(int),cudaMemcpyDeviceToHost));  

  for(int i=0; i<2; i++)
   for(int j=0; j<2; j++)
    for(int k=0; k<2; k++) 
     printf("[%d][%d][%d]=%d\n",i,j,k,res[i][j][k]);     
 }

In the code above I use 2 as the size of each dimension of h_c; in the real implementation these sizes will be very large, and different for each part of the subarrays of "int***" (or more dimensions). I am having trouble with the part after the kernel call, where I try to copy the results back into the res array. Can you help me fix the problem? Please show the solution in the style I am using above. Thanks!

First of all, I don't think talonmies, when he posted the response to the previous question you mention, intended it to be representative of good coding. So figuring out how to extend it to 3D might not be the best use of your time. For example, why would we want to write a program that uses exactly one thread? While there might be legitimate uses for such a kernel, this is not one of them. Your kernel has the opportunity to do a bunch of independent work in parallel, but instead you are forcing it all onto one thread and serializing it. The definition of the parallel work is:

a[i][j][k]=i+j+k;

Let's figure out how to handle that in parallel on the GPU.

Another introductory observation I would make is that since we are dealing with problems whose sizes are known ahead of time, let's use C to tackle them with as much benefit as we can get from the language. Nested loops of cudaMalloc calls may be needed in some cases, but I don't think this is one of them.

Here's a code that accomplishes the work in parallel:

#include <stdio.h>
#include <stdlib.h>
// set a 3D volume
// To compile it with nvcc execute: nvcc -O2 -o set3d set3d.cu
//define the data set size (cubic volume)
#define DATAXSIZE 100
#define DATAYSIZE 100
#define DATAZSIZE 20
//define the chunk sizes that each threadblock will work on
#define BLKXSIZE 32
#define BLKYSIZE 4
#define BLKZSIZE 4

// for cuda error checking
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            return 1; \
        } \
    } while (0)

// device function to set the 3D volume
__global__ void set(int a[][DATAYSIZE][DATAXSIZE])
{
    unsigned idx = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned idy = blockIdx.y*blockDim.y + threadIdx.y;
    unsigned idz = blockIdx.z*blockDim.z + threadIdx.z;
    if ((idx < (DATAXSIZE)) && (idy < (DATAYSIZE)) && (idz < (DATAZSIZE))){
      a[idz][idy][idx] = idz+idy+idx;
      }
}

int main(int argc, char *argv[])
{
    typedef int nRarray[DATAYSIZE][DATAXSIZE];
    const dim3 blockSize(BLKXSIZE, BLKYSIZE, BLKZSIZE);
    const dim3 gridSize(((DATAXSIZE+BLKXSIZE-1)/BLKXSIZE), ((DATAYSIZE+BLKYSIZE-1)/BLKYSIZE), ((DATAZSIZE+BLKZSIZE-1)/BLKZSIZE));
// overall data set sizes
    const int nx = DATAXSIZE;
    const int ny = DATAYSIZE;
    const int nz = DATAZSIZE;
// pointers for data set storage via malloc
    nRarray *c; // storage for result stored on host
    nRarray *d_c;  // storage for result computed on device
// allocate storage for data set
    if ((c = (nRarray *)malloc((nx*ny*nz)*sizeof(int))) == 0) {fprintf(stderr,"malloc1 Fail \n"); return 1;}
// allocate GPU device buffers
    cudaMalloc((void **) &d_c, (nx*ny*nz)*sizeof(int));
    cudaCheckErrors("Failed to allocate device buffer");
// compute result
    set<<<gridSize,blockSize>>>(d_c);
    cudaCheckErrors("Kernel launch failure");
// copy output data back to host

    cudaMemcpy(c, d_c, ((nx*ny*nz)*sizeof(int)), cudaMemcpyDeviceToHost);
    cudaCheckErrors("CUDA memcpy failure");
// and check for accuracy
    for (unsigned i=0; i<nz; i++)
      for (unsigned j=0; j<ny; j++)
        for (unsigned k=0; k<nx; k++)
          if (c[i][j][k] != (i+j+k)) {
            printf("Mismatch at z= %d, y= %d, x= %d  Host= %d, Device = %d\n", i, j, k, (i+j+k), c[i][j][k]);
            return 1;
            }
    printf("Results check!\n");
    free(c);
    cudaFree(d_c);
    cudaCheckErrors("cudaFree fail");
    return 0;
}

Since you've asked for it in the comments, here is the smallest set of changes I could make to your code to get it to work. Let's also remind ourselves of some of talonmies' comments from the previous question you reference:

"For code complexity and performance reasons, you really don't want to do that, using arrays of pointers in CUDA code is both harder and slower than the alternative using linear memory."

"it is such a poor idea compared to using linear memory."

I had to diagram this out on paper to make sure I got all the pointer copying correct.

#include <cstdio>
#include <cstdlib>
inline void GPUassert(cudaError_t code, const char * file, int line, bool Abort=true)
{
    if (code != 0) {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code),file,line);
        if (Abort) exit(code);
    }
}

#define GPUerrchk(ans) { GPUassert((ans), __FILE__, __LINE__); }



 __global__ void doSmth(int*** a) {
  for(int i=0; i<2; i++)
   for(int j=0; j<2; j++)
    for(int k=0; k<2; k++)
     a[i][j][k]=i+j+k;
 }
 int main() {
  int*** h_c = (int***) malloc(2*sizeof(int**));
  for(int i=0; i<2; i++) {
   h_c[i] = (int**) malloc(2*sizeof(int*));
   for(int j=0; j<2; j++)
    GPUerrchk(cudaMalloc((void**)&h_c[i][j],2*sizeof(int)));
  }
  int ***h_c1 = (int ***) malloc(2*sizeof(int **));
  for (int i=0; i<2; i++){
    GPUerrchk(cudaMalloc((void***)&(h_c1[i]), 2*sizeof(int*)));
    GPUerrchk(cudaMemcpy(h_c1[i], h_c[i], 2*sizeof(int*), cudaMemcpyHostToDevice));
    }
  int*** d_c;
  GPUerrchk(cudaMalloc((void****)&d_c,2*sizeof(int**)));
  GPUerrchk(cudaMemcpy(d_c,h_c1,2*sizeof(int**),cudaMemcpyHostToDevice));
  doSmth<<<1,1>>>(d_c);
  GPUerrchk(cudaPeekAtLastError());
  int res[2][2][2];
  for(int i=0; i<2; i++)
   for(int j=0; j<2; j++)
    GPUerrchk(cudaMemcpy(&res[i][j][0], h_c[i][j],2*sizeof(int),cudaMemcpyDeviceToHost));

  for(int i=0; i<2; i++)
   for(int j=0; j<2; j++)
    for(int k=0; k<2; k++)
     printf("[%d][%d][%d]=%d\n",i,j,k,res[i][j][k]);
 }

In a nutshell, we have to do a successive sequence of:

  1. malloc a multidimensional array of pointers (on the host), one dimension less than the problem size, with the last dimension being a set of pointers to regions cudaMalloc'ed onto the device rather than the host.
  2. create another multidimensional array of pointers, of the same class as that created in the previous step, but one dimension less than that created in the previous step. This array must also have its final ranks cudaMalloc'ed on the device.
  3. copy the last set of host pointers from the step before the previous one into the area cudaMalloc'ed on the device in the previous step.
  4. repeat steps 2-3 until we end up with a single (host) pointer pointing to the multidimensional array of pointers, all of which are now resident on the device.
