Flattening a 3D array to 1D in CUDA

I have the following code that I'm trying to implement in CUDA, but I'm having a problem flattening a 3D array to 1D.

C++ code

for(int i=0; i<w; i++)
  for(int j=0; j<h; j++)
    for(int k=0; k<d; k++)
      arr[h*w*i + w*j + k] = (h*w*i + w*j + k)*2;

This is what I have so far in CUDA:

  int w = h = d;
  int N = 64;

 __global__ void getIndex(float* A)
{
  int i = blockIdx.x;
  int j = blockIdx.y;
  int k = blockIdx.z;
  A[h*w*i+ w*j+ k] = h*w*i+ w*j+ k;
}


int main(int argc, char **argv)
 {

    float *d_A;
    cudaMalloc((void **)&d_A, w * h * d * sizeof(float) );
    getIndex <<<N,1>>> (d_A);
  }

But I'm not getting the result I expect, and I do not know how to get the right i, j and k indices.

Consider a 3D problem of size w x h x d. (This could be a simple array that has to be filled as in your question, or any other 3D problem that is easy to parallelize.) I will use your simple fill task for demonstration purposes.

The easiest way to handle this with a CUDA kernel is to launch one thread per array entry, that is, w*h*d threads. This answer discusses why one thread per element may not always be the best solution.

Now let us have a look at the following lines of code:

dim3 numThreads(w,h,d);
getIndex <<<1, numThreads>>> (d_A, w, h, d);

Here we launch a kernel with a total of w*h*d threads. The kernel can then be implemented as:

__global__ void getIndex(float* A, int w, int h, int d) // we actually do not need w
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    int k = threadIdx.z;
    A[h*d*i+ d*j+ k] = h*d*i+ d*j+ k;
}

But there is a problem with this kernel and this kernel call: the number of threads per thread block is limited (the number of threads in each specific direction is also bounded; the z direction is usually the most restricted). Since we launch only one thread block, our problem size cannot exceed these limits (e.g. w*h*d <= 1024).
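If you want to see the exact limits on your particular GPU, you can query them at runtime. A minimal sketch using cudaGetDeviceProperties (assuming device 0):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    // Upper bound on threads per block and per-dimension bounds of a block
    printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);
    printf("maxThreadsDim: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    return 0;
}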

This is what thread blocks are for. In practice you can launch a kernel with as many threads as you want. (This is not strictly true, but the limits on the maximum number of thread blocks are unlikely to be exhausted.)

Calling the kernel this way:

dim3 numBlocks(w/8,h/8,d/8);
dim3 numThreads(8,8,8);
getIndex <<<numBlocks, numThreads>>> (d_A, w, h, d);

will launch the kernel with w/8 * h/8 * d/8 thread blocks, where every block contains 8*8*8 threads. So in total w*h*d threads will be launched. Now we have to adjust our kernel accordingly:

__global__ void getIndex(float* A, int w, int h, int d) // we actually do not need w
{
    // global position of this thread inside the w x h x d problem,
    // composed from the block index and the thread index within the block
    int i = 8 * blockIdx.x + threadIdx.x;
    int j = 8 * blockIdx.y + threadIdx.y;
    int k = 8 * blockIdx.z + threadIdx.z;
    A[h*d*i + d*j + k] = h*d*i + d*j + k;
}
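For completeness, a minimal host-side sketch that allocates the array, launches this kernel with the configuration above, and copies the result back for inspection could look like this (assuming w, h and d are multiples of 8; error checking omitted):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    const int w = 16, h = 16, d = 16;                 // all multiples of 8
    const size_t bytes = size_t(w) * h * d * sizeof(float);

    float *d_A;
    cudaMalloc((void **)&d_A, bytes);

    dim3 numBlocks(w / 8, h / 8, d / 8);
    dim3 numThreads(8, 8, 8);
    getIndex<<<numBlocks, numThreads>>>(d_A, w, h, d);

    // copy the result back and print a few entries to verify the indexing
    float *A = (float *)malloc(bytes);
    cudaMemcpy(A, d_A, bytes, cudaMemcpyDeviceToHost);
    for (int n = 0; n < 5; ++n)
        printf("A[%d] = %f\n", n, A[n]);

    free(A);
    cudaFree(d_A);
    return 0;
}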

Note:

  • You can write a more general kernel by using blockDim.x instead of the fixed size 8, and by calculating w via gridDim.x*blockDim.x. The other two dimensions are handled likewise. (See the sketch after this list.)
  • In the proposed example all three dimensions w, h and d have to be multiples of 8. You can also generalize the kernel to allow arbitrary dimensions. (Then you have to pass all three dimensions to the kernel and check whether the calculated position is still within the bounds of the problem, as in the sketch below.)
  • As already mentioned, it may be more efficient to handle more than one entry of the array per thread. This again has to be considered when calling the kernel. A wrapper function which takes the problem size and the data and calls the kernel with the right block and thread configuration may be useful.
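Putting the last points together, a sketch of such a generalized kernel plus a small launch wrapper could look as follows (the names getIndexGeneral and launchGetIndex are just placeholders; the grid is rounded up so the dimensions no longer need to be multiples of the block size):

__global__ void getIndexGeneral(float* A, int w, int h, int d)
{
    // global position computed from block and thread indices
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;

    // bounds check: the grid may be slightly larger than the problem
    if (i < w && j < h && k < d)
        A[h*d*i + d*j + k] = h*d*i + d*j + k;
}

// Wrapper: picks a block size and rounds the grid up to cover w x h x d.
void launchGetIndex(float* d_A, int w, int h, int d)
{
    dim3 numThreads(8, 8, 8);
    dim3 numBlocks((w + numThreads.x - 1) / numThreads.x,
                   (h + numThreads.y - 1) / numThreads.y,
                   (d + numThreads.z - 1) / numThreads.z);
    getIndexGeneral<<<numBlocks, numThreads>>>(d_A, w, h, d);
}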
