
CUDA: Using grid-strided loop with reduction in shared memory

I have the following question concerning the combined use of grid-strided loops and optimized reduction algorithms in shared memory in CUDA kernels. Imagine that you have a 1D array with more elements than there are threads in the grid (BLOCK_SIZE * GRID_SIZE). In this case you would write a kernel of this kind:

#define BLOCK_SIZE (8)
#define GRID_SIZE (8)
#define N (2000)

// ...

__global__ void gridStridedLoop_kernel(double *global_1D_array)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    int i;

    // N is a total number of elements in the global_1D_array array
    for (i = idx; i < N; i += blockDim.x * gridDim.x)
    {
        // Do smth...
    }
}

Now suppose you want to find the maximum element in global_1D_array using a reduction in shared memory; the above kernel would then look like this:

#define BLOCK_SIZE (8)
#define GRID_SIZE (8)
#define N (2000)

// ...

__global__ void gridStridedLoop_kernel(double *global_1D_array)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    int i;

    // Initialize shared memory array for the each block
    __shared__ double data[BLOCK_SIZE];

    // N is a total number of elements in the global_1D_array array
    for (i = idx; i < N; i += blockDim.x * gridDim.x)
    {
        // Load data from global to shared memory
        data[threadIdx.x] = global_1D_array[i];
        __syncthreads();

        // Do reduction in shared memory ...
    }

    // Copy MAX value for each block into global memory
}

It is clear that some values in data will be overwritten, i.e. you either need a longer shared-memory array or have to organize the kernel in another way. What is the best (most efficient) way to use a shared-memory reduction together with a grid-strided loop?

Thanks in advance.

A reduction using a grid-strided loop is documented here. Referring to slide 32, the grid-strided loop looks like this:

unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x;
unsigned int gridSize = blockSize*2*gridDim.x;
sdata[tid] = 0;
while (i < n) {
    sdata[tid] += g_idata[i] + g_idata[i+blockSize];
    i += gridSize;
}
__syncthreads();

Note that each iteration of the while-loop increases the index by gridSize, and this while-loop continues until the index (i) exceeds the (global) data size (n). We call this a grid-strided loop. In this example, the remainder of the threadblock-local reduction operation is not impacted by grid-strided looping, so only the "front-end" is shown. This particular reduction performs a sum-reduction, but a max-reduction would simply replace the operation with something like:

unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockSize + threadIdx.x;
unsigned int gridSize = blockSize*gridDim.x;
// Initialize to the smallest double (requires <float.h>) so that
// negative input values are handled correctly; 0 would not work here.
sdata[tid] = -DBL_MAX;
while (i < n) {
    sdata[tid] = (sdata[tid] < g_idata[i]) ? g_idata[i] : sdata[tid];
    i += gridSize;
}
__syncthreads();

And the remainder of the threadblock-level reduction would have to be modified in a similar fashion, replacing the summing operation with a max-finding operation.
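Concretely, the shared-memory tree-reduction "tail" with the summing operation replaced by a max operation might look like the following sketch (it assumes the block size is a power of two and omits the fully unrolled warp-level version from the slides for clarity; the per-block output array g_odata follows the naming of the slide code):

```cuda
// Tree reduction over the blockDim.x partial maxima in shared memory.
// Assumes blockDim.x is a power of two.
for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s)
        sdata[tid] = (sdata[tid] < sdata[tid + s]) ? sdata[tid + s] : sdata[tid];
    __syncthreads();
}

// Thread 0 now holds the block's maximum; write one value per block.
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
```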

The full parallel reduction CUDA sample code is available as part of any full CUDA samples install, or here.
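Putting the pieces together for the original question, a complete max-reduction kernel might look like the following sketch (the kernel name and the blockMax output array are my own for illustration; BLOCK_SIZE is assumed to be a power of two, and since the kernel produces one maximum per block, a second kernel launch or a host-side pass over the GRID_SIZE per-block results is still needed):

```cuda
#include <float.h>

#define BLOCK_SIZE (8)
#define GRID_SIZE  (8)
#define N          (2000)

__global__ void maxReduce_kernel(const double *global_1D_array, double *blockMax)
{
    __shared__ double sdata[BLOCK_SIZE];

    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int gridSize = blockDim.x * gridDim.x;

    // Grid-strided front-end: each thread folds all of its strided
    // elements into a single running maximum. Start from -DBL_MAX so
    // arrays containing only negative values are handled correctly.
    double myMax = -DBL_MAX;
    while (i < N) {
        myMax = (myMax < global_1D_array[i]) ? global_1D_array[i] : myMax;
        i += gridSize;
    }
    sdata[tid] = myMax;
    __syncthreads();

    // Threadblock-local tree reduction (BLOCK_SIZE must be a power of two).
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] = (sdata[tid] < sdata[tid + s]) ? sdata[tid + s] : sdata[tid];
        __syncthreads();
    }

    // One result per block; reduce the GRID_SIZE values in a second step.
    if (tid == 0) blockMax[blockIdx.x] = sdata[0];
}
```

Because the grid-strided loop accumulates into a per-thread register before touching shared memory, the shared-memory array never needs to be longer than BLOCK_SIZE, which resolves the overwriting problem raised in the question.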
