
CUDA tiled 3D convolution implementations with shared memory

Based on my study, there are two different strategies to implement a tiled version of convolution with CUDA. I want to know more about this, and would like to see how they compare with each other, what the advantages and disadvantages of each strategy are, and how to choose between them. Below are the implementations of the two different strategies.

Strategy 1: the tile size matches the output size, and multiple steps are needed to load the input tile (including the halo).

#define MASK_WIDTH 3
#define MASK_RADIUS 1

#define TILE_WIDTH 8

// Input (shared-memory) tile edge: output tile plus a halo of MASK_RADIUS on each side.
#define SHAREDMEM_DIM (TILE_WIDTH + (MASK_RADIUS * 2))

__constant__ float deviceMask[MASK_WIDTH * MASK_WIDTH * MASK_WIDTH];

__global__ void conv3d(float *inputArray, 
                   float *outputArray, 
                   const int z_size,
                   const int y_size, 
                   const int x_size) {
    __shared__ float subTile[SHAREDMEM_DIM][SHAREDMEM_DIM][SHAREDMEM_DIM];

    int bx = blockIdx.x, tx = threadIdx.x;
    int by = blockIdx.y, ty = threadIdx.y;
    int bz = blockIdx.z, tz = threadIdx.z;

    // First loading phase: linearize the thread index (0 .. TILE_WIDTH^3 - 1)
    // and map it onto (dZ, dY, dX) coordinates in the padded shared-memory tile.
    int destination = (tz * TILE_WIDTH * TILE_WIDTH) + (ty * TILE_WIDTH) + tx;
    int destTmp = destination;
    int dX = destTmp % SHAREDMEM_DIM;
    destTmp = destTmp / SHAREDMEM_DIM;
    int dY = destTmp % SHAREDMEM_DIM;
    destTmp = destTmp / SHAREDMEM_DIM;
    int dZ = destTmp;

    // Corresponding global input coordinates, shifted back by the mask radius (halo).
    int inputZ = dZ + (bz * TILE_WIDTH) - MASK_RADIUS;
    int inputY = dY + (by * TILE_WIDTH) - MASK_RADIUS;
    int inputX = dX + (bx * TILE_WIDTH) - MASK_RADIUS;
    int input = (inputZ * y_size * x_size) + (inputY * x_size) + inputX;

    // Load the element if it lies inside the volume, otherwise zero-pad.
    if(   inputZ >= 0 && inputZ < z_size 
       && inputY >= 0 && inputY < y_size 
       && inputX >= 0 && inputX < x_size){
           subTile[dZ][dY][dX] = inputArray[input];
    }
    else{
        subTile[dZ][dY][dX] = 0;
    }

    // Second loading phase: the block has TILE_WIDTH^3 = 512 threads, but the
    // padded tile holds SHAREDMEM_DIM^3 = 1000 elements, so each thread loads
    // a second element at an offset of TILE_WIDTH^3.
    destination = TILE_WIDTH * TILE_WIDTH * TILE_WIDTH 
            + (tz * TILE_WIDTH * TILE_WIDTH) + (ty * TILE_WIDTH) + tx;
    destTmp = destination;
    dX = destTmp % SHAREDMEM_DIM;
    destTmp = destTmp / SHAREDMEM_DIM;
    dY = destTmp % SHAREDMEM_DIM;
    destTmp = destTmp / SHAREDMEM_DIM;
    dZ = destTmp;

    inputZ = dZ + (bz * TILE_WIDTH) - MASK_RADIUS;
    inputY = dY + (by * TILE_WIDTH) - MASK_RADIUS;
    inputX = dX + (bx * TILE_WIDTH) - MASK_RADIUS;
    input = (inputZ * y_size * x_size) + (inputY * x_size) + inputX;

    // Threads whose second destination falls outside the tile skip this load.
    if(dZ < SHAREDMEM_DIM){
        if(   inputZ >= 0 && inputZ < z_size 
           && inputY >= 0 && inputY < y_size 
           && inputX >= 0 && inputX < x_size ) {
                subTile[dZ][dY][dX] = inputArray[input];
           }
        else{
            subTile[dZ][dY][dX] = 0;
        }
    }

    __syncthreads();  // make sure the whole tile is loaded before computing

    // Each thread computes exactly one output element from the shared tile.
    float sum = 0;
    int z, y, x;
    for(z = 0; z < MASK_WIDTH; z++){
        for(y = 0; y < MASK_WIDTH; y++){
            for(x = 0; x < MASK_WIDTH; x++){
                sum += subTile[tz + z][ty + y][tx + x] 
                   * deviceMask[x + (y * MASK_WIDTH) + (z * MASK_WIDTH * MASK_WIDTH)];
            }
        }
    }
    // Global output coordinates; guard against partially filled boundary blocks.
    z = tz + (bz * TILE_WIDTH);
    y = ty + (by * TILE_WIDTH);
    x = tx + (bx * TILE_WIDTH);
    if(z < z_size && y < y_size && x < x_size){
        outputArray[x + (y * x_size) + (z * y_size * x_size)] = sum;
    }

    __syncthreads();
}
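
(For reference, a minimal host-side launch for this kernel might look like the sketch below; d_input and d_output are placeholder device pointers assumed to be allocated and filled already, and the mask is assumed to have been copied to deviceMask with cudaMemcpyToSymbol.)

// Hypothetical launch for strategy 1: one thread per OUTPUT element.
// Each 8x8x8 block cooperatively loads its 10x10x10 input tile in two passes.
dim3 blockDim1(TILE_WIDTH, TILE_WIDTH, TILE_WIDTH);
dim3 gridDim1((x_size + TILE_WIDTH - 1) / TILE_WIDTH,
              (y_size + TILE_WIDTH - 1) / TILE_WIDTH,
              (z_size + TILE_WIDTH - 1) / TILE_WIDTH);
conv3d<<<gridDim1, blockDim1>>>(d_input, d_output, z_size, y_size, x_size);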

The second strategy is to set the block size to be the same as the input tile. When calculating the output, some of the threads are turned off.

// Output tile dimensions; the thread block matches the larger input tile,
// i.e. (TILE_X + MASK_WIDTH - 1) x (TILE_Y + MASK_WIDTH - 1) x (TILE_Z + MASK_WIDTH - 1).
#define TILE_X 14 
#define TILE_Y 6 
#define TILE_Z 6 
#define MASK_WIDTH 3
#define MASK_SIZE (MASK_WIDTH * MASK_WIDTH * MASK_WIDTH)
__constant__ float mask[MASK_WIDTH][MASK_WIDTH][MASK_WIDTH];
__global__ void conv3d(float *input, float *output, const int z_size, const int y_size, const int x_size) {
    __shared__ float inputTile [TILE_Z+MASK_WIDTH-1][TILE_Y+MASK_WIDTH-1][TILE_X+MASK_WIDTH-1];
    int tx = threadIdx.x; int ty = threadIdx.y; int tz = threadIdx.z;
    int bx = blockIdx.x; int by = blockIdx.y; int bz = blockIdx.z;

    // Output element this thread maps to (halo threads fall outside the output tile).
    int x_o = bx * TILE_X + tx;
    int y_o = by * TILE_Y + ty;
    int z_o = bz * TILE_Z + tz;

    // Corresponding input element, shifted back by the mask radius.
    int x_i = x_o - MASK_WIDTH/2;
    int y_i = y_o - MASK_WIDTH/2;
    int z_i = z_o - MASK_WIDTH/2;
    // Every thread loads one element of the input tile, zero-padding outside the volume.
    if (x_i >= 0 && y_i >= 0 && z_i >= 0 && x_i < x_size && y_i < y_size && z_i < z_size)
        inputTile[tz][ty][tx] = input[(z_i * y_size + y_i) * x_size + x_i];
    else
        inputTile[tz][ty][tx] = 0.0f;
    __syncthreads();
    float acc = 0.0f;
    // Only the threads that map to an interior (output) element of the tile compute;
    // the halo threads were needed for loading only and stay idle here.
    if(tz < TILE_Z && ty < TILE_Y && tx < TILE_X) {
        for(int z_mask = 0; z_mask < MASK_WIDTH; z_mask++) {
            for(int y_mask = 0; y_mask < MASK_WIDTH; y_mask++) {
                for(int x_mask = 0; x_mask < MASK_WIDTH; x_mask++) {
                    acc += mask[z_mask][y_mask][x_mask] *
                           inputTile[tz+z_mask][ty+y_mask][tx+x_mask];
                }
            }
        }
        // Guard against partially filled boundary blocks before writing the result.
        if(z_o < z_size && y_o < y_size && x_o < x_size)
            output[(z_o * y_size + y_o) * x_size + x_o] = acc;
    }
}
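
(Again for reference, a minimal host-side launch for this version might look like the following; here the block matches the padded input tile, i.e. 16×8×8 threads per block. d_input and d_output are placeholder device pointers, and the mask is assumed to have been copied to the constant array with cudaMemcpyToSymbol.)

// Hypothetical launch for strategy 2: one thread per INPUT-tile element.
// The block equals the padded input tile; threads in the halo only load data.
dim3 blockDim2(TILE_X + MASK_WIDTH - 1, TILE_Y + MASK_WIDTH - 1, TILE_Z + MASK_WIDTH - 1);
dim3 gridDim2((x_size + TILE_X - 1) / TILE_X,
              (y_size + TILE_Y - 1) / TILE_Y,
              (z_size + TILE_Z - 1) / TILE_Z);
conv3d<<<gridDim2, blockDim2>>>(d_input, d_output, z_size, y_size, x_size);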

Any idea about how to choose between these? In addition, which version is used more often in practice, e.g. in deep learning? Also, if you have any comments on the code, please let me know!

The general answer whenever it comes to the question of "which is faster?" is always: measure how fast each approach runs your application scenario to find out. In this case, I would say that the first approach would seem preferable most of the time (if you had to pick one of those two options for some reason). Unless you have some very tiny convolution kernels, the second approach would have lots of threads idle in the parts that do much of the actual work: with the tile sizes above, each block has 16×8×8 = 1024 threads, but only 14×6×6 = 504 of them compute an output, so roughly half the block sits idle during the multiply-accumulate loop, whereas in the first approach all 8×8×8 = 512 threads compute. Be sure to avoid bank conflicts within your tiles and think about the memory access patterns you get from your warps when moving data to and from global memory.

In the end, convolution is basically just computing sums over all possible combinations of kernel coefficients and input elements. Since the workload is essentially just repeatedly fetching these values in some order, convolution is almost necessarily going to be limited by memory bandwidth: for a 3×3×3 mask, even with perfect reuse you only perform 27 multiply-adds per element against roughly 8 bytes of unavoidable global traffic (one float read, one float write), which is far too little arithmetic to keep a modern GPU's ALUs busy. Thus, doing convolution efficiently comes down to optimizing memory accesses and reducing the required bandwidth as much as possible.

[…] which version is used more often in practice, like in deep learning?

Neither. The naïve approach of throwing nested loops at it to brute-force convolution in the spatial domain is almost never an efficient way of computing convolutions. Convolution is such a fundamental operation for so many things that it has been studied extensively; there are literally hundreds, if not thousands, of papers and books you could read on the subject. In deep learning, the problem of convolution has commonly been formulated in terms of general matrix multiplications (GEMMs), since this approach leads to rather nice memory access patterns and many efficient GEMM implementations are available for the GPU. But FFT-based approaches as well as other algorithms are also increasingly used, depending on the application.
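
To illustrate the GEMM formulation: the idea (often called im2col, or im2col-like lowering) is to unroll the K×K×K input neighborhood of every output position into one column of a matrix, after which the convolution itself is just a matrix product with the flattened kernel. Below is a minimal single-channel CPU sketch with "same" zero padding, just to show the lowering; the names and memory layout are illustrative and not taken from any particular library.

#include <vector>
#include <cstddef>

// im2col for a single-channel 3D volume with "same" zero padding:
// each output voxel's K*K*K neighborhood becomes one column of the matrix.
// The result has K*K*K rows and z*y*x columns, stored row-major.
std::vector<float> im2col3d(const float* in, int z, int y, int x, int K) {
    const int R = K / 2;                        // mask radius
    const size_t N = (size_t)z * y * x;         // number of output voxels
    std::vector<float> cols((size_t)K * K * K * N, 0.0f);
    for (int oz = 0; oz < z; ++oz)
      for (int oy = 0; oy < y; ++oy)
        for (int ox = 0; ox < x; ++ox) {
          size_t col = ((size_t)oz * y + oy) * x + ox;
          for (int kz = 0; kz < K; ++kz)
            for (int ky = 0; ky < K; ++ky)
              for (int kx = 0; kx < K; ++kx) {
                int iz = oz + kz - R, iy = oy + ky - R, ix = ox + kx - R;
                size_t row = ((size_t)kz * K + ky) * K + kx;
                if (iz >= 0 && iz < z && iy >= 0 && iy < y && ix >= 0 && ix < x)
                  cols[row * N + col] = in[((size_t)iz * y + iy) * x + ix];
              }
        }
    return cols;
}

// The convolution is then a matrix product: with the K*K*K mask flattened into a
// row vector m, output[col] = sum over row of m[row] * cols[row * N + col], which a
// tuned GEMM (e.g. cuBLAS on the GPU) can execute. With C input channels and F filters
// this becomes a full (F x C*K^3) by (C*K^3 x N) GEMM.

The obvious trade-off is the extra memory for the lowered matrix (K³ times the input size); implicit-GEMM style implementations avoid materializing it while keeping the same access pattern.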
