Faster Matrix Multiplication in CUDA

I am currently writing a neural network program in CUDA C. Because I need to manipulate the matrix multiplication (MM) myself, I did not use cuBLAS for it and use the code below instead. I would appreciate any advice on making it faster, since I need to run the MM millions of times during training. Thanks. This is the Makefile:

# cuda root
_CUDA_ROOT_ = /usr/local/cuda

NVCC = nvcc
# include and lib paths
INCLUDES=-I${_CUDA_ROOT_}/include
LIB_PATH=-L${_CUDA_ROOT_}/lib64

# libraries to link against
LIB= -lcudart -lcublas
CU_SRC= main.cu
EXE=$(CU_SRC:.cu=)
#------------------------------
# Choose your gpu arch
SM = sm_35
all: $(EXE)
$(EXE): $(CU_SRC)
        $(NVCC) -arch $(SM) $(CU_SRC) -o $(EXE) $(LIB_PATH) $(LIB)

clean:
        rm -f *.o *.cu_o $(EXE)

This is the MM code:

__global__
void matrixMulti(float* A_d, float* B_d, float* C_d, int m, int k, int n)
{
    // Tiled multiplication of A (m x n) by B (n x k) into C (m x k), staging one
    // TILE_WIDTH x TILE_WIDTH tile of each operand in shared memory at a time.
    __shared__ float ds_A[TILE_WIDTH][TILE_WIDTH];
    __shared__ float ds_B[TILE_WIDTH][TILE_WIDTH];
    int col = blockIdx.x*blockDim.x + threadIdx.x;
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    float sum = 0;

    // Walk across the shared dimension n one tile at a time.
    for(int t=0; t<(n-1)/TILE_WIDTH+1; t++)
    {
        // Load the current tiles, zero-padding past the matrix edges.
        if(row<m && t*TILE_WIDTH+tx<n)
            ds_A[ty][tx] = A_d[row*n + t*TILE_WIDTH+tx];
        else
            ds_A[ty][tx] = 0.0;
        if(t*TILE_WIDTH+ty<n && col<k)
            ds_B[ty][tx] = B_d[(t*TILE_WIDTH+ty)*k + col];
        else
            ds_B[ty][tx] = 0.0;
        __syncthreads();
        // Accumulate the partial dot product contributed by this tile.
        for(int i=0; i<TILE_WIDTH; i++)
            sum += ds_A[ty][i] * ds_B[i][tx];
        __syncthreads();
    }
    if(row<m && col<k)
        C_d[col+row*k] = sum;
}

This is an example of the main part of the code:

const int TILE_WIDTH = 32;

int main()
{
    int m, k, n;
    m = 10000, k = 10000, n = 10000;
    float *A, *B, *C;
    A = new float[m*n];
    B = new float[n*k];
    C = new float[m*k];
    float *A_d, *B_d, *C_d;
    for (int i=0; i<m*n; i++)
    {
        A[i] = 2;
    }
    for (int i=0; i<n*k; i++)
    {
        B[i] = 3;
    }
    cudaMalloc(&A_d, sizeof(float)*m*n);
    cudaMalloc(&B_d, sizeof(float)*n*k);
    cudaMalloc(&C_d, sizeof(float)*m*k);
    cudaMemcpy(A_d, A, sizeof(float)*m*n, cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B, sizeof(float)*k*n, cudaMemcpyHostToDevice);
    dim3 dimGrid((k-1)/TILE_WIDTH+1, (m-1)/TILE_WIDTH+1, 1);
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);
    matrixMulti<<<dimGrid,dimBlock>>>(A_d, B_d, C_d, m, k, n);
    cudaMemcpy(C, C_d, sizeof(float)*m*k, cudaMemcpyDeviceToHost);
    return 0;
}

Firstly, be really sure this is what you want to do. Without a description of the manipulations you want to do, it's hard to comment on this, but be aware that matrix multiplication is an n-cubed operation. If your manipulations are not of the same complexity, chances are you'll do better simply using cuBLAS.

Why is this? cuBLAS will probably be faster than anything you'll write, and will be much more maintainable as it will follow new GPU architectures. The best implementation of something like GEMM varies by architecture, so any code you write now for your hardware will have to be re-optimised for new hardware.

Now, to the question. There are a number of techniques you should consider to optimise this code:

  1. Compute multiple output values per thread. This reduces the pressure on your shared memory, as tile data can be reused across several calculations (see the sketch after this list).
  2. Fix the bank conflicts in shared memory. This should be covered well by the documentation.
  3. Vectorise shared memory loads and stores. I notice you're compiling for sm_35. This architecture's shared memory banks each have a bandwidth of 64 bits/clock. Loading a single float is only 32 bits, so you won't get full bandwidth on floats without vectorisation. You should look at the float2/float4 types.
  4. Consider double buffering. Load data into one shared memory tile while operating on another. This allows the high latency of global memory operations to be hidden much more effectively, reduces the synchronisation overhead, and often tends to perform better. It uses twice as much shared memory though, as you need two tiles at once.
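To make point 1 concrete, here is a minimal thread-coarsening sketch (not from the original answer) built on the kernel above: each thread accumulates COARSE output elements spaced TILE_WIDTH columns apart, so the A tile loaded into shared memory is reused COARSE times before being replaced. The name matrixMultiCoarsened and the value of COARSE are made up for illustration; the best coarsening factor depends on the architecture and has to be tuned.

const int COARSE = 4;   // outputs per thread; hypothetical value, tune per GPU

__global__
void matrixMultiCoarsened(float* A_d, float* B_d, float* C_d, int m, int k, int n)
{
    __shared__ float ds_A[TILE_WIDTH][TILE_WIDTH];
    __shared__ float ds_B[TILE_WIDTH][TILE_WIDTH];
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int row = blockIdx.y*TILE_WIDTH + ty;
    // Each block now covers COARSE column tiles of C.
    int colStart = blockIdx.x*TILE_WIDTH*COARSE + tx;
    float sum[COARSE];
    for(int c=0; c<COARSE; c++)
        sum[c] = 0.0f;

    for(int t=0; t<(n-1)/TILE_WIDTH+1; t++)
    {
        // Load one A tile and reuse it for all COARSE B tiles below.
        if(row<m && t*TILE_WIDTH+tx<n)
            ds_A[ty][tx] = A_d[row*n + t*TILE_WIDTH+tx];
        else
            ds_A[ty][tx] = 0.0f;
        for(int c=0; c<COARSE; c++)
        {
            int col = colStart + c*TILE_WIDTH;
            if(t*TILE_WIDTH+ty<n && col<k)
                ds_B[ty][tx] = B_d[(t*TILE_WIDTH+ty)*k + col];
            else
                ds_B[ty][tx] = 0.0f;
            __syncthreads();
            for(int i=0; i<TILE_WIDTH; i++)
                sum[c] += ds_A[ty][i] * ds_B[i][tx];
            __syncthreads();
        }
    }
    for(int c=0; c<COARSE; c++)
    {
        int col = colStart + c*TILE_WIDTH;
        if(row<m && col<k)
            C_d[row*k + col] = sum[c];
    }
}

The launch would then use dim3 dimGrid((k-1)/(TILE_WIDTH*COARSE)+1, (m-1)/TILE_WIDTH+1, 1) with the same TILE_WIDTH x TILE_WIDTH block. Each thread keeps its partial sums in registers, so register pressure limits how far COARSE can be pushed.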

There are a number of papers on the implementation of matrix multiplication on GPUs; I suggest you check them out. You'll get a lot more detail from these papers than from asking broad questions on SO.

Finally... are you sure you don't want to use cuBLAS? I wouldn't count on getting 75% of cuBLAS performance, and even that will be a challenge.
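For comparison, here is a minimal sketch (not part of the original answer) of the cuBLAS route for the row-major matrices in the question. cuBLAS expects column-major storage, so the call computes C^T = B^T * A^T, which laid out column-major is exactly the row-major C = A*B:

#include <cublas_v2.h>

// A_d is m x n, B_d is n x k, C_d is m x k, all row-major as in the question.
cublasHandle_t handle;
cublasCreate(&handle);
const float alpha = 1.0f, beta = 0.0f;
// Column-major view: (k x m) result = (k x n) * (n x m), i.e. B^T * A^T = C^T.
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            k, m, n,
            &alpha,
            B_d, k,
            A_d, n,
            &beta,
            C_d, k);
cublasDestroy(handle);

The Makefile above already links against -lcublas, so only the header and the handle management need to be added.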
