
Matrix-vector product CUDA performance

I've found some code for a CUDA matrix-vector product in a previous topic: Matrix-vector multiplication in CUDA: benchmarking & performance. First, I was wondering why the author didn't use shared memory for dA (the matrix)?

And then, why is column-major ordering faster than row-major ordering?

Here is the code:

    // Note: BLOCK_SIZE is assumed to be defined elsewhere (e.g. 256) and to
    // equal blockDim.x. The kernel also assumes nRows is a multiple of the
    // block size, otherwise the dA read below goes out of bounds for the
    // last block's extra threads.
    template<typename T>
    __global__ void matvec_kernel(const T * __restrict__ dA, const T * __restrict__ dx, T * __restrict__ dy, const unsigned int nRows, const unsigned int nCols)
    {
        const unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;

        __shared__ T x_shared[BLOCK_SIZE];

        T y_val = 0.0;

        #pragma unroll
        for (unsigned int m = 0; m < ((nCols + BLOCK_SIZE - 1) / BLOCK_SIZE); ++m)
        {
            if ((m * BLOCK_SIZE + threadIdx.x) < nCols) x_shared[threadIdx.x] = dx[threadIdx.x + m * BLOCK_SIZE];
            else                                        x_shared[threadIdx.x] = 0.f;
            __syncthreads();

            #pragma unroll
            for (unsigned int e = 0; e < BLOCK_SIZE; ++e) {
                // --- Column-major ordering - faster
                y_val += dA[tid + (e + BLOCK_SIZE * m) * nRows] * x_shared[e];
                // --- Row-major ordering - slower
                //y_val += dA[tid * nCols + (e + BLOCK_SIZE * m)] * x_shared[e];
            }

            __syncthreads();
        }

        if (tid < nRows) dy[tid] = y_val;
    }

I've been thinking about these two questions for a day now, and that's why I'm here.

Thanks a lot!

Shared memory here works as a cache. The components of the vector are read multiple times during the calculation, but each component of the matrix is read only once. That's why the code caches only the vector, not the matrix.

A column-major matrix is faster because, when reading the matrix, consecutive threads access consecutive elements within a matrix column. Column-major ordering thus ensures coalesced global memory access. If the matrix is row-major, the CUDA kernel should be implemented differently to achieve maximum performance.
