Non-square matrix multiplication in CUDA

For my GPU programming class, we've been tasked with completing certain parts of a non-square matrix multiplication program: specifically, the kernel function and the initialization of the thread block and kernel grid dimensions.

I've based my code on the matrix multiplication example in the CUDA C Programming Guide, but instead of using structs as it does, I have modified mine to use only the parameters given (since we're not allowed to change the parameters). We are provided with the three matrices A, B, and C, as well as their dimensions: m x k, k x n, and m x n, respectively. Where the struct code used A.height I've used dimension m, where it used B.width I've used dimension n, and so on.
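
For reference, this is the row-major indexing the adaptation assumes for the flat arrays (my own summary, not part of the assignment):

    // Row-major layout: element (row, col) of each matrix
    //   A is m x k:  A[row * k + col]
    //   B is k x n:  B[row * n + col]
    //   C is m x n:  C[row * n + col]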

I've run into several problems, the first of which is that my program doesn't pass the included test, which verifies the correctness of the product matrix C. I assume there is something wrong in my matrix multiplication code, and that the issue probably arises from my adaptation of the struct code.

#include <stdio.h>
__global__ void mysgemm(int m, int n, int k, const float *A, const float *B,
        float* C) {

    /********************************************************************
     *
     * Compute C = A x B
     *   where A is a (m x k) matrix
     *   where B is a (k x n) matrix
     *   where C is a (m x n) matrix
     *
     ********************************************************************/

    // INSERT KERNEL CODE HERE
    // Each thread computes one element of C
    // by accumulating results into Cvalue
    float Cvalue = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    for (int e = 0; e < k; ++e){
        Cvalue += (A[row * k + e]) * (B[e * n + col]);
    }
    C[row * n + col] = Cvalue;
}

My other problem, which I'm even less sure about, involves the code to initialize the thread block and kernel grid dimensions.

// Initialize thread block and kernel grid dimensions ---------------------
    const unsigned int BLOCK_SIZE = 16; // Use 16x16 thread blocks
//INSERT CODE HERE
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(n / dimBlock.x, m / dimBlock.y);
// Invoke CUDA kernel -----------------------------------------------------
//INSERT CODE HERE
    mysgemm<<<dimGrid, dimBlock>>>(m, n, k, A, B, C);

I understand dimBlock, but I don't understand dimGrid, and don't have a proper idea of what to use as parameters for it. When I run the code as is, the kernel won't even launch if the matrix I pass in doesn't have a dimension that is a power of 2. And if I do use a power of 2, the test still fails.

I apologize if I've been too wordy. This is my first post and I wanted to give as many details as possible. Hopefully someone can help walk me through these issues.

The kernel I'm posting below is a variant of the one I posted in

CUDA: Tiled matrix-matrix multiplication with shared memory and matrix size which is non-multiple of the block size

in that it does not use shared memory.

__global__ void MatMulNoShared(float* A, float* B, float* C, int ARows, int ACols, int BRows, int BCols, int CRows, int CCols) {

    float CValue = 0;

    int Row = blockIdx.y*TILE_DIM + threadIdx.y;
    int Col = blockIdx.x*TILE_DIM + threadIdx.x;

    // Walk the inner (ACols == BRows) dimension tile by tile, rounding the
    // tile count up so sizes that are not multiples of TILE_DIM are covered
    for (int k = 0; k < (TILE_DIM + ACols - 1)/TILE_DIM; k++) {

        for (int n = 0; n < TILE_DIM; ++n)
            if ((k*TILE_DIM + n < ACols && Row < ARows) && (k*TILE_DIM + n < BRows && Col < BCols))
                CValue += A[Row*ACols + k*TILE_DIM + n] * B[(k*TILE_DIM + n)*BCols + Col];

    }

    // Only threads mapping to a valid element of C write a result
    if (Row < CRows && Col < CCols) C[Row*CCols + Col] = CValue;
}

The two if statements in the kernel are the if statements mentioned in the answer by Eric.

For the sake of your convenience, I'm posting the full code below:

#include <stdio.h>
#include <math.h>
#include <stdlib.h>

#define TILE_DIM 16                     // Tile dimension
#define DIMX 373                        // Rows of A and C
#define DIMY 242                        // Cols of A / rows of B
#define DIMZ 533                        // Cols of B and C

__global__ void MatMulNoShared(float* A, float* B, float* C, int ARows, int ACols, int BRows, int BCols, int CRows, int CCols) {

    float CValue = 0;

    int Row = blockIdx.y*TILE_DIM + threadIdx.y;
    int Col = blockIdx.x*TILE_DIM + threadIdx.x;

    for (int k = 0; k < (TILE_DIM + ACols - 1)/TILE_DIM; k++) {

        for (int n = 0; n < TILE_DIM; ++n) 
            if ((k*TILE_DIM + n < ACols && Row < ARows) && (k*TILE_DIM + n < BRows && Col < BCols))
                CValue += A[Row*ACols + k*TILE_DIM + n] * B[(k*TILE_DIM + n)*BCols + Col];

    }

    if (Row < CRows && Col < CCols) C[Row*CCols + Col] = CValue;
}

int main() {

    int CCols = DIMZ, CRows=DIMX, ACols=DIMY, ARows=DIMX, BCols=DIMZ, BRows=DIMY;

    dim3 dimBlock(TILE_DIM, TILE_DIM, 1);
    dim3 dimGrid;

    dimGrid.x = (CCols + dimBlock.x - 1)/dimBlock.x;
    dimGrid.y = (CRows + dimBlock.y - 1)/dimBlock.y;

    float *deviceA, *deviceB, *deviceC;

    float* hostA    = (float*)malloc(DIMX*DIMY*sizeof(float));
    float* hostB    = (float*)malloc(DIMY*DIMZ*sizeof(float));
    float* hostC    = (float*)malloc(DIMX*DIMZ*sizeof(float));
    float* hostCp   = (float*)malloc(DIMX*DIMZ*sizeof(float));

    // A is DIMX x DIMY and B is DIMY x DIMZ, so they need separate
    // initialization loops (initializing both over DIMX x DIMY would
    // leave part of B uninitialized)
    for (int x = 0; x<DIMX; x++)
        for (int y = 0; y<DIMY; y++)
            hostA[x*DIMY+y] = rand()/(float)RAND_MAX;

    for (int x = 0; x<DIMY; x++)
        for (int y = 0; y<DIMZ; y++)
            hostB[x*DIMZ+y] = rand()/(float)RAND_MAX;

    cudaMalloc((void **)&deviceA, DIMX*DIMY*sizeof(float));
    cudaMalloc((void **)&deviceB, DIMY*DIMZ*sizeof(float));
    cudaMalloc((void **)&deviceC, DIMX*DIMZ*sizeof(float));

    cudaMemcpy(deviceA, hostA, DIMX*DIMY*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(deviceB, hostB, DIMY*DIMZ*sizeof(float), cudaMemcpyHostToDevice);

    MatMulNoShared<<<dimGrid, dimBlock>>>(deviceA, deviceB, deviceC, ARows, ACols, BRows, BCols, CRows, CCols);

    cudaMemcpy(hostC, deviceC, DIMX*DIMZ*sizeof(float), cudaMemcpyDeviceToHost);

    // Release device and host buffers
    cudaFree(deviceA); cudaFree(deviceB); cudaFree(deviceC);
    free(hostA); free(hostB); free(hostC); free(hostCp);

    return 0;
}
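
As a side note, not part of the original answer: hostCp is allocated but never used. A minimal verification sketch, which could be placed just before the cleanup in main(), computes a CPU reference into hostCp and compares it with the GPU result:

    // Hypothetical check (my addition): CPU reference product into hostCp
    for (int i = 0; i < DIMX; i++)
        for (int j = 0; j < DIMZ; j++) {
            float sum = 0.0f;
            for (int e = 0; e < DIMY; e++)
                sum += hostA[i*DIMY + e] * hostB[e*DIMZ + j];
            hostCp[i*DIMZ + j] = sum;
        }

    // Count elements that differ beyond a loose float tolerance
    int mismatches = 0;
    for (int i = 0; i < DIMX*DIMZ; i++)
        if (fabsf(hostC[i] - hostCp[i]) > 1e-3f) mismatches++;
    printf("%d mismatches out of %d elements\n", mismatches, DIMX*DIMZ);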

Note that the two instructions

    dimGrid.x = (CCols + dimBlock.x - 1)/dimBlock.x;
    dimGrid.y = (CRows + dimBlock.y - 1)/dimBlock.y;

ensure full tiled coverage of the matrices, as mentioned at point 1 of Eric's answer. For example, with CCols = 533 and TILE_DIM = 16, dimGrid.x = (533 + 15)/16 = 34, so the grid spans 34 x 16 = 544 columns; the bounds checks in the kernel prevent the 11 excess columns from being written.

Your code currently only works when m and n are multiples of 16, which is your block size. 您的代码目前仅在m和n为16的倍数时才有效,这是您的块大小。

There are two things you can do to make it work for arbitrary sizes:

  1. Make the grid size large enough to cover the whole matrix C. Instead of using the floor of n/blockdim.x as you have done, you can use the ceiling of that value:

      (n + blockdim.x - 1)/blockdim.x

  2. After you have done step 1, the matrix you are multiplying will be a little larger because of the ceiling operation. You can then limit the multiplication to the exact size of the result matrix C by adding an if clause in the kernel (see the sketch after this list).
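
A minimal sketch of both changes applied to the question's kernel and launch code (my own illustration, reusing the question's variable names):

    __global__ void mysgemm(int m, int n, int k, const float *A, const float *B,
            float* C) {
        float Cvalue = 0;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;

        // With a rounded-up grid, some threads fall outside C: guard them
        if (row < m && col < n) {
            for (int e = 0; e < k; ++e)
                Cvalue += A[row * k + e] * B[e * n + col];
            C[row * n + col] = Cvalue;
        }
    }

    // Grid rounded up so every element of C is covered
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid((n + dimBlock.x - 1) / dimBlock.x,
                 (m + dimBlock.y - 1) / dimBlock.y);
    mysgemm<<<dimGrid, dimBlock>>>(m, n, k, A, B, C);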

Please refer to the CUDA docs for more details, especially the programming guide.

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
