Matrix multiplication using CUDA

Question

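I'm stuck on matrix multiplication with CUDA. The resulting product matrix is always zero. I have read some sample code, such as "Matrix Multiplication in cuda", to solve my problem, but all in vain.

Apart from the result always being 0, the maximum size of "Width" (in the code below) cannot even reach 512. I was not able to debug where the problem is. Maybe we can discuss it on StackOverflow.

I am referring to "Programming Massively Parallel Processors".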

#include<cuda.h>
#include<stdio.h>

int main(void) {
    void MatrixMultiplication(float *, float *, float *, int);
    const int Width = 5;
    float M[Width*Width], N[Width*Width], P[Width*Width];
    for(int i = 0; i < (Width*Width) ; i++) {
        M[i] = 5;
        N[i] = 5;
        P[i] = 0;
    }
    MatrixMultiplication(M, N, P, Width);
    for(int i = 0; i < (Width*Width) ; i++) {
        printf("%d \n", P[i]);
    }
    int quit;
    scanf("%d",&quit);
    return 0;
}

//Matrix multiplication kernel - thread specification
__global__ void MatrixMulKernel(float *Md, float *Nd, float *Pd, int Width) {
    //2D Thread ID
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    //Pvalue stores the Pd element that is computed by the thread
    float Pvalue = 0;

    for(int k = 0; k < Width ; ++k) {
        float Mdelement = Md[ty*Width + k];
        float Ndelement = Nd[k*Width + tx];
        Pvalue += (Mdelement*Ndelement);
    }

    Pd[ty*Width + tx] = Pvalue;
}

void MatrixMultiplication(float *M, float *N, float *P, int Width) {
    int size = Width*Width*sizeof(float);
    float *Md, *Nd, *Pd;

    //Transfer M and N to device memory
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md,M,size,cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd,N,size,cudaMemcpyHostToDevice);

    //Allocate P on the device
    cudaMalloc((void**)&Pd,size);

    //Setup the execution configuration
    dim3 dimBlock(Width,Width);
    dim3 dimGrid(1,1);

    //Launch the device computation threads!
    MatrixMulKernel<<<dimGrid,dimBlock>>>(Md,Nd,Pd,Width);

    //Transfer P from device to host
    cudaMemcpy(P,Pd,size,cudaMemcpyDeviceToHost);

    //Free device matrices
    cudaFree(Md);
    cudaFree(Nd);
    cudaFree(Pd);
}

Answer 1

Everything is fine up to this point:

for(int i = 0; i < (Width*Width) ; i++) {
    printf("%d \n", P[i]);
}

I changed it to %f (since the values are floats) and they all print fine :)

$ ./test.exe
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
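The zeros come from the format string rather than from the kernel. `%d` tells printf to read an int, but a float argument is promoted to double when it is passed through a variadic call, so the mismatch is undefined behavior, which here shows up as zeros. A minimal host-side sketch of the corrected printing follows; `formatElement` and `printMatrix` are hypothetical helper names, not part of the original code.

```cuda
#include <cstdio>
#include <string>

// Hypothetical helper: formats one matrix element the way the fixed loop prints it.
std::string formatElement(float value) {
    char buf[64];
    // %f expects a double; the float argument is promoted automatically.
    std::snprintf(buf, sizeof(buf), "%f", value);
    return std::string(buf);
}

// Hypothetical helper: prints a row-major Width x Width matrix, one row per line.
void printMatrix(const float *P, int Width) {
    for (int row = 0; row < Width; ++row) {
        for (int col = 0; col < Width; ++col) {
            std::printf("%s ", formatElement(P[row * Width + col]).c_str());
        }
        std::printf("\n");
    }
}
```

Compiling with warnings enabled for host code (for example `-Wall` with GCC or Clang) usually flags this kind of format mismatch.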

Answer 2

I figured out what was wrong. Let's analyze it:

Point 1: Getting rid of the monotonous "zero value"

As noted above, you must replace printf("%d \n", P[i]); with printf("%f \n", P[i]);

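Point 2: Why does the program fail with a Width of 512?

Actually, it fails even for a value as small as 23. Why? Because the kernel is launched as a single block of Width x Width threads, and 23 * 23 = 529 > 512 (as of now, the maximum number of threads a GPU can have per block!).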
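One way around the per-block limit is to keep the block size fixed and cover the matrix with several blocks, computing each thread's row and column from blockIdx as well as threadIdx. The sketch below assumes a 16x16 block; `BLOCK_DIM`, `MatrixMulKernelGrid`, `checkCuda` and `MatrixMultiplicationMultiBlock` are illustrative names, not from the original post. It also checks the launch with cudaGetLastError, which is where an oversized block such as 23x23 would be reported on a device limited to 512 threads per block, instead of failing silently.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define BLOCK_DIM 16  // assumed block edge; 16*16 = 256 threads per block

// Illustrative helper: abort with a message if a CUDA runtime call failed.
static void checkCuda(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Each thread computes one element of Pd, located from block and thread indices.
__global__ void MatrixMulKernelGrid(const float *Md, const float *Nd, float *Pd, int Width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // The grid is rounded up, so threads past the matrix edge must do nothing.
    if (row >= Width || col >= Width) return;

    float Pvalue = 0.0f;
    for (int k = 0; k < Width; ++k) {
        Pvalue += Md[row * Width + k] * Nd[k * Width + col];
    }
    Pd[row * Width + col] = Pvalue;
}

// Illustrative host wrapper with the same role as MatrixMultiplication in the question.
void MatrixMultiplicationMultiBlock(const float *M, const float *N, float *P, int Width) {
    size_t size = (size_t)Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;

    checkCuda(cudaMalloc((void **)&Md, size), "cudaMalloc Md");
    checkCuda(cudaMalloc((void **)&Nd, size), "cudaMalloc Nd");
    checkCuda(cudaMalloc((void **)&Pd, size), "cudaMalloc Pd");
    checkCuda(cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice), "copy M");
    checkCuda(cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice), "copy N");

    dim3 dimBlock(BLOCK_DIM, BLOCK_DIM);
    int blocks = (Width + BLOCK_DIM - 1) / BLOCK_DIM;  // ceiling division
    dim3 dimGrid(blocks, blocks);

    MatrixMulKernelGrid<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
    checkCuda(cudaGetLastError(), "kernel launch");    // invalid configurations show up here
    checkCuda(cudaDeviceSynchronize(), "kernel run");  // errors during execution show up here

    checkCuda(cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost), "copy P");
    cudaFree(Md);
    cudaFree(Nd);
    cudaFree(Pd);
}
```

With Width = 23 this launches a 2x2 grid of 16x16 blocks, and with M and N filled with 5 as in the question every element of P should come out as 575. The actual per-block limit depends on the device and can be read from the maxThreadsPerBlock field returned by cudaGetDeviceProperties.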

Answer 3

In your MatrixMulKernel function, your for loop looks like

for(int k = 0; k < Width ; ++k) 
{
    //rest of code      
}

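The suggestion here is to use Width*Width instead of Width as the loop bound, on the grounds that the arrays hold Width*Width elements. Note, however, that the kernel indexes Md[ty*Width + k] and Nd[k*Width + tx], so k only has to range over the Width terms of one dot product; a bound of Width*Width would index past the end of both arrays.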
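For comparison, a plain CPU version of the same product makes the loop bounds explicit: the two outer loops together visit all Width*Width output elements, while the inner loop over k runs Width times, because each output element is the dot product of one row of M with one column of N. The function name `matMulReference` is an illustrative choice, and the sketch assumes row-major square matrices as in the question.

```cuda
// Reference CPU implementation (illustrative; row-major Width x Width matrices).
void matMulReference(const float *M, const float *N, float *P, int Width) {
    for (int row = 0; row < Width; ++row) {
        for (int col = 0; col < Width; ++col) {
            float sum = 0.0f;
            // Width terms per element: row `row` of M times column `col` of N.
            for (int k = 0; k < Width; ++k) {
                sum += M[row * Width + k] * N[k * Width + col];
            }
            P[row * Width + col] = sum;
        }
    }
}
```

Comparing its output against the array copied back from the GPU is a simple way to validate a kernel.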

3 answers

Answer 1
4 2011-02-17 14:54:35

Answer 2
1 accepted 2011-02-17 18:54:46

Answer 3
0 2011-02-16 21:04:08
