
Matrix Multiplication using CUDA

I am stuck on matrix multiplication in CUDA. The resultant product matrix is always zero. I have read some sample code, such as matrix multiplication in cuda, to resolve my problem, but all in vain.

Apart from the erratic result of 0, the maximum size of "Width" (code below) cannot even reach 512. I was not able to debug where the problem lies. Maybe we can discuss it on StackOverflow.

I am referring to "Programming Massively Parallel Processors".

#include<cuda.h>
#include<stdio.h>

int main(void) {
    void MatrixMultiplication(float *, float *, float *, int);
    const int Width = 5;
    float M[Width*Width], N[Width*Width], P[Width*Width];
    for(int i = 0; i < (Width*Width) ; i++) {
        M[i] = 5;
        N[i] = 5;
        P[i] = 0;
    }
    MatrixMultiplication(M, N, P, Width);
    for(int i = 0; i < (Width*Width) ; i++) {
        printf("%d \n", P[i]);
    }
    int quit;
    scanf("%d",&quit);
    return 0;
}

//Matrix multiplication kernel - thread specification
__global__ void MatrixMulKernel(float *Md, float *Nd, float *Pd, int Width) {
    //2D Thread ID
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    //Pvalue stores the Pd element that is computed by the thread
    float Pvalue = 0;

    for(int k = 0; k < Width ; ++k) {
        float Mdelement = Md[ty*Width + k];
        float Ndelement = Nd[k*Width + tx];
        Pvalue += (Mdelement*Ndelement);
    }

    Pd[ty*Width + tx] = Pvalue;
}

void MatrixMultiplication(float *M, float *N, float *P, int Width) {
    int size = Width*Width*sizeof(float);
    float *Md, *Nd, *Pd;

    //Transfer M and N to device memory
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md,M,size,cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd,N,size,cudaMemcpyHostToDevice);

    //Allocate P on the device
    cudaMalloc((void**)&Pd,size);

    //Setup the execution configuration
    dim3 dimBlock(Width,Width);
    dim3 dimGrid(1,1);

    //Launch the device computation threads!
    MatrixMulKernel<<<dimGrid,dimBlock>>>(Md,Nd,Pd,Width);

    //Transfer P from device to host
    cudaMemcpy(P,Pd,size,cudaMemcpyDeviceToHost);

    //Free device matrices
    cudaFree(Md);
    cudaFree(Nd);
    cudaFree(Pd);
}

You were doing fine until this point:

for(int i = 0; i < (Width*Width) ; i++) {
    printf("%d \n", P[i]);
}

I changed it to %f (because it's a float) and they all print nicely :) With both inputs filled with 5s, every element of the 5x5 product is 5*5*5 = 125, which is exactly what comes out:

$ ./test.exe
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000

I figured out what was wrong. Let's analyze it:

Point 1 : The quest to remove the ever monotonic "zero value"

As noted, you must replace printf("%d \n", P[i]); with printf("%f \n", P[i]);
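
For reference, the corrected print loop then reads:

for(int i = 0; i < (Width*Width) ; i++) {
    printf("%f \n", P[i]);   //%f, because P holds floats (promoted to double in varargs)
}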

Point 2 : Why does the program fail with a Width of 512?

Actually it will fail for even a small value such as 23. Why? Because 23*23 = 529 > 512, the maximum number of threads a GPU can have per block as of today!
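
If you want to check the exact limit on your own card, a minimal query does it (this is just the standard runtime call cudaGetDeviceProperties, nothing specific to your program):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   //query device 0
    printf("Max threads per block : %d\n", prop.maxThreadsPerBlock);
    printf("Max block dimensions  : %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    return 0;
}

Checking cudaGetLastError() right after the kernel launch would also have turned this silent failure into an explicit invalid-configuration error.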

In your MatrixMulKernel function, the for loop

for(int k = 0; k < Width ; ++k) 
{
    //rest of code      
}

is fine as it stands: each thread computes one dot product of length Width, so Width is the right bound (raising it to Width*Width would read past the ends of Md and Nd). What has to change for bigger matrices is the execution configuration. With dimBlock(Width,Width) and dimGrid(1,1), the whole result matrix must fit into a single block, so Width*Width can never exceed the per-block thread limit. To get past that, tile Pd, launch one block per tile, and derive each thread's global row and column from blockIdx as well as threadIdx, as sketched below.
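
A minimal sketch of that multi-block version (TILE_WIDTH = 16 is my choice here, not something from your code; the bounds check keeps partial tiles at the edges from writing out of range):

#define TILE_WIDTH 16

__global__ void MatrixMulKernel(float *Md, float *Nd, float *Pd, int Width) {
    //Global row and column of the Pd element this thread computes
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    //Threads of a partial tile can fall outside the matrix
    if (row < Width && col < Width) {
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k)
            Pvalue += Md[row*Width + k] * Nd[k*Width + col];
        Pd[row*Width + col] = Pvalue;
    }
}

and the matching launch in MatrixMultiplication:

//One block per TILE_WIDTH x TILE_WIDTH tile of Pd
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid((Width + TILE_WIDTH - 1)/TILE_WIDTH,
             (Width + TILE_WIDTH - 1)/TILE_WIDTH);
MatrixMulKernel<<<dimGrid,dimBlock>>>(Md,Nd,Pd,Width);

With TILE_WIDTH = 16, every block has 256 threads, comfortably under the 512 limit, and the grid grows with Width instead of the block.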
