
Cuda to make Matrix Multiplication

I have a problem making a matrix multiplication using CUDA. I have to compute A*A*A*A and save it in hB. With cuBLAS it works, but I can't get it to work with my own CUDA kernel. The dimension can be a high value like 2000. This is my code:

__global__ void CudaMM(float *A, float *B, int N)
{

    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int col = blockIdx.x*blockDim.x + threadIdx.x;

    float sum = 0.f;
    for (int n = 0; n < N; ++n)
        sum += A[row*N+n]*A[n*N+col];

    B[row*N+col] = sum;
}

void CudaMult(int dimension,float *hMatrice,float *hB,float *d_A,float *d_B){
    int N,K;
    K = 100;            
    N = K*BLOCK_SIZE;

    dim3 threadBlock(BLOCK_SIZE,BLOCK_SIZE);
    dim3 grid(K,K);

    cudaMemcpy(d_A,hMatrice,dimension*dimension*sizeof(float),cudaMemcpyHostToDevice);

    CudaMM<<<grid,threadBlock>>>(d_A,d_B,N);

    cudaMemcpy(hB,d_B,dimension*dimension*sizeof(float),cudaMemcpyDeviceToHost);
}

void CublasFindConnect(int dimension,float* mat,float* B){


    float *d_A,*d_B;
    cudaMalloc(&d_A,dimension*dimension*sizeof(float));
    cudaMalloc(&d_B,dimension*dimension*sizeof(float));

    int w=0;
    while(w<5){

        CudaMult(dimension,mat,B,d_A,d_B);

          // Copy Matrix computed B to previous M

            for (m=0; m<dimension; m++) {

                for (n=0; n<dimension; n++) {
                    mat[m*dimension+n]=B[m*dimension+n];
                    B[m*dimension+n]=0;
                }
            }

     w++;
    }

    cudaFree(d_A);
    cudaFree(d_B);

}

I've installed the latest CUDA 6, which, as I understand it, doesn't require cudaMemcpy because memory is shared.

  • I would suggest you start by doing proper CUDA error checking on the code you have shown, and see what results you get.
  • It would be better if you showed a complete code as well. For example, what is BLOCK_SIZE? The idea is not to tell me what BLOCK_SIZE is, but to show a complete code.
  • As an aside, the feature you are referring to in CUDA 6 has specific requirements (such as the use of cudaMallocManaged()) that you're not meeting, but nevertheless your code is not dependent on Unified Memory, so it's irrelevant.
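As a rough illustration of the error checking suggested in the first comment, the sketch below wraps every runtime call from the question's CudaMult in a checking macro. The names CUDA_CHECK and CudaMultChecked are hypothetical and not part of the original post, and the launch configuration is passed in as parameters because the question does not show BLOCK_SIZE.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not from the original post): stop with a readable
// message whenever a CUDA runtime call returns something other than cudaSuccess.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",               \
                    __FILE__, __LINE__, cudaGetErrorString(err_));     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Kernel exactly as posted in the question (no bounds check).
__global__ void CudaMM(float *A, float *B, int N)
{
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int col = blockIdx.x*blockDim.x + threadIdx.x;

    float sum = 0.f;
    for (int n = 0; n < N; ++n)
        sum += A[row*N+n]*A[n*N+col];

    B[row*N+col] = sum;
}

// Mirrors CudaMult from the question, with every runtime call checked.
// N, grid and threadBlock are parameters here (an assumption of this sketch).
void CudaMultChecked(int dimension, int N, dim3 grid, dim3 threadBlock,
                     const float *hA, float *hB, float *d_A, float *d_B)
{
    size_t bytes = (size_t)dimension * dimension * sizeof(float);

    CUDA_CHECK(cudaMemcpy(d_A, hA, bytes, cudaMemcpyHostToDevice));

    CudaMM<<<grid, threadBlock>>>(d_A, d_B, N);
    CUDA_CHECK(cudaGetLastError());       // reports invalid launch configurations
    CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel ran

    CUDA_CHECK(cudaMemcpy(hB, d_B, bytes, cudaMemcpyDeviceToHost));
}
```

The check after the launch is split in two because a kernel launch is asynchronous: cudaGetLastError() catches a rejected launch, while errors from the kernel body only surface at a later synchronizing call.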

One problem I can see in your code is that your dimension variable is arbitrary (you say it can be up to a large number like 2000) but your computation size is fixed at N=K*BLOCK_SIZE. Presumably if your BLOCK_SIZE is some value like 16 or 32, then it will meet your approximate max dimension size of ~2000.

The problem arises because your grid size is potentially larger than your valid array size. You are launching an N x N grid of threads, but N can be larger than dimension. This means some of the launched threads can attempt to access the matrices (A and B) outside of their valid dimensions.
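To make the mismatch concrete, here is a small host-side sketch that counts how many launched threads fall outside a dimension x dimension matrix. The helper name threadsOutsideMatrix is hypothetical, and the block sizes used in the note after it are only the guesses of 16 and 32, since the question does not show the real BLOCK_SIZE.

```cuda
// Hypothetical helper (not from the original post): number of launched threads
// that do not correspond to any element of a dimension x dimension matrix when
// the grid is K x K blocks of blockSize x blockSize threads.
long long threadsOutsideMatrix(int K, int blockSize, int dimension)
{
    long long side = (long long)K * blockSize;  // this is N in the question
    if (side <= dimension)
        return 0;                               // grid does not exceed the matrix
    return side * side - (long long)dimension * dimension;
}
```

With K = 100 and a dimension of 2000, a block size of 32 gives N = 3200, so 3200*3200 - 2000*2000 = 6,240,000 threads compute an index past the end of the dimension*dimension buffers. A block size of 16 gives N = 1600, which launches no out-of-range threads but leaves part of the matrix uncomputed and indexes with a row stride of 1600 into data stored with a stride of 2000.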

You can fix this with a "thread check" in your kernel, something like this:

__global__ void CudaMM(float *A, float *B, int N)
{

    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int col = blockIdx.x*blockDim.x + threadIdx.x;

    if ((row < N) && (col < N)) {

      float sum = 0.f;
      for (int n = 0; n < N; ++n)
        sum += A[row*N+n]*A[n*N+col];

      B[row*N+col] = sum;
    }
}

You will also need to modify your kernel invocation so that it passes the actual dimension:

CudaMM<<<grid,threadBlock>>>(d_A,d_B,dimension);

You might also want to consider choosing grid sizes based on your actual dimension, rather than fixed at 100*BLOCK_SIZE, but that is not essential to get the code to work.
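One possible way to size the grid from the actual dimension is to round up the number of blocks per side, so that a partially filled block covers the matrix edge and the thread check discards the excess threads. This is only a sketch: the value of BLOCK_SIZE is assumed to be 16 (the question never shows it), and ceilDiv and LaunchCudaMM are hypothetical names.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 16  // assumed value; the question does not show its definition

// Kernel with the thread check, as given in the answer.
__global__ void CudaMM(float *A, float *B, int N)
{
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int col = blockIdx.x*blockDim.x + threadIdx.x;

    if ((row < N) && (col < N)) {
        float sum = 0.f;
        for (int n = 0; n < N; ++n)
            sum += A[row*N+n]*A[n*N+col];

        B[row*N+col] = sum;
    }
}

// Hypothetical helper: integer division rounded up, for positive a and b.
static inline unsigned int ceilDiv(int a, int b)
{
    return (unsigned int)((a + b - 1) / b);
}

// Hypothetical wrapper: launches just enough blocks to cover the matrix.
// d_A and d_B must each hold dimension*dimension floats on the device.
void LaunchCudaMM(float *d_A, float *d_B, int dimension)
{
    dim3 threadBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(ceilDiv(dimension, BLOCK_SIZE), ceilDiv(dimension, BLOCK_SIZE));

    CudaMM<<<grid, threadBlock>>>(d_A, d_B, dimension);
}
```

For a dimension of 2000 this launches 125 x 125 blocks instead of 100 x 100, and for a dimension of 2001 it launches 126 x 126, with the threads in the last row and column of blocks filtered out by the check inside the kernel.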
