
CUDA C Matrix Multiplication

EDITED to correspond with current state after linked question.

I am currently trying to reimplement basic matrix multiplication in CUDA. My code works fine for square matrices, and for rectangular matrices whose dimensions are multiples of 8, but it does not appear to work for rectangular matrices whose dimensions are not multiples of 8.

The following is my kernel multiplication function:

__global__ void matrixMultiply(float *A, float *B, float *C,
                               int numARows, int numAColumns,
                               int numBRows, int numBColumns,
                               int numCRows, int numCColumns) {
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    if (numAColumns != numBRows) return;
    if ((Row < numARows) && (Col < numBColumns)) {
        float Cvalue = 0;
        for (int k = 0; k < numAColumns; ++k)
            Cvalue += A[Row * numAColumns + k] * B[k * numBColumns + Col];
        C[Row * numCColumns + Col] = Cvalue;
    }
}

The following is the memory allocation (for readability I have cut out the error checking):

cudaMalloc((void**) &deviceA, ARows*sizeof(float)*AColumns);
cudaMalloc((void**) &deviceB, BRows*sizeof(float)*BColumns);
cudaMalloc((void**) &deviceC, CRows*sizeof(float)*CColumns);
cudaMemcpy(deviceA, hostA, ARows*sizeof(float)*AColumns, cudaMemcpyHostToDevice);
cudaMemcpy(deviceB, hostB, BRows*sizeof(float)*BColumns, cudaMemcpyHostToDevice);
cudaMemcpy(deviceC, hostC, CRows*sizeof(float)*CColumns, cudaMemcpyHostToDevice);

While the following is the call:

dim3 dimGrid((int)ceil(numCRows / 8.0) , (int)ceil(numCColumns / 8.0), 1);
dim3 dimBlock(8 , 8, 1);
matrixMultiply<<<dimGrid,dimBlock>>>(deviceA, deviceB, deviceC, numARows, AColumns, BRows, BColumns, CRows, CColumns);

And finally moving the memory back:

cudaMemcpy(hostC, deviceC, CRows*sizeof(float)*CColumns, cudaMemcpyDeviceToHost);

Now I have traced my algorithm repeatedly, and I do not believe there is anything wrong with it, so I personally think there might be something wrong with the block/grid sizing scheme I've used. If anybody who knows CUDA/C better than I do (Ruby/JavaScript guy here) could take a look at it and walk me through what exactly I am doing wrong, I would be very grateful.

The problem is with the grid size you are creating:

dim3 dimGrid((int)ceil(numCRows / 8.0) , (int)ceil(numCColumns / 8.0), 1);

Rows correspond to the Y dimension of the matrix and columns to the X dimension, so you are actually creating a transposed grid.

To create the correct grid, do the following:

dim3 dimGrid((int)ceil(numCColumns / 8.0) , (int)ceil(numCRows / 8.0), 1);

A better approach is to do the following:

dim3 dimGrid;
dimGrid.x = (numCColumns + dimBlock.x - 1) / dimBlock.x;
dimGrid.y = (numCRows + dimBlock.y - 1) / dimBlock.y;
