
Cuda program for Matrix addition

I am trying to write a very simple program to perform matrix addition. I divided the code into two files, a main.cu file and a Matriz.cuh header file. The code is:

At main.cu:

#include <iostream>
#include <cuda.h>

#include "Matriz.cuh"

using std::cout;

int main(void)
{

    Matriz A;
    Matriz B;
    Matriz *C = new Matriz;
    int lin = 10;
    int col = 10;

    A.lin = lin;
    A.col = col;
    B.lin = lin;
    B.col = col;
    C->lin = lin;
    C->col = col;
    C->matriz = new double[lin*col];

    A.matriz = new double[lin*col];
    B.matriz = new double[lin*col];

    for (int ii = 0; ii < lin; ii++)
        for (int jj = 0; jj < col; jj++)
        {
            A.matriz[jj*A.lin + ii] = 1./(float)(10.*jj + ii + 10.0);
            B.matriz[jj*B.lin + ii] = (float)(jj + ii + 1);
        }

    somaMatriz(A, B, C);

    for (int ii = 0; ii < lin; ii++)
    {
        for (int jj = 0; jj < col; jj++)
            cout << C->matriz[jj*C->lin + jj] << " ";
        cout << "\n";
    }

    return 0;

}

At Matriz.cuh:

#include <cuda.h>
#include <iostream>
using std::cout;

#ifndef MATRIZ_CUH_
#define MATRIZ_CUH_

typedef struct{
    double *matriz;
    int    lin;
    int    col;
} Matriz;

__global__ void addMatrix(const Matriz A, const Matriz B, Matriz C)
{
    int idx = threadIdx.x + blockDim.x*gridDim.x;
    int idy = threadIdx.y + blockDim.y*gridDim.y;

    C.matriz[C.lin*idy + idx] = A.matriz[A.lin*idx + idy] + B.matriz[B.lin*idx + idy];
}

void somaMatriz(const Matriz A, const Matriz B, Matriz *C)
{
    Matriz dA;
    Matriz dB;
    Matriz dC;

    int BLOCK_SIZE = A.lin;

    dA.lin = A.lin;
    dA.col = A.col;
    dB.lin = B.lin;
    dB.col = B.col;
    dC.lin = C->lin;
    dC.col = C->col;

    cudaMalloc((void**)&dA.matriz, dA.lin*dA.col*sizeof(double));
    cudaMalloc((void**)&dB.matriz, dB.lin*dB.col*sizeof(double));
    cudaMalloc((void**)&dC.matriz, dC.lin*dC.col*sizeof(double));

    cudaMemcpy(dA.matriz, A.matriz, dA.lin*dA.col*sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dB.matriz, B.matriz, dB.lin*dB.col*sizeof(double), cudaMemcpyHostToDevice);

    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(dA.lin/dimBlock.x, dA.col/dimBlock.y);

    addMatrix<<<dimGrid, dimBlock>>>(dA, dB, dC);

    cudaMemcpy(C->matriz, dC.matriz, dC.lin*dC.col*sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dA.matriz);
    cudaFree(dB.matriz);
    cudaFree(dC.matriz);

   return;
}

#endif /* MATRIZ_CUH_ */

What I am getting: matrix C is filled with ones, no matter what I do. I am using this program to get an idea of how to work with variable-size matrices in a GPU program. What is wrong with my code?

Any time you're having trouble with a CUDA code, it's good practice to do proper CUDA error checking and run your code with cuda-memcheck. When I run your code with cuda-memcheck, I get the indication that the kernel is trying to do out-of-bounds read operations. Since your kernel is trivially simple, it means that your indexing calculations must be incorrect.
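As a rough illustration of what such error checking can look like (a minimal sketch, not code from the question; the macro name is arbitrary), every runtime API call and the kernel launch itself can be checked like this:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define cudaCheck(call)                                                   \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                  \
                    cudaGetErrorString(err), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Wrap every runtime call, and check the launch itself afterwards:
//   cudaCheck(cudaMalloc((void**)&dA.matriz, dA.lin*dA.col*sizeof(double)));
//   addMatrix<<<dimGrid, dimBlock>>>(dA, dB, dC);
//   cudaCheck(cudaGetLastError());        // configuration/launch errors
//   cudaCheck(cudaDeviceSynchronize());   // errors raised while the kernel runs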

Your program needs at least 2 changes to get it working for small square matrices:

  1. The index calculations in the kernel for A, B, and C should all be the same:

     C.matriz[C.lin*idy + idx] = A.matriz[A.lin*idx + idy] + B.matriz[B.lin*idx + idy]; 

    like this:

     C.matriz[C.lin*idy + idx] = A.matriz[A.lin*idy + idx] + B.matriz[B.lin*idy + idx]; 
  2. Your x/y index creation in the kernel is not correct:

     int idx = threadIdx.x + blockDim.x*gridDim.x;
     int idy = threadIdx.y + blockDim.y*gridDim.y;

    they should be:

     int idx = threadIdx.x + blockDim.x*blockIdx.x;
     int idy = threadIdx.y + blockDim.y*blockIdx.y;

    (gridDim.x is the number of blocks in the grid, not the index of the current block, so in your single-block 10x10 launch every thread computed an index of at least 10, past the end of the matrix; that is exactly the out-of-bounds access cuda-memcheck reports. blockIdx.x and blockIdx.y give each block its own offset.)

With the above changes, I was able to get rational-looking output.

Your setup code also doesn't appear to be handling larger matrices correctly:

int BLOCK_SIZE = A.lin;
...
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(dA.lin/dimBlock.x, dA.col/dimBlock.y);

You probably want something like:

int BLOCK_SIZE = 16;
...
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid((dA.lin + dimBlock.x - 1)/dimBlock.x, (dA.col + dimBlock.y -1)/dimBlock.y);
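For example, with lin = col = 100 and BLOCK_SIZE = 16 this launches a 7 x 7 grid of 16 x 16 blocks ((100 + 15)/16 = 7), so 49 * 256 = 12544 threads cover the 10000 elements; the bounds check added to the kernel below stops the surplus threads from reading or writing out of range.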

With those changes, you should add a valid thread check to your kernel, something like this:

__global__ void addMatrix(const Matriz A, const Matriz B, Matriz C)
{
    int idx = threadIdx.x + blockDim.x*blockIdx.x;
    int idy = threadIdx.y + blockDim.y*blockIdx.y;

    if ((idx < A.col) && (idy < A.lin))
      C.matriz[C.lin*idy + idx] = A.matriz[A.lin*idy + idx] + B.matriz[B.lin*idy + idx];
}

I also haven't validated that you are comparing all dimensions correctly against the appropriate row (lin) or column (col) limits. That is something else to verify for non-square matrices.
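One possible way to line the checks up for non-square matrices (just a sketch, assuming the jj*lin + ii layout used by the host code, i.e. idx acts as the row index and idy as the column index):

__global__ void addMatrix(const Matriz A, const Matriz B, Matriz C)
{
    int idx = threadIdx.x + blockDim.x*blockIdx.x;   // row index, 0 .. lin-1
    int idy = threadIdx.y + blockDim.y*blockIdx.y;   // column index, 0 .. col-1

    // The host stores element (row ii, column jj) at jj*lin + ii,
    // so under that layout idx should be bounded by lin and idy by col:
    if ((idx < A.lin) && (idy < A.col))
      C.matriz[C.lin*idy + idx] = A.matriz[A.lin*idy + idx] + B.matriz[B.lin*idy + idx];
}

With that reading, the grid sizing above already lines up: dimGrid.x covers lin and dimGrid.y covers col.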
