
Matrix manipulation using CUDA

I am trying to write a program for matrix calculations using C/CUDA. I have the following program:

In main.cu

#include <cuda.h>
#include <iostream>
#include "teste.cuh"
using std::cout;

int main(void)
{
 const int Ndofs = 2;
 const int Nel   = 4;
 double *Gh   = new double[Ndofs*Nel*Ndofs*Nel];
 double *Gg;
 cudaMalloc((void**)& Gg, sizeof(double)*Ndofs*Nel*Ndofs*Nel);
 for (int ii = 0; ii < Ndofs*Nel*Ndofs*Nel; ii++)
  Gh[ii] = 0.;
 cudaMemcpy(Gh, Gg, sizeof(double)*Ndofs*Nel*Ndofs*Nel, cudaMemcpyHostToDevice);
 integraG<<<256, 256>>>(Nel, Gg);
 cudaMemcpy(Gg, Gh, sizeof(double)*Ndofs*Nel*Ndofs*Nel, cudaMemcpyDeviceToHost);
 for (int ii = 0; ii < Ndofs*Nel*Ndofs*Nel; ii++)
  cout << ii  + 1 << " " << Gh[ii] << "\n";
 return 0;
}

In teste.cuh

#ifndef TESTE_CUH_
#define TESTE_CUH_

__global__ void integraG(const int N, double* G)
{

    const int szmodel = 2*N;
    int idx = threadIdx.x + blockIdx.x*blockDim.x;
    int idy = threadIdx.y + blockIdx.y*blockDim.y;
    int offset = idx + idy*blockDim.x*gridDim.x;
    int posInit = szmodel*offset;

    G[posInit + 0] = 1;
    G[posInit + 1] = 1;
    G[posInit + 2] = 1;
    G[posInit + 3] = 1;
}

#endif

The result (which is supposed to be a matrix filled with 1's) is copied back to the host array. The problem is: nothing happens! Apparently, my program is not calling the GPU kernel, and I still get an array full of zeros.

I am very new to CUDA programming and I am using CUDA by Example (Jason Sanders) as a reference book.

My questions are:

  1. What is wrong with my code?
  2. Is this the best way to deal with matrices on the GPU, using matrices in vectorized (flattened) form?
  3. Is there another reference that provides more examples of matrix operations on GPUs?

These are your questions:

What is wrong with my code?

Is this the best way to deal with matrices on the GPU, using matrices in vectorized form?

Is there another reference that provides more examples of matrix operations on GPUs?

For your first question: first of all, your problem should be defined explicitly. What do you want to do with this code? What sort of calculations do you want to perform on the matrix?

Try to check for errors properly; wrapping every CUDA runtime call in an error-checking macro is a very good way to do so. There are some obvious bugs in your code as well. Some of your bugs:
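A minimal sketch of that error-checking pattern (the macro name is just a convention; the point is to check every runtime call, the kernel launch, and the kernel execution itself):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Report and abort on any CUDA runtime error, with file and line.
    #define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
    inline void gpuAssert(cudaError_t code, const char* file, int line)
    {
        if (code != cudaSuccess) {
            fprintf(stderr, "GPUassert: %s %s %d\n",
                    cudaGetErrorString(code), file, line);
            exit(code);
        }
    }

    // Usage around your calls:
    //   gpuErrchk(cudaMalloc((void**)&Gg, bytes));
    //   integraG<<<block, thread>>>(Nel, Gg);
    //   gpuErrchk(cudaGetLastError());        // catches launch-configuration errors
    //   gpuErrchk(cudaDeviceSynchronize());   // catches errors inside the kernel

Without this, a failed copy or launch fails silently and you see exactly the symptom you describe: an array that is still full of zeros.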

  1. You're passing the pointers to cudaMemcpy in the wrong order: the source and destination arguments have to be swapped with each other (the destination comes first, then the source).

Change them to: 将它们更改为:
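In cudaMemcpy the destination is the first argument and the source the second, so the two copies should read:

    // cudaMemcpy(destination, source, count, kind)
    cudaMemcpy(Gg, Gh, sizeof(double)*Ndofs*Nel*Ndofs*Nel, cudaMemcpyHostToDevice);
    // ... kernel launch ...
    cudaMemcpy(Gh, Gg, sizeof(double)*Ndofs*Nel*Ndofs*Nel, cudaMemcpyDeviceToHost);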

  2. "Ndofs*Nel*Ndofs*Nel" shows that you're interested in the values of the first 64 numbers of the array, so why launch 256 blocks of 256 threads each?

  3. This part of your code:

     int idx = threadIdx.x + blockIdx.x*blockDim.x;
     int idy = threadIdx.y + blockIdx.y*blockDim.y;

shows that you want to use 2-dimensional threads and blocks; to do that, you need to use the dim3 type.

By making the following changes:

 cudaMemcpy(Gg, Gh, sizeof(double)*Ndofs*Nel*Ndofs*Nel, cudaMemcpyHostToDevice); //HERE
 dim3 block(2,2); //HERE
 dim3 thread(4,4); //HERE
 integraG<<<block, thread>>>(Nel, Gg); //HERE
 cudaMemcpy(Gh, Gg, sizeof(double)*Ndofs*Nel*Ndofs*Nel, cudaMemcpyDeviceToHost); //HERE

You'll get a result like the following:

1 1
2 1
3 1
4 1
5 0
6 0
7 0
8 0
9 1
10 1
11 1
12 1
.
.
.
57 1
58 1
59 1
60 1
61 0
62 0
63 0
64 0

Anyway, if you state your problem and goal more clearly, better suggestions can be provided for you.

Regarding your last two questions:

In my opinion the CUDA C PROGRAMMING GUIDE and the CUDA C BEST PRACTICES GUIDE are the two must-read documents when starting with CUDA, and they include examples of matrix calculations as well.
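On the vectorized-form question: storing a matrix as a flat 1-D array is the standard approach in CUDA; what matters is a consistent index mapping and a bounds check so surplus threads don't write out of range. A minimal row-major sketch (the kernel name and fill value here are just for illustration):

    // An M x N matrix stored row-major in a flat array:
    // element (row, col) lives at A[row*N + col].
    __global__ void fillOnes(int M, int N, double* A)
    {
        int col = threadIdx.x + blockIdx.x * blockDim.x;
        int row = threadIdx.y + blockIdx.y * blockDim.y;
        if (row < M && col < N)       // guard: extra threads do nothing
            A[row * N + col] = 1.0;
    }

Launched with, say, dim3 thread(16,16) and a grid of ceil(N/16.0) x ceil(M/16.0) blocks, this touches each element exactly once; the guard is what your original kernel is missing.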
