CUDA C++: simple matrix multiplication error

Question

I am quite new to CUDA programming with C++, so sorry for this simple question. I simply cannot figure out where I am going wrong. I am trying to do a matrix multiplication. I took inspiration from several sources, so I may have mixed up some different methods. I am trying to multiply two matrices, h_a and h_b. I successfully generate the two matrices, but when I allocate the memory for them, I for some reason lose their values, and even after the multiplication all values are zero. Below is the code:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <ctime>
#include <stdio.h>
#include <iostream>
#include <math.h>

using namespace std;


__global__ void MulKernel(int *c, const int *a, const int *b, const int P)
{
    float tempsum;
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int col = blockIdx.x*blockDim.x + threadIdx.x;
    if (row < P && col < P){
        for (int i = 0; i < P; i++){
            tempsum += a[row*P + i] * b[i*P + col];
        }
    }
    c[row*P + col] = tempsum;
}


int main()
{

srand(time(NULL));
int *pointer;
int N = 16;
int SIZE = N*N;

int *h_a = new int[SIZE];
int *h_b = new int[SIZE];
int *h_c = new int[SIZE];

for (int i = 0; i < SIZE; i++) {
            h_a[i] = rand() % 1000;
            h_b[i] = rand() % 1000;
    } 
cout << "First values " << h_a[0] << " " << h_b[0] << endl;
    cudaMalloc(&h_a, sizeof(int)*SIZE);
    cudaMalloc(&h_b, sizeof(int)*SIZE);
    cudaMalloc(&h_c, sizeof(int)*SIZE);
    cudaMalloc(&pointer, sizeof(int));

    cout << "Second values " << h_a[0] << " " << h_b[0] << endl;

    cudaMemcpy(h_a, &h_a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(h_b, &h_b, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(pointer, &N, sizeof(int), cudaMemcpyHostToDevice);

    cout << "Third values " << h_a[0] <<" "<< h_b[0] << endl;

    MulKernel <<<1, 256 >>>(h_c, h_a, h_b, N);

    cudaMemcpy(h_c, &h_c, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_a, &h_a, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b, &h_b, sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < 5; i++){
        cout << h_c[i] << "=" << h_a[i] << h_b[i] << endl;
    }
    cout << h_c[1] << endl;
    cudaFree(h_a);
    cudaFree(h_b);
    cudaFree(h_c);
    return 0;
}

The output in the terminal reads:

First values 454 964
Second values 0 0
Third values 0 0
0=00
0=00
0=00
0=00
0=00
0
Press any key to continue . . .

I hope someone can spot the error(s).

Best regards

Answer 1

There are quite a few issues with your code.

Any time you're having trouble with a CUDA code, I recommend proper CUDA error checking as well as running your code with cuda-memcheck. In this case, you've made programming errors that actually result in a seg fault, so these methods aren't that useful.
Your kernel is mostly workable. There are 3 issues. First, you are performing int multiplication but have declared your tempsum variable as float. That probably isn't a huge issue, but it is inconsistent with the int data the kernel works on. Second, you are not initializing tempsum (it should be set to zero). Third, your thread check (i.e. the if statement) is slightly misplaced. The write to c should be inside it, so that a thread that is out of bounds does not write to c.
You're probably confused about host and device variables. We don't allocate a host variable with new and then do a cudaMalloc operation on the same pointer. That's not how things work. We need to create an equivalent set of variables to store data on the device. Let's call those *d_a etc. We'll call cudaMalloc on those to allocate device space, then we'll use those in the cudaMemcpy operations as the device-side variables.
Your kernel is expecting a 2D thread array (so that the .x and .y built-in variables in the kernel have meaning). But you are defining the thread array using 1D variables. That needs to be fixed in your kernel launch (i.e. define a 2D array using dim3 variables). Likewise, the kernel launch should pass d_a etc., the variables that are device pointers.
You may be confused about how to handle a variable like N when passing it to the kernel. We can pass that directly (by value) without any of the gymnastics with pointer that you have created.
You have the transfer sizes wrong in your cudaMemcpy operations. As with memcpy, you need to specify the transfer size in bytes, so we need to multiply most of your transfer sizes by SIZE. The source and destination arguments are also wrong: &h_a is the address of the host pointer variable itself, not of the data it points to.
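
The byte-count point in the last item can be seen with plain host-side memcpy, without a GPU. The following is only a sketch to illustrate it: the helper copy_bytes is a hypothetical name, not something from the original code, and it uses std::memcpy as a stand-in on the assumption that cudaMemcpy treats its size argument the same way (as a number of bytes).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: copy `bytes` bytes from src into a zero-filled
// destination with room for src.size() ints. The size argument is a byte
// count, exactly as with memcpy.
std::vector<int> copy_bytes(const std::vector<int>& src, std::size_t bytes)
{
    // Precondition: never read past the end of the source buffer.
    assert(bytes <= src.size() * sizeof(int));
    std::vector<int> dst(src.size(), 0);
    if (bytes > 0) {
        std::memcpy(dst.data(), src.data(), bytes);
    }
    return dst;
}
```

Calling copy_bytes(a, sizeof(int)) transfers only the first element and leaves the rest of the destination at zero, while copy_bytes(a, sizeof(int) * a.size()) transfers the whole array. The second form is what a transfer size of sizeof(int)*SIZE expresses.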

Here's a modified version of your code with the above issues addressed:

$ cat t1073.cu
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <ctime>
#include <stdio.h>
#include <iostream>
#include <math.h>

using namespace std;


__global__ void MulKernel(int *c, const int *a, const int *b, const int P)
{
    int tempsum=0;
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int col = blockIdx.x*blockDim.x + threadIdx.x;
    if (row < P && col < P){
        for (int i = 0; i < P; i++){
            tempsum += a[row*P + i] * b[i*P + col];
        }
        c[row*P + col] = tempsum;
    }
}


int main()
{

    srand(time(NULL));
    int N = 16;
    int SIZE = N*N;

    int *h_a = new int[SIZE];
    int *h_b = new int[SIZE];
    int *h_c = new int[SIZE];

    for (int i = 0; i < SIZE; i++) {
            h_a[i] = rand() % 1000;
            h_b[i] = rand() % 1000;
    }
    cout << "First values " << h_a[0] << " " << h_b[0] << endl;
    int *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, sizeof(int)*SIZE);
    cudaMalloc(&d_b, sizeof(int)*SIZE);
    cudaMalloc(&d_c, sizeof(int)*SIZE);

    cout << "Second values " << h_a[0] << " " << h_b[0] << endl;

    cudaMemcpy(d_a, h_a, sizeof(int)*SIZE, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, sizeof(int)*SIZE, cudaMemcpyHostToDevice);

    cout << "Third values " << h_a[0] <<" "<< h_b[0] << endl;

    MulKernel <<<1, dim3(N,N) >>>(d_c, d_a, d_b, N);

    cudaMemcpy(h_c, d_c, sizeof(int)*SIZE, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_a, d_a, sizeof(int)*SIZE, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b, d_b, sizeof(int)*SIZE, cudaMemcpyDeviceToHost);

    for (int i = 0; i < 5; i++){
        cout << h_c[i] << "=" << h_a[i] << h_b[i] << endl;
    }
    cout << h_c[1] << endl;
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}
$ nvcc -o t1073 t1073.cu
$ cuda-memcheck ./t1073
========= CUDA-MEMCHECK
First values 698 173
Second values 698 173
Third values 698 173
5502745=698173
5866060=120710
3945532=646669
4432346=582703
4971909=746272
5866060
========= ERROR SUMMARY: 0 errors
$

Personally, I can't interpret the output easily, and I'm not sure why you've chosen the = sign. For matrix multiplication, c[i] is not equal to a[i]*b[i], if that's what you were thinking. If you want a simple test that is easy to check visually, try setting both the a and b matrices to all 1. Then a correct output is easy to spot: every element should be N.
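
One way to make that check automatic is a small host-side reference to compare against. The sketch below is an assumption-laden illustration, not part of the original answer: matmul_reference and all_equal are hypothetical names, and the code runs entirely on the CPU using the same row-major indexing as the kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference: multiply two P x P row-major matrices
// with the same indexing the kernel uses (row*P + i and i*P + col).
std::vector<int> matmul_reference(const std::vector<int>& a,
                                  const std::vector<int>& b, int P)
{
    assert(P >= 0);
    const std::size_t n = static_cast<std::size_t>(P) * P;
    assert(a.size() == n && b.size() == n);
    std::vector<int> c(n, 0);
    for (int row = 0; row < P; row++) {
        for (int col = 0; col < P; col++) {
            int tempsum = 0;  // starts at zero, as in the corrected kernel
            for (int i = 0; i < P; i++) {
                tempsum += a[row * P + i] * b[i * P + col];
            }
            c[row * P + col] = tempsum;
        }
    }
    return c;
}

// Returns true if every element of c equals `expected`.
bool all_equal(const std::vector<int>& c, int expected)
{
    for (int v : c) {
        if (v != expected) return false;
    }
    return true;
}
```

With both inputs filled with 1, all_equal(matmul_reference(ones, ones, N), N) should hold. The same reference result can be compared element by element against the h_c array copied back from the device.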

Also note that for brevity, I've not tried to teach you every aspect of CUDA programming in this answer, just fixed some mistakes. As just one example, this code will break if you set N to a value larger than 32. You may need to learn more about CUDA programming to understand why that is.
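
As a hint on the N > 32 case: the launch above uses a single block of N*N threads, and GPUs of compute capability 2.0 and later allow at most 1024 threads per block, so N = 33 would request 1089. A common remedy is a fixed block size plus enough blocks to cover the matrix. The host-side arithmetic is sketched below in plain C++. LaunchShape and launch_shape_for are hypothetical names, the 1024 limit is written in as an assumption, and in real CUDA code the four values would go into two dim3 variables.

```cpp
#include <cassert>

// Hypothetical stand-in for the two dim3 values of a kernel launch.
struct LaunchShape {
    int grid_x, grid_y;    // number of blocks in x and y
    int block_x, block_y;  // threads per block in x and y
};

// Choose a tile x tile thread block and enough blocks to cover an
// N x N matrix.
LaunchShape launch_shape_for(int N, int tile = 16)
{
    assert(N > 0 && tile > 0);
    assert(tile * tile <= 1024);  // assumed per-block thread limit
    const int blocks = (N + tile - 1) / tile;  // ceiling division
    return LaunchShape{blocks, blocks, tile, tile};
}
```

For N = 33 and the default tile of 16 this gives a 3 x 3 grid of 16 x 16 blocks, which is 48 threads per dimension. The surplus threads in the edge blocks are the ones that the if (row < P && col < P) check in the kernel keeps from writing out of bounds.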

Answer 1: accepted, 2016-02-11 16:03:29
