CUDA C++: simple matrix multiplication error

Question

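I have just started CUDA programming in C++, so I apologize for such a simple question. I simply cannot figure out where I am going wrong. I am trying to do a matrix multiplication. I drew inspiration from several sources, so I may have mixed up some different approaches. I am trying to multiply two matrices, h_a and h_b. I generate both matrices successfully, but when I allocate memory for them I lose the values in the matrices for some reason, and even after the multiplication all values are zero. Below is the code: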

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <ctime>
#include <stdio.h>
#include <iostream>
#include <math.h>

using namespace std;


__global__ void MulKernel(int *c, const int *a, const int *b, const int P)
{
    float tempsum;
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int col = blockIdx.x*blockDim.x + threadIdx.x;
    if (row < P && col < P){
        for (int i = 0; i < P; i++){
            tempsum += a[row*P + i] * b[i*P + col];
        }
    }
    c[row*P + col] = tempsum;
}


int main()
{

srand(time(NULL));
int *pointer;
int N = 16;
int SIZE = N*N;

int *h_a = new int[SIZE];
int *h_b = new int[SIZE];
int *h_c = new int[SIZE];

for (int i = 0; i < SIZE; i++) {
            h_a[i] = rand() % 1000;
            h_b[i] = rand() % 1000;
    } 
cout << "First values " << h_a[0] << " " << h_b[0] << endl;
    cudaMalloc(&h_a, sizeof(int)*SIZE);
    cudaMalloc(&h_b, sizeof(int)*SIZE);
    cudaMalloc(&h_c, sizeof(int)*SIZE);
    cudaMalloc(&pointer, sizeof(int));

    cout << "Second values " << h_a[0] << " " << h_b[0] << endl;

    cudaMemcpy(h_a, &h_a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(h_b, &h_b, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(pointer, &N, sizeof(int), cudaMemcpyHostToDevice);

    cout << "Third values " << h_a[0] <<" "<< h_b[0] << endl;

    MulKernel <<<1, 256 >>>(h_c, h_a, h_b, N);

    cudaMemcpy(h_c, &h_c, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_a, &h_a, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b, &h_b, sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < 5; i++){
        cout << h_c[i] << "=" << h_a[i] << h_b[i] << endl;
    }
    cout << h_c[1] << endl;
    cudaFree(h_a);
    cudaFree(h_b);
    cudaFree(h_c);
    return 0;
}

The output in the terminal is:

First values 454 964
Second values 0 0
Third values 0 0
0=00
0=00
0=00
0=00
0=00
0
Press any key to continue . . .

I hope someone can spot the error.

Best regards

Answer 1

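There are a number of problems with your code.

1. Any time you are having trouble with a CUDA code, I recommend proper CUDA error checking as well as running your code with cuda-memcheck. In this case, you have made programming errors that actually result in a seg fault, so these methods aren't that useful.
2. Your kernel is mostly workable. There are 3 issues. First, you are doing int multiplication but have declared your tempsum variable as float. That is probably not a huge problem, but it is inconsistent with the rest of your kernel. Second, you don't initialize tempsum (it should be set to zero). Third, your thread check (i.e. the if statement) is slightly misplaced. You should condition your kernel so that c is not written when the thread is out of bounds.
3. You may be confused about host and device variables. We don't allocate a host variable with new and then do a cudaMalloc operation on that same pointer. That is not how things work. We need to create an equivalent set of variables to store the data on the device. Let's call those *d_a etc. We call cudaMalloc on those to allocate device space, and then use them as the device-side variables in the cudaMemcpy operations.
4. Your kernel expects a 2D thread array (so that the .x and .y built-in variables in the kernel have meaning), but you are defining the thread array with a 1D variable. This needs to be fixed at the kernel launch (i.e. use a dim3 variable to define a 2D array). Also, the kernel launch should pass the d_a etc. variables, which are the device pointers.
5. You may be confused about how to handle a variable like N when passing it to the kernel. We can pass it directly (by value) without any gymnastics involving the pointer you created.
6. The transfer sizes in your cudaMemcpy operations are wrong. As with memcpy, you need to specify the transfer size in bytes, so most of your transfer sizes need to be multiplied by SIZE.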
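
Item 1 recommends proper CUDA error checking. A minimal sketch of what that can look like is below. The macro `CUDA_CHECK` and the functions `launchStatus` and `checkedAllocation` are hypothetical names introduced for illustration, not part of the original code; only the CUDA runtime calls themselves (`cudaGetLastError`, `cudaDeviceSynchronize`, `cudaGetErrorString`, and so on) are real API.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not from the original post): print a readable
// message and abort if a CUDA runtime call does not return cudaSuccess.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

__global__ void EmptyKernel() {}

// Launches one block of n x n threads and returns the resulting status
// instead of aborting, so the caller can inspect it.
cudaError_t launchStatus(int n)
{
    EmptyKernel<<<1, dim3(n, n)>>>();
    cudaError_t err = cudaGetLastError();   // launch-configuration errors
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();      // errors raised while the kernel ran
    return err;
}

// Allocates, clears, and frees a small device buffer, aborting on any failure.
void checkedAllocation()
{
    int *d_buf = nullptr;
    CUDA_CHECK(cudaMalloc(&d_buf, 256 * sizeof(int)));
    CUDA_CHECK(cudaMemset(d_buf, 0, 256 * sizeof(int)));
    CUDA_CHECK(cudaFree(d_buf));
}
```

A kernel launch returns no status of its own, which is why the sketch queries `cudaGetLastError()` right after the launch and then synchronizes. With this in place, `launchStatus(16)` should report `cudaSuccess`, while a request such as `launchStatus(64)` (4096 threads in one block) is expected to come back with an invalid-configuration error. On recent CUDA toolkits, `compute-sanitizer` takes the place of `cuda-memcheck`.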

Here is a modified version of your code that addresses the issues above:

$ cat t1073.cu
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <ctime>
#include <stdio.h>
#include <iostream>
#include <math.h>

using namespace std;


__global__ void MulKernel(int *c, const int *a, const int *b, const int P)
{
    int tempsum=0;
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int col = blockIdx.x*blockDim.x + threadIdx.x;
    if (row < P && col < P){
        for (int i = 0; i < P; i++){
            tempsum += a[row*P + i] * b[i*P + col];
        }
        c[row*P + col] = tempsum;
    }
}


int main()
{

    srand(time(NULL));
    int N = 16;
    int SIZE = N*N;

    int *h_a = new int[SIZE];
    int *h_b = new int[SIZE];
    int *h_c = new int[SIZE];

    for (int i = 0; i < SIZE; i++) {
            h_a[i] = rand() % 1000;
            h_b[i] = rand() % 1000;
    }
    cout << "First values " << h_a[0] << " " << h_b[0] << endl;
    int *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, sizeof(int)*SIZE);
    cudaMalloc(&d_b, sizeof(int)*SIZE);
    cudaMalloc(&d_c, sizeof(int)*SIZE);

    cout << "Second values " << h_a[0] << " " << h_b[0] << endl;

    cudaMemcpy(d_a, h_a, sizeof(int)*SIZE, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, sizeof(int)*SIZE, cudaMemcpyHostToDevice);

    cout << "Third values " << h_a[0] <<" "<< h_b[0] << endl;

    MulKernel <<<1, dim3(N,N) >>>(d_c, d_a, d_b, N);

    cudaMemcpy(h_c, d_c, sizeof(int)*SIZE, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_a, d_a, sizeof(int)*SIZE, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b, d_b, sizeof(int)*SIZE, cudaMemcpyDeviceToHost);

    for (int i = 0; i < 5; i++){
        cout << h_c[i] << "=" << h_a[i] << h_b[i] << endl;
    }
    cout << h_c[1] << endl;
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}
$ nvcc -o t1073 t1073.cu
$ cuda-memcheck ./t1073
========= CUDA-MEMCHECK
First values 698 173
Second values 698 173
Third values 698 173
5502745=698173
5866060=120710
3945532=646669
4432346=582703
4971909=746272
5866060
========= ERROR SUMMARY: 0 errors
$

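Personally, I can't easily interpret the output, and I'm not sure why you chose the = sign. For a matrix multiplication, c[i] is not equal to a[i] * b[i], if that is what you had in mind. (The two operands are also printed without a separator, so 698173 above is 698 followed by 173.) If you want a simple test that is easy to check by eye, try setting both the a and b matrices to all 1s. Then a correct output is easy to spot: every element should be N.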
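
The all-ones check can also be reproduced on the host, which gives a reference to compare the GPU result against. The sketch below is plain host code that compiles unchanged in a .cu file; `cpuMatMul` and `allOnesProductIsP` are hypothetical names chosen for illustration, and the row-major indexing is assumed to match `MulKernel`.

```cuda
#include <vector>

// Hypothetical CPU reference: multiplies two P x P row-major matrices using
// the same indexing as MulKernel (a[row*P + i] * b[i*P + col]).
std::vector<int> cpuMatMul(const std::vector<int> &a,
                           const std::vector<int> &b, int P)
{
    std::vector<int> c(P * P, 0);
    for (int row = 0; row < P; row++) {
        for (int col = 0; col < P; col++) {
            int sum = 0;
            for (int i = 0; i < P; i++) {
                sum += a[row * P + i] * b[i * P + col];
            }
            c[row * P + col] = sum;
        }
    }
    return c;
}

// Returns true when the product of two P x P all-ones matrices has every
// element equal to P, which is the property the visual test relies on.
bool allOnesProductIsP(int P)
{
    std::vector<int> ones(P * P, 1);
    std::vector<int> c = cpuMatMul(ones, ones, P);
    for (int v : c) {
        if (v != P) return false;
    }
    return true;
}
```

For random inputs, the same `cpuMatMul` can be run on `h_a` and `h_b` and compared element by element with the `h_c` copied back from the device.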

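Also note that, for brevity, I have not tried to teach you every aspect of CUDA programming here; I have only fixed some errors. As just one example, this code will break if you set N to a value larger than 32: the launch uses a single block of N x N threads, and a block is limited to 1024 threads. You may need to learn more about CUDA programming to work around limits like this.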
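
One common way to lift the N <= 32 restriction is to launch a 2D grid of fixed-size blocks rather than one N x N block. The sketch below reuses `MulKernel` from the answer unchanged; the wrapper `multiplyTiled` and the block edge `TILE = 16` are assumptions made for illustration, and error checking is left out to keep it short.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Same kernel as in the answer: one thread per output element, with the
// bounds check guarding both the loop and the write to c.
__global__ void MulKernel(int *c, const int *a, const int *b, const int P)
{
    int tempsum = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < P && col < P) {
        for (int i = 0; i < P; i++) {
            tempsum += a[row * P + i] * b[i * P + col];
        }
        c[row * P + col] = tempsum;
    }
}

// Hypothetical host wrapper: multiplies two N x N row-major matrices on the
// device using a grid of TILE x TILE blocks, so N may exceed 32.
std::vector<int> multiplyTiled(const std::vector<int> &h_a,
                               const std::vector<int> &h_b, int N)
{
    const size_t bytes = sizeof(int) * N * N;
    std::vector<int> h_c(N * N);

    int *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b.data(), bytes, cudaMemcpyHostToDevice);

    const int TILE = 16;                        // 16 x 16 = 256 threads per block
    dim3 threads(TILE, TILE);
    dim3 blocks((N + TILE - 1) / TILE,          // round up so the grid covers
                (N + TILE - 1) / TILE);         // all N columns and N rows
    MulKernel<<<blocks, threads>>>(d_c, d_a, d_b, N);

    cudaMemcpy(h_c.data(), d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return h_c;
}
```

Because the grid is rounded up, some threads fall outside the matrix when N is not a multiple of TILE; the `row < P && col < P` check in the kernel is what keeps those threads from writing out of bounds. Calling `multiplyTiled` on two 100 x 100 all-ones matrices should give 100 in every element.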

Answer 1
1 Accepted 2016-02-11 16:03:29
