CUDA C++, simple matrix multiplication error

Question

I am quite new to CUDA programming with C++, so sorry for this simple question. I cannot figure out where I am going wrong. I am trying to multiply two matrices, h_a and h_b. I drew inspiration from several sources, so I may have mixed up different methods. I successfully generate the two matrices, but when I allocate memory for them, I for some reason lose their values, and even after the multiplication all values are zero. Below is the code:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <ctime>
#include <stdio.h>
#include <iostream>
#include <math.h>

using namespace std;


__global__ void MulKernel(int *c, const int *a, const int *b, const int P)
{
    float tempsum;
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int col = blockIdx.x*blockDim.x + threadIdx.x;
    if (row < P && col < P){
        for (int i = 0; i < P; i++){
            tempsum += a[row*P + i] * b[i*P + col];
        }
    }
    c[row*P + col] = tempsum;
}


int main()
{

srand(time(NULL));
int *pointer;
int N = 16;
int SIZE = N*N;

int *h_a = new int[SIZE];
int *h_b = new int[SIZE];
int *h_c = new int[SIZE];

for (int i = 0; i < SIZE; i++) {
            h_a[i] = rand() % 1000;
            h_b[i] = rand() % 1000;
    } 
cout << "First values " << h_a[0] << " " << h_b[0] << endl;
    cudaMalloc(&h_a, sizeof(int)*SIZE);
    cudaMalloc(&h_b, sizeof(int)*SIZE);
    cudaMalloc(&h_c, sizeof(int)*SIZE);
    cudaMalloc(&pointer, sizeof(int));

    cout << "Second values " << h_a[0] << " " << h_b[0] << endl;

    cudaMemcpy(h_a, &h_a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(h_b, &h_b, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(pointer, &N, sizeof(int), cudaMemcpyHostToDevice);

    cout << "Third values " << h_a[0] <<" "<< h_b[0] << endl;

    MulKernel <<<1, 256 >>>(h_c, h_a, h_b, N);

    cudaMemcpy(h_c, &h_c, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_a, &h_a, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b, &h_b, sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < 5; i++){
        cout << h_c[i] << "=" << h_a[i] << h_b[i] << endl;
    }
    cout << h_c[1] << endl;
    cudaFree(h_a);
    cudaFree(h_b);
    cudaFree(h_c);
    return 0;
}

The output in the terminal reads:

First values 454 964
Second values 0 0
Third values 0 0
0=00
0=00
0=00
0=00
0=00
0
Press any key to continue . . .

I hope someone can spot the error(s)

Best regards

Answer 1

There are quite a few issues with your code.

Any time you're having trouble with CUDA code, I recommend proper CUDA error checking as well as running your code with cuda-memcheck. In this case, you've made programming errors that actually result in a seg fault, so these methods aren't that useful.
Your kernel is mostly workable. There are 3 issues. First, you are performing int multiplication but have declared your tempsum variable as float. That probably isn't a huge issue, but it is not consistent with the rest of your kernel. Second, you are not initializing tempsum (it should be set to zero). Third, you have your thread check (i.e. the if statement) slightly misplaced. You should condition the kernel so that it does not write to c if the thread is out of bounds.
You're probably confused about host and device variables. We don't allocate a host variable with new and then do a cudaMalloc operation on the same pointer. That's not how things work. We need to create an equivalent set of variables to store data on the device. Let's call those d_a etc. We'll call cudaMalloc on those to allocate device space, then we'll use those in the cudaMemcpy operations as the device-side variables.
Your kernel is expecting a 2D thread array (so that the .x and .y built-in variables in the kernel have meaning). But you are defining the thread array using 1D variables. That needs to be fixed in your kernel launch (i.e. define a 2D array using dim3 variables). Likewise, the kernel launch should pass the device pointers (d_a etc.).
You may be confused about how to handle a variable like N when passing it to the kernel. We can pass that directly (by value) without any of the gymnastics with pointer that you have created.
You have the transfer sizes wrong in your cudaMemcpy operations. As with memcpy, you need to specify the transfer size in bytes, so we need to multiply most of your transfer sizes by SIZE. The host-side argument should also be the pointer itself (h_a), not its address (&h_a).

Here's a modified version of your code with the above issues addressed:

$ cat t1073.cu
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <ctime>
#include <stdio.h>
#include <iostream>
#include <math.h>

using namespace std;


__global__ void MulKernel(int *c, const int *a, const int *b, const int P)
{
    int tempsum=0;
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int col = blockIdx.x*blockDim.x + threadIdx.x;
    if (row < P && col < P){
        for (int i = 0; i < P; i++){
            tempsum += a[row*P + i] * b[i*P + col];
        }
        c[row*P + col] = tempsum;
    }
}


int main()
{

    srand(time(NULL));
    int N = 16;
    int SIZE = N*N;

    int *h_a = new int[SIZE];
    int *h_b = new int[SIZE];
    int *h_c = new int[SIZE];

    for (int i = 0; i < SIZE; i++) {
            h_a[i] = rand() % 1000;
            h_b[i] = rand() % 1000;
    }
    cout << "First values " << h_a[0] << " " << h_b[0] << endl;
    int *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, sizeof(int)*SIZE);
    cudaMalloc(&d_b, sizeof(int)*SIZE);
    cudaMalloc(&d_c, sizeof(int)*SIZE);

    cout << "Second values " << h_a[0] << " " << h_b[0] << endl;

    cudaMemcpy(d_a, h_a, sizeof(int)*SIZE, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, sizeof(int)*SIZE, cudaMemcpyHostToDevice);

    cout << "Third values " << h_a[0] <<" "<< h_b[0] << endl;

    MulKernel <<<1, dim3(N,N) >>>(d_c, d_a, d_b, N);

    cudaMemcpy(h_c, d_c, sizeof(int)*SIZE, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_a, d_a, sizeof(int)*SIZE, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b, d_b, sizeof(int)*SIZE, cudaMemcpyDeviceToHost);

    for (int i = 0; i < 5; i++){
        cout << h_c[i] << "=" << h_a[i] << h_b[i] << endl;
    }
    cout << h_c[1] << endl;
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}
$ nvcc -o t1073 t1073.cu
$ cuda-memcheck ./t1073
========= CUDA-MEMCHECK
First values 698 173
Second values 698 173
Third values 698 173
5502745=698173
5866060=120710
3945532=646669
4432346=582703
4971909=746272
5866060
========= ERROR SUMMARY: 0 errors
$
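
The modified program above still ignores the return values of the CUDA runtime calls. As a minimal sketch of the "proper cuda error checking" recommended earlier, the calls can be wrapped in a macro. CUDA_CHECK is a hypothetical helper name, not part of the CUDA API; it only relies on cudaError_t, cudaSuccess and cudaGetErrorString from the runtime.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of the CUDA API): print the error
// string with file and line, then abort, if a runtime call fails.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Usage around the calls in the program above:
//   CUDA_CHECK(cudaMalloc(&d_a, sizeof(int) * SIZE));
//   CUDA_CHECK(cudaMemcpy(d_a, h_a, sizeof(int) * SIZE, cudaMemcpyHostToDevice));
//   MulKernel<<<1, dim3(N, N)>>>(d_c, d_a, d_b, N);
//   CUDA_CHECK(cudaGetLastError());       // reports an invalid launch configuration
//   CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel ran
```

A kernel launch returns no status itself, which is why the sketch checks cudaGetLastError() and cudaDeviceSynchronize() right after it.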

Personally, I can't interpret the output easily, and I'm not sure why you've chosen the = sign. For matrix multiplication, c[i] is not equal to a[i]*b[i], if that's what you were thinking. If you want a simple test that is easily understood visually, try setting both the a and b matrices to all 1. Then a correct output is easy to spot: every element should be N.
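
The all-ones test could be sketched as follows. allOnesTestPasses is a hypothetical helper name, and the sketch assumes a single block, so it only accepts N up to 32. It repeats the corrected kernel so that it compiles on its own.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Same kernel as in the corrected program.
__global__ void MulKernel(int *c, const int *a, const int *b, const int P)
{
    int tempsum = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < P && col < P) {
        for (int i = 0; i < P; i++) {
            tempsum += a[row * P + i] * b[i * P + col];
        }
        c[row * P + col] = tempsum;
    }
}

// Hypothetical helper: multiply two N x N all-ones matrices on the GPU and
// return true only if every element of the result equals N.
bool allOnesTestPasses(int N)
{
    if (N < 1 || N > 32) return false;  // single block: at most 32 x 32 threads
    const int SIZE = N * N;
    const size_t bytes = sizeof(int) * SIZE;
    std::vector<int> h_a(SIZE, 1), h_b(SIZE, 1), h_c(SIZE, 0);

    int *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    bool ok = cudaMalloc(&d_a, bytes) == cudaSuccess
           && cudaMalloc(&d_b, bytes) == cudaSuccess
           && cudaMalloc(&d_c, bytes) == cudaSuccess
           && cudaMemcpy(d_a, h_a.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(d_b, h_b.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        MulKernel<<<1, dim3(N, N)>>>(d_c, d_a, d_b, N);
        ok = cudaMemcpy(h_c.data(), d_c, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    // Freeing a null pointer is a no-op, so this is safe after a failed malloc.
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    // Each output element is a sum of N products 1 * 1, so it must equal N.
    for (int i = 0; ok && i < SIZE; i++) ok = (h_c[i] == N);
    return ok;
}
```

Calling allOnesTestPasses(16) from main() and printing the result gives a pass/fail check that does not depend on reading random values by eye.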

Also note that for brevity, I've not tried to teach you every aspect of CUDA programming in this answer, just fix some mistakes. As just one example, this code will break if you set N to a value larger than 32. You may need to learn more about CUDA programming to understand why that is.
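
The limit of 32 comes from the launch configuration: dim3(N,N) puts all N*N threads into one block, and a block can hold at most 1024 threads (32 x 32) on current GPUs. One common way around this, sketched here under assumptions, is to keep the block small and launch a 2D grid of blocks. TILE, blocksFor and launchMul are hypothetical names chosen for this illustration; the kernel is the corrected one, unchanged.

```cuda
#include <cuda_runtime.h>

// Same kernel as in the corrected program; the bounds check makes the
// surplus threads in edge blocks do nothing.
__global__ void MulKernel(int *c, const int *a, const int *b, const int P)
{
    int tempsum = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < P && col < P) {
        for (int i = 0; i < P; i++) {
            tempsum += a[row * P + i] * b[i * P + col];
        }
        c[row * P + col] = tempsum;
    }
}

// Assumed block edge: 16 x 16 = 256 threads per block.
const int TILE = 16;

// Hypothetical helper: number of blocks of width `tile` needed to cover
// n elements (integer ceiling division).
inline int blocksFor(int n, int tile) { return (n + tile - 1) / tile; }

// Hypothetical wrapper: d_a, d_b and d_c are device pointers to N x N matrices.
void launchMul(int *d_c, const int *d_a, const int *d_b, int N)
{
    dim3 block(TILE, TILE);
    dim3 grid(blocksFor(N, TILE), blocksFor(N, TILE));
    MulKernel<<<grid, block>>>(d_c, d_a, d_b, N);
}
```

With N = 100 this launches a 7 x 7 grid of 16 x 16 blocks. In the program above, the launch line would become launchMul(d_c, d_a, d_b, N).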

Answer 1 (accepted, 2016-02-11 16:03:29)