
Can't use my template class in cuda kernel

I thought I knew how to write clean CUDA code, until I tried to make a simple template class and use it in a simple kernel. I've been troubleshooting for days, and every single thread I've visited has made me feel a little more stupid.

For error checking I used the standard gpuErrchk macro:
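
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}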

Here is my class.h:

#pragma once
template <typename T>
class MyArray
{
public:
    const int size;

    T *data;

    __host__ MyArray(int size); //gpuErrchk(cudaMalloc(&data, size * sizeof(T)));

    __device__ __host__ T GetValue(int); //return data[i]
    __device__ __host__ void SetValue(T, int); //data[i] = val;
    __device__ __host__ T& operator()(int); //return data[i];

    ~MyArray(); //gpuErrchk(cudaFree(data));
};

template class MyArray<double>;

The relevant content of class.cu is in the comments. If you think the whole thing is relevant, I'd be happy to add it.

Now for my main.cu:

__global__ void test(MyArray<double> array, double *data, int size)
{
    int j = threadIdx.x;
    //array.SetValue(1, j);  //doesn't work
    //array(j) = 1;  //doesn't work
    //array.data[j] = 1; //doesn't work
    data[j] = 1;   //This does work!
    printf("Reach this code\n");
}

int main(int argc, char **argv)
{
    MyArray<double> x(20);
    test<<<1, 20>>>(x, x.data, 20);

    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());
}

When I say "doesn't work", I mean that the program stops there (before reaching the printf) without outputting any error. On top of that, I get the following error both from cudaDeviceSynchronize and from cudaFree:

an illegal memory access was encountered

What I can't understand is that there should be no issue with memory management, since sending the array directly to the kernel works fine. So why doesn't it work when I send the class and try to access the class's data? And why do I receive no warning or error message when my code clearly ran into an error?

Here is the output of nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85

(Editorial note: there is quite a bit of disinformation in the comments on this question, so I have assembled an answer as a community wiki entry.)

There is no particular reason why a template class cannot be passed as an argument to a kernel. There are some limitations which need to be clearly understood before doing so:

  1. CUDA kernel arguments are, for all intents and purposes, always passed by value. Pass by reference is supported only under an extremely limited set of circumstances (the argument in question must be stored in managed memory). That does not apply here.
  2. As a result of (1), POD arguments just work, because they are trivially copyable and rely on no special behaviour.
  3. Classes are different: when you pass a class by value, you implicitly invoke copy construction or move construction semantics. That means that classes passed by value as kernel arguments must be trivially copy constructible. There is no way to run non-trivial copy constructors on the device as part of a kernel launch (see the sketch after this list).
  4. CUDA further requires that such classes contain no virtual members.
  5. Although the <<< >>> kernel launch syntax looks like a simple function call, it isn't. There are several layers of abstraction boilerplate and an API call between what you write in host code and what is actually emitted by the toolchain on the host side. This means that there are several copy construction operations between your code and the GPU. If you do something like put a cudaFree call in your destructor, you should assume that it will get called as part of the function call sequence which launches a kernel, when one of those copies falls out of scope. You do not want that.
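
As a compile-time sanity check on points (2) and (3), you can ask the compiler whether a type qualifies for pass-by-value kernel arguments. This is a minimal illustrative sketch; the type PlainArray is a hypothetical stand-in, not from the question:

#include <type_traits>

template <typename T>
struct PlainArray        // hypothetical stand-in for a kernel-argument class
{
    int len;
    T *data;             // raw device pointer; no constructors, destructor or virtuals
};

// Passing a class by value to a kernel is only safe if copying it amounts
// to a plain memcpy, i.e. the type is trivially copyable.
static_assert(std::is_trivially_copyable<PlainArray<double>>::value,
              "PlainArray<double> is not safe to pass by value to a kernel");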

You did not show how the class member functions were actually implemented, so it is impossible to say why each of the permutations hinted at in your code comments did or did not work. The one exception is passing the raw pointer to the kernel, which works because it is a trivially copyable POD value, while the class almost certainly was not.

Here is a simple, complete example showing how to make this work:

$ cat classy.cu
#include <vector>
#include <iostream>
#include <cstdio>

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

template <typename T>
class MyArray
{
    public:
        int len;
        T *data;

        __device__ __host__ void SetValue(T val, int i) { data[i] = val; };
        __device__ __host__ int size() { return sizeof(T) * len; };

        __host__ void DevAlloc(int N) {
            len = N;
            gpuErrchk(cudaMalloc(&data, size()));
        };

        __host__ void DevFree() {
            gpuErrchk(cudaFree(data));
            len = -1;
        };
};

__global__ void test(MyArray<double> array, double val)
{
    int j = threadIdx.x;
    if (j < array.len)
        array.SetValue(val, j);
}

int main(int argc, char **argv)
{
    const int N = 20;
    const double val = 5432.1;

    gpuErrchk(cudaSetDevice(0));
    gpuErrchk(cudaFree(0));

    MyArray<double> x;
    x.DevAlloc(N);

    test<<<1, 32>>>(x, val);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    std::vector<double> y(N);
    gpuErrchk(cudaMemcpy(&y[0], x.data, x.size(), cudaMemcpyDeviceToHost));
    x.DevFree();

    for(int i=0; i<N; ++i) std::cout << i << " = " << y[i] << std::endl;

    return 0;
}

which compiles and runs like so:

$ nvcc -std=c++11 -arch=sm_53 -o classy classy.cu
$ cuda-memcheck ./classy
========= CUDA-MEMCHECK
0 = 5432.1
1 = 5432.1
2 = 5432.1
3 = 5432.1
4 = 5432.1
5 = 5432.1
6 = 5432.1
7 = 5432.1
8 = 5432.1
9 = 5432.1
10 = 5432.1
11 = 5432.1
12 = 5432.1
13 = 5432.1
14 = 5432.1
15 = 5432.1
16 = 5432.1
17 = 5432.1
18 = 5432.1
19 = 5432.1
========= ERROR SUMMARY: 0 errors

(CUDA 10.2/gcc 7.5 on a Jetson Nano)

Note that I have included host-side functions for allocation and deallocation which deliberately do not interact with the constructor and destructor. Otherwise the class is extremely similar to your design and has the same properties. For contrast, the sketch below shows the destructor pattern to avoid.
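
What follows is a purely hypothetical reconstruction of the failure mode described in point (5); the class name BadArray and its members are illustrative assumptions, not the actual contents of the questioner's class.cu:

#include <cuda_runtime.h>

// ANTI-PATTERN (hypothetical sketch): device memory managed by the
// constructor/destructor of a class that is passed by value to a kernel.
template <typename T>
class BadArray
{
public:
    int len;
    T *data;

    __host__ BadArray(int N) : len(N) { cudaMalloc(&data, N * sizeof(T)); }

    // The kernel launch boilerplate copies this object by value. When one
    // of those temporary copies is destroyed, this destructor runs and the
    // device allocation is freed while the original object still points at it.
    __host__ ~BadArray() { cudaFree(data); }
};

Any later access to data, whether by the kernel or by the destructor of the original object, then touches a freed allocation, which is consistent with the "illegal memory access" error reported in the question.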
