
How to implement a custom type that allocates in the __device__ heap from __host__ in CUDA C++?

What I want to achieve is: implement a custom type that can be used from the __device__ side; it needs to allocate memory on the heap, and I'd like to allocate it from the __host__. I have this:

#include <iostream>

template <typename T>
class Array{
    T *arr;
    size_t* _size;
public:
    
    /**
     * Move constructor; this allows the array to be passed
     * as a return value from a function, swapping the pointers
     * and keeping the allocated data on the heap.
     */
    __device__
    Array(Array&& other){
        arr = other.arr;
        other.arr = NULL;
    }


    __host__
    Array(T* other_arr, size_t size){
        cudaMalloc(&_size, sizeof(size_t));
        cudaMalloc(&arr, sizeof(T) * (size + 1));

        cudaMemcpy(_size, &size, sizeof(size_t), cudaMemcpyHostToDevice);
        cudaMemcpy(arr, other_arr, sizeof(T) * size, cudaMemcpyHostToDevice);
    }

    /**
     * Destructor; deallocate heap
     */
    __host__
    ~Array(){
        cudaFree(_size);
        cudaFree(arr);
    }

    /**
     * Write access to the array
     * @param i index
     * @return reference to i-th element
     */
    __device__
    T &operator[](size_t i){
        if (i > *_size)
            return arr[*_size];
        return arr[i];
    }
    
    /**
     * Read only access to the array
     * @param i index
     * @return reference to i-th element
     */
    __device__
    const T &operator[](size_t i) const {
        if (i > *_size)
            return arr[*_size];
        return arr[i];
    }

    /** 
     * Get array size
     * @return array size
     */
    __device__
    size_t size() const {
        return *_size;
    }

    /** 
     * Resize array, dropping stored values
     */
    __device__
    void resize(size_t n){
        delete[] arr;
        *_size = n;
        arr = new T[*_size + 1];
    }
}; // class Array

/**
 * Returns the smallest element from an array
 * @param a Array
 * @return smallest element of `a`
 */
template<typename T>
__device__
T min(const Array<T>& a){
    T m = a[0];
    for(size_t i = 1; i < a.size(); i++)
        m = std::min(m, a[i]);
    return m;
}

/**
 * Returns the largest element from an array
 * @param a Array
 * @return largest element of `a`
 */
template<typename T>
__device__
T max(const Array<T>& a){
    T m = a[0];
    for(size_t i = 1; i < a.size(); i++)
        m = std::max(m, a[i]);
    return m;
}

__global__ void k_sum_array(Array<int>* arr, int* s){
    
    *s = 0;
    for(size_t i = 0; i < arr->size(); i++)
        *s += arr->operator[](i);
}

int main(){
    int a[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    size_t size = 10;

    Array<int> arr(a, size);
    int* s;
    cudaMalloc(&s, sizeof(int));

    k_sum_array<<<1, 1>>>(&arr, s);

    // the next line was causing a segfault because you can't access device data from the host (CUDA 101)
    // std::cout << *s << std::endl;

    int hs;
    cudaMemcpy(&hs, s, sizeof(int), cudaMemcpyDeviceToHost);
    std::cout << hs << std::endl;
    

    return 0;
}

It doesn't give the expected result. Any thoughts on how to achieve what I want?

In CUDA, this won't work:

cudaMalloc(&s, sizeof(int));
...
std::cout << *s << std::endl;

You cannot access device memory from host code.
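
A minimal sketch of the usual fix, mirroring what the edited question already does: copy the value back to host memory with cudaMemcpy before reading it on the host:

int hs;                                                   // host-side copy of the result
cudaMemcpy(&hs, s, sizeof(int), cudaMemcpyDeviceToHost);  // device -> host copy
std::cout << hs << std::endl;                             // safe: hs lives in host memory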

This is also problematic:

k_sum_array<<<1, 1>>>(&arr, s);
                      ^

The address of arr is a pointer to host memory. That is going to be useless in CUDA device code. Let's recap. In CUDA:

  1. Host code cannot directly access (ordinary) device memory.
  2. Device code cannot directly access (ordinary) host memory.

The first issue is fairly straightforward to fix. You have already edited your post to do that.

The second issue requires some refactoring, and I'm sure there are several ways to proceed at this point:

  1. Use pass-by-pointer correctly (copy the object to the device first)
  2. Use pass-by-value
  3. Use managed memory (see the sketch after this list)
  4. probably other methods
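
For completeness, here is a minimal sketch of option 3 (managed memory). This is a hypothetical illustration, not the refactoring shown below: cudaMallocManaged returns a pointer that is valid in both host and device code, so the Array object itself can live in managed memory and be handed to the kernel directly. The name m_arr is just illustrative, placement new requires <new>, and error checking and cleanup are omitted:

// hypothetical sketch, inside main:
Array<int> *m_arr;
cudaMallocManaged(&m_arr, sizeof(Array<int>));   // pointer valid on both host and device
new (m_arr) Array<int>(a, size);                 // construct in place with the __host__ constructor
k_sum_array<<<1, 1>>>(m_arr, s);                 // pass the managed pointer straight to the kernel
cudaDeviceSynchronize();                         // wait for the kernel before the host reads results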

The thing I observe is that there is basically no need to pass arr by pointer; passing by value should be fine. CUDA handles that properly. But it can be somewhat involved.

If we convert to pass-by-value, then we need to refactor the device code accordingly. Additionally, pass-by-value in C++, when passing objects, creates an implicit object creation/destruction sequence around the function call to support pass-by-value. This complicates operations around the kernel call. The object destructor will get called implicitly, and this sometimes trips people up. A simple solution is not to call cudaFree in the destructor. In addition, your object copy constructor is wrong (it doesn't copy _size), and we will need an additional form of the copy constructor due to the kernel-call pass-by-value mechanism.
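
As a rough illustration of what the pass-by-value route would involve (a hypothetical sketch, not the code shown below; the exact execution-space decorations are an assumption, release() is a hypothetical helper name, and ownership/cleanup is glossed over):

// copy constructor: must copy *both* device pointers, since the kernel launch copies the argument
__host__ __device__
Array(const Array& other) : arr(other.arr), _size(other._size) {}

// destructor: must NOT cudaFree, because the temporary made for the launch is destroyed
// while the original object still owns the allocations; free via a separate release() instead
__host__ __device__
~Array(){}

// the kernel then takes the object by value:
// __global__ void k_sum_array(Array<int> arr, int* s){ ... arr.size() ... }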

So in the interest of simplicity, I'll show a refactoring using pass-by-pointer.

The following code makes the change to provide the object as a proper entity in device memory. The only changes are in main, around the handling of arr:

$ cat t2104.cu
#include <iostream>

template <typename T>
class Array{
    T *arr;
    size_t* _size;
public:

    /**
     * Move constructor; this allows the array to be passed
     * as a return value from a function, swapping the pointers
     * and keeping the allocated data on the heap.
     */
    __device__
    Array(Array&& other){
        arr = other.arr;
        other.arr = NULL;
    }


    __host__
    Array(T* other_arr, size_t size){
        cudaMalloc(&_size, sizeof(size_t));
        cudaMalloc(&arr, sizeof(T) * (size + 1));

        cudaMemcpy(_size, &size, sizeof(size_t), cudaMemcpyHostToDevice);
        cudaMemcpy(arr, other_arr, sizeof(T) * size, cudaMemcpyHostToDevice);
    }

    /**
     * Destructor; deallocate heap
     */
    __host__
    ~Array(){
        cudaFree(_size);
        cudaFree(arr);
    }

    /**
     * Write access to the array
     * @param i index
     * @return reference to i-th element
     */
    __device__
    T &operator[](size_t i){
        if (i > *_size)
            return arr[*_size];
        return arr[i];
    }

    /**
     * Read only access to the array
     * @param i index
     * @return reference to i-th element
     */
    __device__
    const T &operator[](size_t i) const {
        if (i > *_size)
            return arr[*_size];
        return arr[i];
    }

    /**
     * Get array size
     * @return array size
     */
    __device__
    size_t size() const {
        return *_size;
    }

    /**
     * Resize array, dropping stored values
     */
    __device__
    void resize(size_t n){
        delete[] arr;
        *_size = n;
        arr = new T[*_size + 1];
    }
}; // class Array

/**
 * Returns the smallest element from an array
 * @param a Array
 * @return smallest element of `a`
 */
template<typename T>
__device__
T min(const Array<T>& a){
    T m = a[0];
    for(size_t i = 1; i < a.size(); i++)
        m = std::min(m, a[i]);
    return m;
}

/**
 * Returns the largest element from an array
 * @param a Array
 * @return largest element of `a`
 */
template<typename T>
__device__
T max(const Array<T>& a){
    T m = a[0];
    for(size_t i = 1; i < a.size(); i++)
        m = std::max(m, a[i]);
    return m;
}

__global__ void k_sum_array(Array<int>* arr, int* s){

    *s = 0;
    for(size_t i = 0; i < arr->size(); i++)
        *s += arr->operator[](i);
}

int main(){
    int a[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    size_t size = 10;

    Array<int> arr(a, size);
    Array<int> *d_arr;
    cudaMalloc(&d_arr, sizeof(Array<int>));
    cudaMemcpy(d_arr, &arr, sizeof(Array<int>), cudaMemcpyHostToDevice);
    int* s;
    cudaMalloc(&s, sizeof(int));

    k_sum_array<<<1, 1>>>(d_arr, s);

    // the next line was causing a segfault because you can't access device data from the host (CUDA 101)
    // std::cout << *s << std::endl;

    int hs;
    cudaMemcpy(&hs, s, sizeof(int), cudaMemcpyDeviceToHost);
    std::cout << hs << std::endl;


    return 0;
}
$ nvcc -o t2104 t2104.cu
$ compute-sanitizer ./t2104
========= COMPUTE-SANITIZER
55
========= ERROR SUMMARY: 0 errors
$

I'm not suggesting this fixes every possible defect in your code, merely that it seems to address the proximal issue(s) and seems to return the correct answer for the test case you have actually provided. I've already indicated that I don't think your copy constructor is right, I haven't looked at other functions like min and max, and I definitely find your handling/definition of the array _size to be quite strange, but none of that seems to be relevant to your test case.

I'm also completely ignoring your usage of the "device heap" terminology. I don't think you are using that terminology in a fashion that is consistent with how CUDA defines it, but it doesn't seem to be important to the discussion of the code you presented.
