
CUDA: Wrapping device memory allocation in C++

I'm starting to use CUDA at the moment and have to admit that I'm a bit disappointed with the C API. I understand the reasons for choosing C, but had the language been based on C++ instead, several aspects would have been a lot simpler, e.g. device memory allocation (via cudaMalloc).

My plan was to do this myself, using overloaded operator new with placement new and RAII (two alternatives). I'm wondering if there are any caveats that I haven't noticed so far. The code seems to work but I'm still wondering about potential memory leaks.

The usage of the RAII code would be as follows:

CudaArray<float> device_data(SIZE);
// Use `device_data` as if it were a raw pointer.
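
For illustration, a full round trip might look like this (a sketch assuming the CudaArray class defined further down; error checking omitted):

#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

void copy_roundtrip(std::size_t size) {
    std::vector<float> host_data(size, 1.0f);
    CudaArray<float> device_data(size);

    // The implicit conversion to T* lets the wrapper stand in
    // for a raw device pointer in the C API.
    cudaMemcpy(device_data, host_data.data(), size * sizeof(float),
               cudaMemcpyHostToDevice);
    // ... launch kernels on device_data ...
    cudaMemcpy(host_data.data(), device_data, size * sizeof(float),
               cudaMemcpyDeviceToHost);
}   // ~CudaArray frees the device memory here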

Perhaps a class is overkill in this context (especially since you'd still have to use cudaMemcpy, the class only encapsulating RAII), so the other approach would be placement new:

float* device_data = new (cudaDevice) float[SIZE];
// Use `device_data` …
operator delete [](device_data, cudaDevice);

Here, cudaDevice merely acts as a tag to trigger the overload. However, since in normal placement new this would indicate the placement, I find the syntax oddly consistent and perhaps even preferable to using a class.

I'd appreciate criticism of every kind. Does somebody perhaps know if something in this direction is planned for the next version of CUDA (which, as I've heard, will improve its C++ support, whatever they mean by that)?

So, my question is actually threefold:

  1. Is my placement new overload semantically correct? Does it leak memory?
  2. Does anybody have information about future CUDA developments that go in this general direction (let's face it: C interfaces in C++ s*ck)?
  3. How can I take this further in a consistent manner (there are other APIs to consider, e.g. there's not only device memory but also a constant memory store and texture memory)?

#include <cstddef>        // std::size_t
#include <cuda_runtime.h> // cudaMalloc, cudaFree

// Singleton tag for CUDA device memory placement.
struct CudaDevice {
    static CudaDevice const& get() { return instance; }
private:
    static CudaDevice const instance;
    CudaDevice() { }
    CudaDevice(CudaDevice const&);
    CudaDevice& operator =(CudaDevice const&);
} const& cudaDevice = CudaDevice::get();

CudaDevice const CudaDevice::instance;

// Tag-dispatched placement new[]: allocates raw device memory.
// (Note: the cudaMalloc result is not checked here.)
inline void* operator new [](std::size_t nbytes, CudaDevice const&) {
    void* ret;
    cudaMalloc(&ret, nbytes);
    return ret;
}

// Matching placement delete[]: frees the device memory.
inline void operator delete [](void* p, CudaDevice const&) throw() {
    cudaFree(p);
}

template <typename T>
class CudaArray {
public:
    explicit
    CudaArray(std::size_t size) : size(size), data(new (cudaDevice) T[size]) { }

    operator T* () { return data; }

    ~CudaArray() {
        operator delete [](data, cudaDevice);
    }

private:
    std::size_t const size;
    T* const data;

    CudaArray(CudaArray const&);
    CudaArray& operator =(CudaArray const&);
};

About the singleton employed here: yes, I'm aware of its drawbacks. However, these aren't relevant in this context. All I needed here was a small type tag that wasn't copyable. Everything else (i.e. multithreading considerations, time of initialization) doesn't apply.

In the meantime there were some further developments (not so much in terms of the CUDA API, but at least in terms of projects attempting an STL-like approach to CUDA data management).

Most notably, there is a project from NVIDIA Research: Thrust.

I would go with the placement new approach. Then I would define a class that conforms to the std::allocator<> interface. In theory, you could pass this class as a template parameter into std::vector<> and std::map<> and so forth.
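
A minimal sketch of such an allocator, assuming the C++11 minimal allocator requirements (the name cuda_allocator and the error handling are my own; this only manages the memory, it does not make element access from the host safe):

#include <cstddef>
#include <new>            // std::bad_alloc
#include <cuda_runtime.h>

template <typename T>
struct cuda_allocator {
    using value_type = T;

    cuda_allocator() = default;
    template <typename U>
    cuda_allocator(cuda_allocator<U> const&) { }

    // Allocate uninitialized device storage for n objects of type T.
    T* allocate(std::size_t n) {
        void* p = nullptr;
        if (cudaMalloc(&p, n * sizeof(T)) != cudaSuccess)
            throw std::bad_alloc();
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t) { cudaFree(p); }
};

// All instances manage the same device heap, hence always interchangeable.
template <typename T, typename U>
bool operator ==(cuda_allocator<T> const&, cuda_allocator<U> const&) { return true; }
template <typename T, typename U>
bool operator !=(cuda_allocator<T> const&, cuda_allocator<U> const&) { return false; }

Note that a std::vector<float, cuda_allocator<float>> would still construct and access its elements through the returned pointer from the host side, which is exactly where the difficulty mentioned below comes from.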

Beware, I have heard that doing such things is fraught with difficulty, but at least you will learn a lot more about the STL this way. And you do not need to re-invent your containers and algorithms.

There are several projects that attempt something similar, for example CUDPP.

In the meantime, however, I've implemented my own allocator and it works well and was straightforward (> 95% boilerplate code).

Does anybody have information about future CUDA developments that go in this general direction (let's face it: C interfaces in C++ s*ck)?

Yes, I've done something like that:

https://github.com/eyalroz/cuda-api-wrappers/

nVIDIA's Runtime API for CUDA is intended for use both in C and C++ code. As such, it uses a C-style API, the lowest common denominator (with a few notable exceptions of templated function overloads).

This library of wrappers around the Runtime API is intended to allow us to embrace many of the features of C++ (including some C++11) for using the runtime API - but without reducing expressivity or increasing the level of abstraction (as in, e.g., the Thrust library). Using cuda-api-wrappers, you still have your devices, streams, events and so on - but they will be more convenient to work with in more C++-idiomatic ways.
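
To give a feel for the flavor, a small sketch; the names are taken from the project's README at the time of writing and should be treated as an assumption rather than a stable API:

#include <cuda/api_wrappers.hpp> // header name assumed; newer versions use <cuda/api.hpp>
#include <cstddef>
#include <vector>

void example(std::size_t size) {
    auto device = cuda::device::current::get();
    // RAII ownership of device memory, analogous to std::unique_ptr.
    auto d_data = cuda::memory::device::make_unique<float[]>(device, size);
    std::vector<float> h_data(size, 1.0f);
    cuda::memory::copy(d_data.get(), h_data.data(), size * sizeof(float));
}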
