
What's the best way of encapsulating CUDA kernels?

I'm trying to structure a CUDA project as close to an OO design as possible. At the moment, the solution I've found is to use a struct to encapsulate the data; for each method that needs some GPU processing, three functions have to be implemented:

  1. The method that will be called by the object.
  2. A __global__ function that will call a __device__ method of that struct.
  3. A __device__ method inside the struct.

I'll give you an example. Let's say I need to implement a method to initialize a buffer inside a struct. It would look something like this:

struct Foo;
__global__ void initFooKernel(Foo *foo);

struct Foo
{
   float *buffer_;
   short2 buffer_resolution_;
   short2 block_size_;
   __device__ void initBuffer()
   {
      int x = blockIdx.x * blockDim.x + threadIdx.x;
      int y = blockIdx.y * blockDim.y + threadIdx.y;
      int plain_index = (y * buffer_resolution_.x) + x;
      if(plain_index < buffer_resolution_.x * buffer_resolution_.y)
         buffer_[plain_index] = 0;
   }
   void init(const short2 &buffer_resolution, const short2 &block_size)
   {
       buffer_resolution_ = buffer_resolution;
       block_size_ = block_size;
       //EDIT1 - Added the cudaMalloc
       cudaMalloc((void **)&buffer_, buffer_resolution.x * buffer_resolution.y * sizeof(float));
       dim3 threadsPerBlock(block_size.x, block_size.y);
       dim3 blocksPerGrid(buffer_resolution.x / threadsPerBlock.x, buffer_resolution.y / threadsPerBlock.y);
       initFooKernel<<<blocksPerGrid, threadsPerBlock>>>(this);
   }
};

__global__ void initFooKernel(Foo *foo)
{
   foo->initBuffer();
}

I need to do that because it looks like I can't declare a __global__ function inside the struct. I learned this approach by looking at some open-source projects, but having to implement THREE functions for every encapsulated GPU method looks very troublesome. So, my question is: is this the best/only approach possible? Is it even a VALID approach?

EDIT1: I forgot to call cudaMalloc to allocate the buffer before calling initFooKernel. Fixed it.

Is the goal to make classes that use CUDA while looking like normal classes from the outside?

If so, to expand on what O'Conbhui was saying, you can just create C-style calls for the CUDA functionality and then create a class that wraps those calls.

So, in a .cu file, you would put definitions for texture references, kernels, C-style functions that call the kernels, and C-style functions that allocate and free GPU memory. In your example, this would include a function that calls a kernel to initialize GPU memory.
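For instance, the .cu file might look like the following sketch. The file name and the function names (initBufferKernel, fooAllocBuffer, and so on) are illustrative choices, not something from the question:

```cpp
// foo_kernels.cu -- compiled by nvcc; exposes plain C-style entry points.
#include <cuda_runtime.h>

// Kernel that zeroes a float buffer, one element per thread.
__global__ void initBufferKernel(float *buffer, int size)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size)
        buffer[idx] = 0.0f;
}

// C-style wrappers callable from ordinary .cpp translation units.
extern "C" void fooAllocBuffer(float **buffer, int size)
{
    cudaMalloc((void **)buffer, size * sizeof(float));
}

extern "C" void fooInitBuffer(float *buffer, int size)
{
    int threadsPerBlock = 256;
    int blocksPerGrid = (size + threadsPerBlock - 1) / threadsPerBlock;
    initBufferKernel<<<blocksPerGrid, threadsPerBlock>>>(buffer, size);
}

extern "C" void fooFreeBuffer(float *buffer)
{
    cudaFree(buffer);
}
```

The extern "C" linkage keeps the symbol names unmangled, so the .cpp side only needs matching declarations rather than any CUDA headers.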

Then, in a corresponding .cpp file, you import a header with declarations for the functions in the .cu file, and you define your class. In the constructor, you call the .cu functions that allocate CUDA memory and set up other CUDA resources such as textures, including your own memory-initialization function. In the destructor, you call the functions that free the CUDA resources. In your member functions, you call the functions that call kernels.
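A minimal sketch of the wrapping class, assuming the C-style functions from a hypothetical foo_kernels.cu above (the names are illustrative):

```cpp
// foo.cpp -- ordinary C++, no CUDA headers needed; links against foo_kernels.cu.
extern "C" void fooAllocBuffer(float **buffer, int size);
extern "C" void fooInitBuffer(float *buffer, int size);
extern "C" void fooFreeBuffer(float *buffer);

class Foo
{
public:
    explicit Foo(int size) : size_(size), buffer_(nullptr)
    {
        fooAllocBuffer(&buffer_, size_);  // constructor acquires the GPU buffer
        fooInitBuffer(buffer_, size_);    // and runs the initialization kernel
    }
    ~Foo()
    {
        fooFreeBuffer(buffer_);           // destructor releases GPU resources (RAII)
    }
    Foo(const Foo &) = delete;            // buffer_ owns device memory; forbid copies
    Foo &operator=(const Foo &) = delete;

private:
    int size_;
    float *buffer_;  // device pointer, only ever passed back to the .cu functions
};
```

From the outside, Foo behaves like any other C++ class: construction and destruction manage the GPU resources, and callers never see CUDA at all.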
