
Use data allocated dynamically in CUDA kernel on host

I am trying to build a container class on the device which manages some memory. This memory is allocated dynamically and filled during object construction in the kernel. According to the documentation, that can be done with a simple new[] in the kernel (using CUDA 8.0 with compute capability 5.0 in Visual Studio 2012). Afterwards I want to access the data inside the containers in host code (e.g. to test whether all values are correct).

A minimal version of the DeviceContainer class looks like this:

class DeviceContainer 
{
public:
   __device__ DeviceContainer(unsigned int size);
   __host__ __device__ ~DeviceContainer();

   __host__ __device__ DeviceContainer(const DeviceContainer & other);
   __host__ __device__ DeviceContainer & operator=(const DeviceContainer & other);

   __host__ __device__ unsigned int getSize() const { return m_sizeData; }
   __device__ int * getDataDevice() const { return mp_dev_data; }
   __host__ int* getDataHost() const;

private:
   int * mp_dev_data;
   unsigned int m_sizeData;
};


__device__ DeviceContainer::DeviceContainer(unsigned int size) :
      m_sizeData(size), mp_dev_data(nullptr) 
{
   mp_dev_data = new int[m_sizeData];

   for(unsigned int i = 0; i < m_sizeData; ++i) {
      mp_dev_data[i] = i;
   }
}


__host__ __device__ DeviceContainer::DeviceContainer(const DeviceContainer & other) : 
  m_sizeData(other.m_sizeData)
{
#ifndef __CUDA_ARCH__
   cudaSafeCall( cudaMalloc((void**)&mp_dev_data, m_sizeData * sizeof(int)) );
   cudaSafeCall( cudaMemcpy(mp_dev_data, other.mp_dev_data, m_sizeData * sizeof(int), cudaMemcpyDeviceToDevice) );
#else
   mp_dev_data = new int[m_sizeData];
   memcpy(mp_dev_data, other.mp_dev_data, m_sizeData * sizeof(int));
#endif
}


__host__ __device__ DeviceContainer::~DeviceContainer()
{
#ifndef __CUDA_ARCH__
   cudaSafeCall( cudaFree(mp_dev_data) );
#else
   delete[] mp_dev_data;
#endif
   mp_dev_data = nullptr;
}


__host__ __device__ DeviceContainer & DeviceContainer::operator=(const DeviceContainer & other)
{
   m_sizeData = other.m_sizeData;

#ifndef __CUDA_ARCH__
   cudaSafeCall( cudaMalloc((void**)&mp_dev_data, m_sizeData * sizeof(int)) );
   cudaSafeCall( cudaMemcpy(mp_dev_data, other.mp_dev_data, m_sizeData * sizeof(int), cudaMemcpyDeviceToDevice) );
#else
   mp_dev_data = new int[m_sizeData];
   memcpy(mp_dev_data, other.mp_dev_data, m_sizeData * sizeof(int));
#endif

   return *this;
}


__host__ int* DeviceContainer::getDataHost() const
{
   int * pDataHost = new int[m_sizeData];
   cudaSafeCall( cudaMemcpy(pDataHost, mp_dev_data, m_sizeData * sizeof(int), cudaMemcpyDeviceToHost) );
   return pDataHost;
}

It just manages the array mp_dev_data. The array is created and filled with consecutive values during construction, which should only be possible on the device. (Note that in reality the sizes of the containers might differ from each other.)

I think I need to provide a copy constructor and an assignment operator since I don't know any other way to fill the array in the kernel. (See question No. 3 below.) Since copying and deletion can also happen on the host, __CUDA_ARCH__ is used to determine which execution path we're compiling for. On the host cudaMemcpy and cudaFree are used; on the device we can just use memcpy and delete[].

The kernel for object creation is rather simple:

__global__ void createContainer(DeviceContainer * pContainer, unsigned int numContainer, unsigned int containerSize)
{
   unsigned int offset = blockIdx.x * blockDim.x + threadIdx.x;

   if(offset < numContainer)
   {
      pContainer[offset] = DeviceContainer(containerSize);
   }
}

Each in-range thread of a one-dimensional grid creates a single container object.

The main function then allocates arrays for the containers (90000 in this case) on the device and host, calls the kernel and attempts to use the objects:

int main()
{
   const unsigned int numContainer = 90000;
   const unsigned int containerSize = 5;

   DeviceContainer * pDevContainer;
   cudaSafeCall( cudaMalloc((void**)&pDevContainer, numContainer * sizeof(DeviceContainer)) );

   dim3 blockSize(1024, 1, 1);
   dim3 gridSize((numContainer + blockSize.x - 1)/blockSize.x , 1, 1);

   createContainer<<<gridSize, blockSize>>>(pDevContainer, numContainer, containerSize);
   cudaCheckError();

   DeviceContainer * pHostContainer = (DeviceContainer *)malloc(numContainer * sizeof(DeviceContainer)); 
   cudaSafeCall( cudaMemcpy(pHostContainer, pDevContainer, numContainer * sizeof(DeviceContainer), cudaMemcpyDeviceToHost) );

   for(unsigned int i = 0; i < numContainer; ++i)
   {
      const DeviceContainer & dc = pHostContainer[i];

      int * pData = dc.getDataHost();
      for(unsigned int j = 0; j < dc.getSize(); ++j)
      {
         std::cout << pData[j];
      }
      std::cout << std::endl;
      delete[] pData;
   }

   free(pHostContainer);
   cudaSafeCall( cudaFree(pDevContainer) );
}

I have to use malloc for array creation on the host, since I don't want to have a default constructor for DeviceContainer. I try to access the data inside a container via getDataHost(), which internally just calls cudaMemcpy.

cudaSafeCall and cudaCheckError are simple macros that evaluate the cudaError returned by the function or actively poll the last error. For the sake of completeness:

#define cudaSafeCall(error) __cudaSafeCall(error, __FILE__, __LINE__)
#define cudaCheckError()    __cudaCheckError(__FILE__, __LINE__)

inline void __cudaSafeCall(cudaError error, const char *file, const int line)
{
   if (error != cudaSuccess)
   {
      std::cerr << "cudaSafeCall() returned:" << std::endl;
      std::cerr << "\tFile: " << file << ",\nLine: " << line << " - CudaError " << error << ":" << std::endl;
      std::cerr << "\t" << cudaGetErrorString(error) << std::endl;

      system("PAUSE");
      exit( -1 );
   }
}


inline void __cudaCheckError(const char *file, const int line)
{
   cudaError error = cudaDeviceSynchronize();
   if (error != cudaSuccess)
   {
      std::cerr << "cudaCheckError() returned:" << std::endl;
      std::cerr << "\tFile: " << file << ",\tLine: " << line << " - CudaError " << error << ":" << std::endl;
      std::cerr << "\t" << cudaGetErrorString(error) << std::endl;

      system("PAUSE");
      exit( -1 );
   }
}

I have 3 problems with this code:

  1. If it is executed as presented here, I receive an "unspecified launch failure" of the kernel. The Nsight debugger stops me on the line mp_dev_data = new int[m_sizeData]; (either in the constructor or the assignment operator) and reports several access violations on global memory. The number of violations appears to be random between 4 and 11, and they occur in non-consecutive threads but always near the upper end of the grid (blocks 85 and 86).

  2. If I reduce numContainer to 10, the kernel runs smoothly; however, the cudaMemcpy in getDataHost() fails with an invalid argument error - even though mp_dev_data is not 0. (I suspect that the assignment is faulty and the memory has already been deleted by another object.)

  3. Even though I would like to know how to correctly implement DeviceContainer with proper memory management, in my case it would also be sufficient to make it non-copyable and non-assignable. However, I don't know how to properly fill the container array in the kernel. Maybe something like

    DeviceContainer dc(5); memcpy(&pContainer[offset], &dc, sizeof(DeviceContainer));

    which would lead to problems with deleting mp_dev_data in the destructor. I would need to manage memory deletion manually, which feels rather dirty.
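As a sketch of an alternative (untested, just an idea): placement new would construct each object directly in the raw cudaMalloc'd storage, so neither a copy constructor nor an assignment operator is needed, and no temporary is ever destroyed. The destructors would then have to be invoked explicitly in a matching kernel before cudaFree:

```cuda
#include <new> // placement new

__global__ void createContainerInPlace(DeviceContainer * pContainer, unsigned int numContainer, unsigned int containerSize)
{
   unsigned int offset = blockIdx.x * blockDim.x + threadIdx.x;

   if(offset < numContainer)
   {
      // Construct directly in the raw storage; no temporary object,
      // no assignment, no double deletion of mp_dev_data.
      new (&pContainer[offset]) DeviceContainer(containerSize);
   }
}

__global__ void destroyContainer(DeviceContainer * pContainer, unsigned int numContainer)
{
   unsigned int offset = blockIdx.x * blockDim.x + threadIdx.x;

   if(offset < numContainer)
   {
      // Explicit destructor call, since the storage was never "newed"
      // as a whole and will be released with cudaFree afterwards.
      pContainer[offset].~DeviceContainer();
   }
}
```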

I also tried to use malloc and free in kernel code instead of new and delete, but the results were the same.

I am sorry that I wasn't able to frame my question in a shorter manner.

TL;DR: How do I implement a class that dynamically allocates memory in a kernel and can also be used in host code? How can I initialize an array in a kernel with objects that cannot be copied or assigned?

Any help is appreciated. Thank you.

Apparently the answer is: what I am trying to do is more or less impossible. Memory allocated with new or malloc in the kernel is not placed in ordinary runtime-allocated global memory, but on a separate device heap which is inaccessible from the host (e.g. via cudaMemcpy).

The only option to access all the memory on the host is to first allocate an array in global memory which is big enough to hold all elements on the heap, and then write a kernel that copies all elements from the heap to global memory.
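Such a copy kernel could look roughly like this (a sketch only, using the DeviceContainer accessors defined above and assuming all containers have the same size; for varying sizes, a prefix sum over the sizes would be needed to compute each container's output offset):

```cuda
// Flatten the heap-allocated per-container arrays into one cudaMalloc'd
// buffer pGlobalData of numContainer * containerSize ints.
__global__ void copyToGlobal(const DeviceContainer * pContainer, unsigned int numContainer, unsigned int containerSize, int * pGlobalData)
{
   unsigned int offset = blockIdx.x * blockDim.x + threadIdx.x;

   if(offset < numContainer)
   {
      const int * pHeapData = pContainer[offset].getDataDevice();
      for(unsigned int i = 0; i < containerSize; ++i)
      {
         pGlobalData[offset * containerSize + i] = pHeapData[i];
      }
   }
}
```

The buffer pGlobalData can afterwards be transferred to the host with an ordinary cudaMemcpy.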

The access violations are caused by the limited heap size, which can be changed with cudaDeviceSetLimit(cudaLimitMallocHeapSize, size_t size).
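For example, raising the limit to 128 MiB (an arbitrary generous value for this sketch; the actual requirement here is roughly numContainer * containerSize * sizeof(int) plus per-allocation overhead) would look like this. Note that the limit must be set before the first kernel using device-side new/malloc is launched:

```cuda
// Enlarge the device heap before any kernel launch that allocates on it.
size_t heapSize = size_t(128) * 1024 * 1024; // 128 MiB
cudaSafeCall( cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapSize) );

// Optionally query the limit back to verify it was accepted.
size_t currentSize = 0;
cudaSafeCall( cudaDeviceGetLimit(&currentSize, cudaLimitMallocHeapSize) );
```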


 