
Can I use vector for both device and host class in CUDA

I am writing a C++ CUDA program. I have a very simple struct:

struct A
{
    int size;
    float* tab;
};

and a kernel:

__global__ void Kernel(A* res, int n, ArgType* args) // ArgType: the argument type, elided in the original
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
    {
        res[i] = AGenerator::Generate(args[i]);
    }
}

Where AGenerator::Generate creates the A object and fills the tab array. What happens here is that when the results are sent back to the host, the tab pointer is invalid. To prevent this I would need to apply the Rule of Three to this class. Since there would be many classes like this, I would like to avoid writing too much additional code.
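For reference, the Rule-of-Three boilerplate being avoided here would look roughly like the sketch below on the host side (the deep-copy semantics are an assumption, since only the struct layout is shown above); multiplying this across many similar classes is exactly the cost in question.

#include <cstring>

// Hypothetical host-side Rule of Three for A: every copy owns its own tab
// storage, so copies stay valid after the source is destroyed.
struct A
{
    int size;
    float* tab;

    explicit A(int n) : size(n), tab(new float[n]) {}
    A(const A& other) : size(other.size), tab(new float[other.size])
    {
        std::memcpy(tab, other.tab, size * sizeof(float)); // deep copy
    }
    A& operator=(const A& other)
    {
        if (this != &other)
        {
            float* fresh = new float[other.size];
            std::memcpy(fresh, other.tab, other.size * sizeof(float));
            delete[] tab;
            tab = fresh;
            size = other.size;
        }
        return *this;
    }
    ~A() { delete[] tab; }
};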

I did some research and found the Thrust library, which has device_vector and host_vector and would probably help with my problem, but the thing is that I want the struct A and similar structs to be callable from both host and device, so device_vector and host_vector are not good for this purpose. Is there any struct I can use to approach this?
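For context on that limitation: a thrust::device_vector is a host-side handle to device storage and cannot be used inside device code; kernels receive a raw pointer instead. A minimal sketch of that usual pattern (the kernel and names are mine):

#include <thrust/device_vector.h>

__global__ void scale(float* data, int n, float factor)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    thrust::device_vector<float> d_vec(1024, 1.0f); // host-side handle to device storage
    // a device_vector cannot be passed to a kernel; extract the raw device pointer:
    float* raw = thrust::raw_pointer_cast(d_vec.data());
    scale<<<(1024 + 255) / 256, 256>>>(raw, 1024, 2.0f);
    cudaDeviceSynchronize();
    return 0;
}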

EDIT: I found that passing the struct by value will help me, but since performance is quite important it doesn't seem like a good solution.

Here is a rough outline of what I had in mind for a custom allocator and pool that would hide some of the mechanics of using a class both on the host and the device.

I don't consider it to be a paragon of programming excellence. It is merely intended to be a rough outline of the steps that I think would be involved. I'm sure there are many bugs. I didn't include it, but I think you would want a public method that would get the size as well.

#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

typedef float mytype;

// simple device-side bump allocator: base pointer, capacity, and cursor
__device__ unsigned int pool_allocated = 0;
__device__ unsigned int pool_size = 0;
__device__ mytype *pool = 0;

// reserve "size" elements from the pool, returning the starting element offset
__device__ unsigned int pool_reserve(size_t size){
  assert((pool_allocated+size) < pool_size);
  unsigned int offset = atomicAdd(&pool_allocated, size);
  assert (offset + size <= pool_size);
  return offset;
}

// allocate the device pool and publish its pointer and size to the device globals
__host__ void init_pool(size_t psize){
  mytype *temp;
  unsigned int my_size = psize;
  cudaMalloc((void **)&temp, psize*sizeof(mytype));
  cudaCheckErrors("init pool cudaMalloc fail");
  cudaMemcpyToSymbol(pool, &temp, sizeof(mytype *));
  cudaCheckErrors("init pool cudaMemcpyToSymbol 1 fail");
  cudaMemcpyToSymbol(pool_size, &my_size, sizeof(unsigned int));
  cudaCheckErrors("init pool cudaMemcpyToSymbol 2 fail");
}


class A{
  public:
  mytype *data; // side-local view: aliases h_data on the host, d_data on the device
  // copy "data" to the device: carve space from the pool when called on the
  // device, or cudaMalloc a buffer when called on the host; sets d_data
  __host__ __device__ void pool_allocate_and_copy() {
  assert(d_data == 0);
  assert(size != 0);
#ifdef __CUDA_ARCH__
  unsigned int offset = pool_reserve(size);
  d_data = pool + offset;
  memcpy(d_data, data, size*sizeof(mytype));
#else
  cudaMalloc((void **)&d_data, size*sizeof(mytype));
  cudaCheckErrors("pool_allocate_and_copy cudaMalloc fail");
  cudaMemcpy(d_data, data, size*sizeof(mytype), cudaMemcpyHostToDevice);
  cudaCheckErrors("pool_allocate_and_copy cudaMemcpy fail");
#endif /* __CUDA_ARCH__ */

  }
  // make "data" point at valid memory on the calling side; on the host this
  // also copies the payload back from the device buffer
  __host__ __device__ void update(){
#ifdef __CUDA_ARCH__
  assert(d_data != 0);
  data = d_data;
#else
  if (h_data == 0) h_data = (mytype *)malloc(size*sizeof(mytype));
  data = h_data;
  assert(data != 0);
  cudaMemcpy(data, d_data, size*sizeof(mytype), cudaMemcpyDeviceToHost);
  cudaCheckErrors("update cudaMemcpy fail");
#endif
  }
  // allocate backing store for "data" on the calling side's heap
  __host__ __device__ void allocate(size_t asize) {
    assert(data == 0);
    data = (mytype *)malloc(asize*sizeof(mytype));
    assert(data != 0);
#ifndef __CUDA_ARCH__
    h_data = data;
#endif
    size = asize;
  }
  // shallow-copy another A into *this (crossing the host/device boundary when
  // called on the host), then fix up the data pointer for this side
  __host__ __device__ void copyobj(A *obj){
    assert(obj != 0);
#ifdef __CUDA_ARCH__
    memcpy(this, obj, sizeof(A));
#else
    cudaMemcpy(this, obj, sizeof(A), cudaMemcpyDefault);
    cudaCheckErrors("copyobj cudaMemcpy fail");
#endif
    this->update();
  }
  __host__ __device__ A();
    private:
    unsigned int size;
    mytype *d_data; // device-side copy of the payload
    mytype *h_data; // host-side copy of the payload
};

__host__ __device__ A::A(){
  data = 0;
  d_data = 0;
  h_data = 0;
  size = 0;
}

// demo kernel: receive an object by value, build a second object on the
// device, stage its payload in the pool, and return it through *res
__global__ void mykernel(A obj, A *res){
  A mylocal;
  mylocal.copyobj(&obj);
  A mylocal2;
  mylocal2.allocate(24);
  mylocal2.data[0]=45;
  mylocal2.pool_allocate_and_copy();
  res->copyobj(&mylocal2);
  printf("kernel data %f\n", mylocal.data[0]);
}

int main(){
  A my_obj;
  A *d_result, h_result;
  my_obj.allocate(32);
  my_obj.data[0] = 12;
  init_pool(1048576); // device pool of 1M mytype elements
  my_obj.pool_allocate_and_copy();
  cudaMalloc((void **)&d_result, sizeof(A));
  cudaCheckErrors("main cudaMalloc fail");
  mykernel<<<1,1>>>(my_obj, d_result);
  cudaDeviceSynchronize();
  cudaCheckErrors("kernel fail");
  h_result.copyobj(d_result);
  printf("host data %f\n", h_result.data[0]);

  return 0;
}
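One usage note on the sketch above: it relies on in-kernel printf and in-kernel malloc, which require a device of compute capability 2.0 or higher, so compile for an appropriate architecture (e.g. nvcc -arch=sm_20 on the toolkits of that era). The in-kernel malloc in allocate() draws from the device heap, whose default size is fairly small; it can be raised with cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...) if device-side allocations grow.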

I am pretty sure that the direction of the question and the related comments is ill-fated. Device memory and host memory are totally different things, both conceptually and physically. Pointers just don't carry over!
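To make that concrete: a struct copied back from the device still holds a device address in its tab member, so fetching an A takes two copies and a pointer fix-up. A sketch with assumed names:

#include <stdlib.h>

struct A { int size; float* tab; };

// d_a points to an A in device memory whose tab is a device address. The
// struct copy is shallow, so the payload needs its own copy, and tab must be
// repointed at host storage before it can be dereferenced on the host.
void fetch_to_host(const A* d_a, A* h_a)
{
    cudaMemcpy(h_a, d_a, sizeof(A), cudaMemcpyDeviceToHost); // tab is still a device address here
    float* h_tab = (float*)malloc(h_a->size * sizeof(float));
    cudaMemcpy(h_tab, h_a->tab, h_a->size * sizeof(float), cudaMemcpyDeviceToHost);
    h_a->tab = h_tab; // now valid on the host
}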

Please go back to step 1 and learn about copying values between host and device by reading the reference manual and the programming guide for more details.

To get a more precise answer to your question, please show how those A structs are allocated on the device, including the allocation of those tab floats. Also please show how AGenerator::Generate manipulates those tabs in a meaningful way. My best bet is that you are working with unallocated device memory here, and that you should probably use a preallocated array of floats and indices into that array instead of device pointers. Those indices would then carry over to the host gracefully.
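A minimal sketch of that index-based layout (all names are mine): each A stores an offset into one preallocated float pool instead of a pointer, and the same offset stays meaningful on both sides once the pool is copied in bulk.

#include <stdio.h>

struct A
{
    int size;
    int offset; // element index into the shared pool; valid on host and device alike
    __host__ __device__ float* tab(float* pool) const { return pool + offset; }
};

__global__ void fill(A* objs, int n, float* pool)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        objs[i].tab(pool)[0] = 10.0f * i; // write through the offset, not a raw pointer
}

int main()
{
    const int n = 4, per_obj = 8;
    A h_objs[n];
    for (int i = 0; i < n; i++) { h_objs[i].size = per_obj; h_objs[i].offset = i * per_obj; }

    A* d_objs;  float* d_pool;
    cudaMalloc(&d_objs, n * sizeof(A));
    cudaMalloc(&d_pool, n * per_obj * sizeof(float));
    cudaMemcpy(d_objs, h_objs, n * sizeof(A), cudaMemcpyHostToDevice);

    fill<<<1, n>>>(d_objs, n, d_pool);

    // one bulk copy of the pool; every offset in h_objs is still meaningful
    float h_pool[n * per_obj];
    cudaMemcpy(h_pool, d_pool, sizeof(h_pool), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++)
        printf("obj %d: %f\n", i, h_objs[i].tab(h_pool)[0]);
    return 0;
}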
