
CUDA cudaMemcpy Struct of Arrays

I'd like to clean up the parameters of CUDA kernels in my project.


Now, a kernel needs 3 uint32_t arrays, which leads to pretty ugly code (id is the global thread id and valX is some arbitrary value):

__global__ void some_kernel(uint32_t *arr1, uint32_t *arr2, uint32_t *arr3)
{
    arr1[id] = val1;
    arr2[id] = val2;
    arr3[id] = val3;
}

I'd like to wrap all those arrays in a struct:

typedef struct S {uint32_t *arr1; uint32_t *arr2; uint32_t *arr3; uint32_t size;} S;

where size denotes the length of every arrX inside the struct.

What I would like to have is something like:

__global__ void some_kernel(S *s)
{
    s->arr1[id] = val1;
    s->arr2[id] = val2;
    s->arr3[id] = val3;
}

What would a corresponding cudaMalloc and cudaMemcpy look like for a struct like this? Are there any performance drawbacks to this that I'm not seeing yet?

Thanks in advance!

You have at least two options. One excellent choice was already given by talonmies, but I'll introduce you to the "learn it the hard way" approach.
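For reference, here is a minimal sketch of that other option, passing the struct by value as a kernel argument (my reconstruction of the referenced approach, not a quote of it; since the struct holds device pointers, no cudaMalloc is needed for the struct itself):

__global__ void some_kernel(S s) // struct is copied into kernel argument space
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < s.size)
    {
        s.arr1[id] = 1;
        s.arr2[id] = 2;
        s.arr3[id] = 3;
    }
}

// On the host, after allocating dev_arr1/dev_arr2/dev_arr3 with cudaMalloc:
// S host_s = { dev_arr1, dev_arr2, dev_arr3, size };
// some_kernel<<<(size + 255) / 256, 256>>>(host_s);

Now for the hard way.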

First, your struct definition:

typedef struct S {
    uint32_t *arr1;
    uint32_t *arr2;
    uint32_t *arr3; 
    uint32_t size;
} S;

...and the kernel definition (with a global variable for the size, but you don't need to follow that pattern):

const int size = 10000;

__global__ void some_kernel(S *s)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < size)
    {
        s->arr1[id] = 1; // val1
        s->arr2[id] = 2; // val2
        s->arr3[id] = 3; // val3
    }
}

Notice that the if statement protects you from running out of bounds.

Next, we write a function that prepares the data, executes the kernel, and prints some results. Part one is data allocation:

uint32_t *host_arr1, *host_arr2, *host_arr3;
uint32_t *dev_arr1, *dev_arr2, *dev_arr3;

// Allocate and fill host data
host_arr1 = new uint32_t[size]();
host_arr2 = new uint32_t[size]();
host_arr3 = new uint32_t[size]();

// Allocate device data   
cudaMalloc((void **) &dev_arr1, size * sizeof(*dev_arr1));
cudaMalloc((void **) &dev_arr2, size * sizeof(*dev_arr2));
cudaMalloc((void **) &dev_arr3, size * sizeof(*dev_arr3));

// Allocate helper struct on the device
S *dev_s;
cudaMalloc((void **) &dev_s, sizeof(*dev_s));

Nothing special here: you just allocate three arrays and the struct. What looks more interesting is how to handle copying such data to the device:

// Copy data from host to device
cudaMemcpy(dev_arr1, host_arr1, size * sizeof(*dev_arr1), cudaMemcpyHostToDevice);
cudaMemcpy(dev_arr2, host_arr2, size * sizeof(*dev_arr2), cudaMemcpyHostToDevice);
cudaMemcpy(dev_arr3, host_arr3, size * sizeof(*dev_arr3), cudaMemcpyHostToDevice);

// NOTE: Binding pointers with dev_s
cudaMemcpy(&(dev_s->arr1), &dev_arr1, sizeof(dev_s->arr1), cudaMemcpyHostToDevice);
cudaMemcpy(&(dev_s->arr2), &dev_arr2, sizeof(dev_s->arr2), cudaMemcpyHostToDevice);
cudaMemcpy(&(dev_s->arr3), &dev_arr3, sizeof(dev_s->arr3), cudaMemcpyHostToDevice);

Besides the ordinary array copies you noticed, it's also necessary to "bind" the arrays to the struct. For that, you need to pass the address of each pointer field. As a result, only these pointers are copied.
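An equivalent alternative (my addition, not part of the original answer) is to fill a host-side copy of the struct with the device pointers and ship the whole thing in a single cudaMemcpy:

S host_s;
host_s.arr1 = dev_arr1; // device pointers are just plain values on the host
host_s.arr2 = dev_arr2;
host_s.arr3 = dev_arr3;
host_s.size = size;
cudaMemcpy(dev_s, &host_s, sizeof(host_s), cudaMemcpyHostToDevice);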

Next come the kernel call, copying the data back to the host, and printing the results:

// Call kernel
some_kernel<<<(size + 255) / 256, 256>>>(dev_s); // round the grid size up so all `size` elements are covered

// Copy result to host:
cudaMemcpy(host_arr1, dev_arr1, size * sizeof(*host_arr1), cudaMemcpyDeviceToHost);
cudaMemcpy(host_arr2, dev_arr2, size * sizeof(*host_arr2), cudaMemcpyDeviceToHost);
cudaMemcpy(host_arr3, dev_arr3, size * sizeof(*host_arr3), cudaMemcpyDeviceToHost);

// Print some result
std::cout << host_arr1[size-1] << std::endl;
std::cout << host_arr2[size-1] << std::endl;
std::cout << host_arr3[size-1] << std::endl;
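For completeness, a cleanup step matching the allocations above (my addition) would be:

delete[] host_arr1;
delete[] host_arr2;
delete[] host_arr3;
cudaFree(dev_arr1);
cudaFree(dev_arr2);
cudaFree(dev_arr3);
cudaFree(dev_s);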

Keep in mind that in any serious code you should always check for errors from CUDA API calls.
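One common pattern for that is a small checking macro; the sketch below is my own (the name CUDA_CHECK is an assumption, not part of the CUDA API):

#include <cstdio>
#include <cstdlib>

// Wrap every CUDA API call; abort with a readable message on failure.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
// CUDA_CHECK(cudaMemcpy(dev_arr1, host_arr1, size * sizeof(*dev_arr1),
//                       cudaMemcpyHostToDevice));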
