
When I invoke an asynchronous CUDA kernel, how are its arguments copied?

Say I want to invoke a CUDA kernel, like this:

struct foo { int a; int b; float c; double d; };
foo arg;
// fill in elements of `arg` here
my_kernel<<<grid_size, block_size, 0, stream>>>(arg);

Assume that stream was previously created using a call to cudaStreamCreate(), so the above will execute asynchronously. I'm concerned about the required lifetime of arg.

Are the arguments to the kernel copied synchronously when I invoke it (so it would be safe for arg to go out of scope immediately), or are they copied asynchronously (so I need to ensure that it stays alive until the kernel runs)?

Arguments are copied synchronously at launch. The API exposes a call stack onto which execution parameters and function arguments are pushed in order; a call then finalises those arguments into a CUDA kernel launch on the driver's internal streams/command queues.
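To illustrate the practical consequence, here is a minimal sketch (reusing foo, my_kernel, grid_size, block_size and stream from the question): because the argument value is captured at launch, the host copy may die immediately afterwards.

{
    foo arg;
    // fill in elements of `arg` here
    my_kernel<<<grid_size, block_size, 0, stream>>>(arg);
}   // `arg` goes out of scope here; the launch already holds a copy of it

// The kernel may still be queued or running up to this point
cudaStreamSynchronize(stream);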

This process isn't documented, but as of CUDA 7.5, a runtime API kernel launch like this:

dot_product<<<1,n>>>(n, d_a, d_b);

becomes this:

(cudaConfigureCall(1, n)) ? (void)0 : (dot_product)(n, d_a, d_b);

where the host stub function dot_product is expanded into this:

void __device_stub__Z11dot_productiPfS_(int __par0, float *__par1, float *__par2)
{
    // Each argument value is copied onto the launch call stack at its byte offset
    if (cudaSetupArgument((void *)(char *)&__par0, sizeof(__par0), (size_t)0UL) != cudaSuccess) return;
    if (cudaSetupArgument((void *)(char *)&__par1, sizeof(__par1), (size_t)8UL) != cudaSuccess) return;
    if (cudaSetupArgument((void *)(char *)&__par2, sizeof(__par2), (size_t)16UL) != cudaSuccess) return;
    {
        volatile static char *__f __attribute__((unused));
        __f = ((char *)((void ( *)(int, float *, float *))dot_product));
        // The launch itself consumes the arguments staged above
        (void)cudaLaunch(((char *)((void ( *)(int, float *, float *))dot_product)));
    };
}

void dot_product(int __cuda_0, float *__cuda_1, float *__cuda_2)
{
    __device_stub__Z11dot_productiPfS_(__cuda_0, __cuda_1, __cuda_2);
}

cudaSetupArgument is the API call which pushes arguments onto the call stack. Interestingly, it is actually deprecated in the API documentation for CUDA 7.5, even though the compiler is still emitting it. I would, therefore, expect this to change in the future, but the idea will be the same.
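For what it's worth, newer toolkits route the triple-chevron syntax through cudaLaunchKernel instead, which takes an array of pointers to the argument values and still copies those values when the launch is queued. A rough sketch of what an explicit launch of the same kernel looks like (this is an illustration, not the compiler's actual output):

void launch_dot_product(int n, float *d_a, float *d_b)
{
    // One pointer per kernel parameter, in declaration order
    void *args[] = { (void *)&n, (void *)&d_a, (void *)&d_b };
    // The pointed-to values are copied when the launch is queued,
    // so the locals may go out of scope immediately afterwards
    cudaLaunchKernel((const void *)dot_product, dim3(1), dim3(n), args, 0, 0);
}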

The parameters of the kernel call are copied prior to execution, so their scope should be of no concern. But please note that the combined size of all kernel parameters cannot exceed a fixed maximum (4,096 bytes at the time of writing). If you want larger structs or blobs of data, you need to allocate the required memory on the device using cudaMalloc, copy the contents of the host struct to the device struct using cudaMemcpy, and call the kernel with a pointer to the new device struct.

Your code would look something like this:

struct foo { int a; int b; float c; double d; };
foo arg;
foo *arg_d;
// fill in elements of `arg` here

cudaMalloc(&arg_d, sizeof(foo));
// check the allocation here
cudaMemcpy(arg_d, &arg, sizeof(foo), cudaMemcpyHostToDevice);
my_kernel<<<grid_size, block_size, 0, stream>>>(arg_d);
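One caveat with this approach: because the launch is asynchronous, arg_d itself must stay allocated until the kernel has finished reading it. A sketch of the corresponding cleanup:

cudaStreamSynchronize(stream); // wait for my_kernel to finish with arg_d
cudaFree(arg_d);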
