
Can I launch a cooperative kernel without passing an array of pointers?

The CUDA runtime API allows us to launch kernels using the triple-chevron syntax, which takes a variable number of arguments:

my_kernel<<<grid_dims, block_dims, shared_mem_size>>>(
    first_arg, second_arg, and_as_many, as_we, want_to, etc, etc);

but for "cooperative" kernels, the CUDA Programming Guide says (section C.3):

To enable grid synchronization, when launching the kernel it is necessary to use, instead of the <<<...>>> execution configuration syntax, the cudaLaunchCooperativeKernel CUDA runtime launch API:

 cudaLaunchCooperativeKernel( const T *func, dim3 gridDim, dim3 blockDim, void **args, size_t sharedMem = 0, cudaStream_t stream = 0 ) 

(or the CUDA driver equivalent).

I would rather not have to write my own wrapper code to build an array of pointers... is there really no facility in the runtime API to avoid that?

The answer is no.

Under the hood, the <<< >>> syntax gets expanded like this:

deviceReduceBlockKernel0<<<nblocks, 256>>>(input, scratch, N);

becomes:

(cudaConfigureCall(nblocks, 256)) ? (void)0 : deviceReduceBlockKernel0(input, scratch, N); 

and a boilerplate wrapper function gets emitted:

void deviceReduceBlockKernel0(int *in, int2 *out, int N);

// ....

void deviceReduceBlockKernel0(int *__cuda_0, struct int2 *__cuda_1, int __cuda_2)
{
    __device_stub__Z24deviceReduceBlockKernel0PiP4int2i(__cuda_0, __cuda_1, __cuda_2);
}

void __device_stub__Z24deviceReduceBlockKernel0PiP4int2i(int *__par0, struct int2 *__par1, int __par2)
{
    __cudaSetupArgSimple(__par0, 0UL);
    __cudaSetupArgSimple(__par1, 8UL);
    __cudaSetupArgSimple(__par2, 16UL);
    __cudaLaunch(((char *)((void (*)(int *, struct int2 *, int))deviceReduceBlockKernel0)));
}

i.e. the toolchain is just automagically doing what you would otherwise have to do yourself by hand (or via fancy generator templates) when you explicitly use the kernel launch APIs, be they the conventional single-launch or the new cooperative-launch APIs. In the deprecated version of the APIs, an internal stack does the dirty work for you. In the newer APIs, you build the arrays of argument pointers yourself. Same thing, just different dog food.

FWIW, you can pass an arbitrary struct (not immediately obvious from the API docs) by passing its address via the void** args array. It is not obvious, but in this case the compiler computes the sizeof from the kernel's function signature, and the right number of bytes gets copied to the kernel. The API docs don't seem to elaborate on that.

struct Param { int a, b; void* device_ptr; };
Param param{aa, bb, d_ptr};
void *kArgs[] = {&param};
cudaLaunchCooperativeKernel(..., kArgs, ...);

We can use something like the following workaround (requires --std=c++11 or higher):

#include <initializer_list>

namespace detail {

// Apply f to the address of each argument, left to right. The braced-init-list
// guarantees the evaluation order; expanding the pack inside an ordinary
// function call would leave the order unspecified. Taking &args directly also
// avoids taking the address of the rvalue that std::forward would yield.
template <typename F, typename... Args>
void for_each_argument_address(F f, Args&&... args) {
    (void) std::initializer_list<int>{ (f((void*) &args), 0)... };
}

} // namespace detail

template<typename KernelFunction, typename... KernelParameters>
inline void cooperative_launch(
    const KernelFunction&       kernel_function,
    stream::id_t                stream_id,
    launch_configuration_t      launch_configuration,
    KernelParameters...         parameters)
{
    void* arguments_ptrs[sizeof...(KernelParameters)];
    auto arg_index = 0;
    detail::for_each_argument_address(
        [&](void * x) {arguments_ptrs[arg_index++] = x;},
        parameters...);
    cudaLaunchCooperativeKernel<KernelFunction>(
        &kernel_function,
        launch_configuration.grid_dimensions,
        launch_configuration.block_dimensions,
        arguments_ptrs,
        launch_configuration.dynamic_shared_memory_size,
        stream_id);
}
