
How can I make CUDA return control after kernel launch?

It might be a stupid question, but is there a way to return asynchronously from a kernel? For example, I have a kernel that performs a first stream compaction, whose output goes to the user, but which must then perform a second stream compaction to update its internal structure.

Is there a way to return control to the user once the first stream compaction is done, while the GPU continues the second stream compaction in the background? Of course, the second stream compaction works only on shared and global memory; there is nothing the user needs to retrieve from it.

I can't use Thrust.

A GPU kernel does not, in itself, take control away from the "user", i.e. from CPU threads on the system hosting the GPU.

However, with CUDA's runtime API, the default way to invoke a GPU kernel places it on the default stream:

my_kernel<<<my_grid_dims,my_block_dims,dynamic_shared_memory_size>>>(args,go,here);
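That launch call actually returns to the calling thread right away; what usually makes it feel blocking is that the default stream serializes with other work on the device, and the next synchronous API call waits for the kernel to finish. A minimal sketch of that behavior (d_out, h_out and nbytes are placeholders, not names from the question):

my_kernel<<<my_grid_dims,my_block_dims>>>(d_out);
// the host thread is free here and could do other work, but...
cudaMemcpy(h_out, d_out, nbytes, cudaMemcpyDeviceToHost);
// ...the synchronous copy above blocks until my_kernel has finished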

To get asynchrony explicitly under your control, use (non-default) streams. These are hardware-supported execution queues on which you can enqueue work (memory copies, kernel executions, etc.) asynchronously, just as you asked.

Your launch in this case may look like:

cudaStream_t my_stream;
cudaError_t result = cudaStreamCreateWithFlags(&my_stream, cudaStreamNonBlocking);  
if (result != cudaSuccess) { /* error handling */ }

my_kernel<<<my_grid_dims,my_block_dims,dynamic_shared_memory_size,my_stream>>>(args,go,here);
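With that in place, a hedged sketch of the scenario from the question might look as follows (the kernel names, buffers and sizes are invented for illustration, and h_out should be pinned host memory, e.g. from cudaMallocHost, for the copy to truly overlap):

cudaEvent_t first_done;
cudaEventCreate(&first_done);

// the first compaction, whose output the user wants:
first_compaction<<<grid_dims,block_dims,0,my_stream>>>(d_in, d_out, d_count);
cudaMemcpyAsync(h_out, d_out, nbytes, cudaMemcpyDeviceToHost, my_stream);
cudaEventRecord(first_done, my_stream);

// the second, internal compaction runs in the background after the copy:
second_compaction<<<grid_dims,block_dims,0,my_stream>>>(d_internal);

// control returns to the host immediately; wait only when the first
// result is actually needed:
cudaEventSynchronize(first_done);
// second_compaction may well still be executing on the GPU at this point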

There are lots of resources on using streams; the posts on NVIDIA's developer blog are a good starting point, and the CUDA Programming Guide has a large section on asynchronous concurrent execution.

Streams and various libraries

Thrust has offered asynchronous functionality for a while now, using thrust::future and other constructs.
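For example, a minimal sketch (assuming a Thrust version that ships the asynchronous algorithms, i.e. roughly 1.9.4 and later):

#include <thrust/async/reduce.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <cstdio>

int main() {
    thrust::device_vector<int> data(1 << 20, 1);
    // returns a future immediately; the reduction runs asynchronously:
    auto fut = thrust::async::reduce(thrust::device, data.begin(), data.end());
    // ... the host is free to do other work here ...
    int sum = fut.get();  // blocks only when the result is actually needed
    std::printf("sum = %d\n", sum);
}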

My own Modern-C++ CUDA API wrappers make it somewhat easier to work with streams, relieving you of the need to check for errors all the time and to remember to destroy streams and release memory before they go out of scope. The syntax looks something like this:

auto stream = device.create_stream(cuda::stream::async);
stream.enqueue.copy(d_a.get(), a.get(), nbytes);
stream.enqueue.kernel_launch(my_kernel, launch_config, d_a.get(), more, args);

(and errors throw an exception)
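When you do eventually need the results, the stream object can be synchronized on directly (stream.synchronize()) rather than through a raw cudaStreamSynchronize() call.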
