
Cuda: Kernel launch queue

I'm not finding much info on the mechanics of a kernel launch operation. The API says to see the CudaProgGuide, and I'm not finding much there either.
Since kernel execution is asynchronous, and some machines support concurrent execution, I'm led to believe there is a queue for the kernels.

    Host code:      
    1. malloc(hostArry, ......);  
    2. cudaMalloc(deviceArry, .....);  
    3. cudaMemcpy(deviceArry, hostArry, ... hostToDevice);
    4. kernelA<<<1,300>>>(int, int);  
    5. kernelB<<<10,2>>>(float, int);  
    6. cudaMemcpy(hostArry, deviceArry, ... deviceToHost);  
    7. cudaFree(deviceArry);

Line 3 is synchronous. Lines 4 and 5 are asynchronous, and the machine supports concurrent execution. So at some point, both of these kernels are running on the GPU. (There is the possibility that kernelB starts and finishes before kernelA finishes.) While this is happening, the host is executing line 6. Line 6 is synchronous with respect to the copy operation, but there is nothing preventing it from executing before kernelA or kernelB has finished.

1) Is there a kernel queue in the GPU? (Does the GPU block/stall the host?)
2) How does the host know that the kernel has finished, and it is "safe" to Xfer the results from the device to the host?

Yes, there are a variety of queues on the GPU, and the driver manages those.

Asynchronous calls return more or less immediately. Synchronous calls do not return until the operation is complete. Kernel calls are asynchronous. Most other CUDA runtime API calls carry the suffix Async if they are asynchronous. So to answer your questions:

1) Is there a kernel queue in the GPU? (Does the GPU block/stall the host?)

There are various queues. The GPU blocks/stalls the host on a synchronous call, but a kernel launch is not a synchronous operation. It returns immediately, before the kernel has completed, and perhaps before the kernel has even started. When launching operations into a single stream, all CUDA operations in that stream are serialized. Therefore, even though kernel launches are asynchronous, you will not observe overlapped execution for two kernels launched into the same stream, because the CUDA subsystem guarantees that a given CUDA operation in a stream will not start until all previous CUDA operations in the same stream have finished. There are other specific rules for the null stream (the stream you are using if you don't explicitly call out streams in your code), but the preceding description is sufficient for understanding this question.
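
As a rough sketch of those stream rules (the kernel bodies, stream names, and launch parameters below are placeholders, not something taken from the question): launching both kernels into one stream serializes them, while launching them into two non-default streams lets them overlap on hardware that supports concurrent kernel execution.

    __global__ void kernelA(int a, int b)   { /* placeholder body */ }
    __global__ void kernelB(float a, int b) { /* placeholder body */ }

    int main() {
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        // Same stream: kernelB cannot start until kernelA has finished,
        // even though both launches return to the host immediately.
        kernelA<<<1, 300, 0, s1>>>(1, 2);
        kernelB<<<10, 2, 0, s1>>>(1.0f, 2);

        // Different (non-default) streams: on a device that supports
        // concurrent kernel execution, these two may overlap.
        kernelA<<<1, 300, 0, s1>>>(1, 2);
        kernelB<<<10, 2, 0, s2>>>(1.0f, 2);

        cudaDeviceSynchronize();   // wait for all queued work before exiting
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
        return 0;
    }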

2) How does the host know that the kernel has finished, and it is "safe" to Xfer the results from the device to the host?

Since the operation that transfers results from the device to the host is a CUDA call (cudaMemcpy...), and it is issued in the same stream as the preceding operations, the device and the CUDA driver manage the execution sequence of CUDA calls so that the cudaMemcpy does not begin until all previous CUDA calls issued to the same stream have completed. Therefore a cudaMemcpy issued after a kernel call in the same stream is guaranteed not to start until the kernel call is complete, even if you use cudaMemcpyAsync.
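
To make that concrete, here is a minimal sketch of the host code from the question (the array size, kernel bodies, and kernel arguments are assumptions made for illustration). Everything is issued into the null stream, so the final cudaMemcpy is ordered after both kernels and also blocks the host until the copy completes:

    #include <cstdlib>

    __global__ void kernelA(int a, int b)   { /* placeholder body */ }
    __global__ void kernelB(float a, int b) { /* placeholder body */ }

    int main() {
        int *hostArry = (int *)malloc(300 * sizeof(int));
        int *deviceArry;
        cudaMalloc((void **)&deviceArry, 300 * sizeof(int));
        cudaMemcpy(deviceArry, hostArry, 300 * sizeof(int), cudaMemcpyHostToDevice);

        kernelA<<<1, 300>>>(1, 2);    // returns immediately
        kernelB<<<10, 2>>>(1.0f, 2);  // queued behind kernelA in the null stream

        // Cannot begin until both kernels have finished, and blocks the
        // host until the copy itself completes.
        cudaMemcpy(hostArry, deviceArry, 300 * sizeof(int), cudaMemcpyDeviceToHost);

        // hostArry is now safe to read on the host.
        cudaFree(deviceArry);
        free(hostArry);
        return 0;
    }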

You can use cudaDeviceSynchronize() after a kernel call to guarantee that all previous tasks requested of the device have been completed. If the results of kernelB are independent of the results of kernelA, you can place this call right before the memory copy operation. If not, you will need to block the device before calling kernelB, resulting in two blocking operations.
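
If you do move to explicitly asynchronous copies, a sketch along these lines applies (the stream name, sizes, and pinned-memory allocation are assumptions): cudaMemcpyAsync does not block the host, so an explicit synchronization is needed before the host reads the results.

    __global__ void kernelA(int a, int b)   { /* placeholder body */ }
    __global__ void kernelB(float a, int b) { /* placeholder body */ }

    int main() {
        int *hostArry, *deviceArry;
        cudaMallocHost((void **)&hostArry, 300 * sizeof(int));  // pinned memory for a truly async copy
        cudaMalloc((void **)&deviceArry, 300 * sizeof(int));

        cudaStream_t s;
        cudaStreamCreate(&s);

        kernelA<<<1, 300, 0, s>>>(1, 2);
        kernelB<<<10, 2, 0, s>>>(1.0f, 2);

        // Ordered after both kernels on the device, but returns to the host immediately.
        cudaMemcpyAsync(hostArry, deviceArry, 300 * sizeof(int),
                        cudaMemcpyDeviceToHost, s);

        // Block the host until all work in this stream (kernels + copy) is done;
        // cudaDeviceSynchronize() would wait on every stream instead.
        cudaStreamSynchronize(s);

        // hostArry is now safe to read.
        cudaStreamDestroy(s);
        cudaFree(deviceArry);
        cudaFreeHost(hostArry);
        return 0;
    }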
