
OpenCL: loop kernel?

I'm running an OpenCL kernel that processes and re-processes the same data set over and over (it's an iterative physics solver).

In my tests, there is a non-trivial cost to calling clEnqueueNDRangeKernel. For example, when running 1000 substeps of the simulation (requiring 1000 identical calls to clEnqueueNDRangeKernel to process the same data), it seems that those calls to clEnqueueNDRangeKernel actually become the bottleneck. My (pseudo)code looks like this:

[create buffers]
[set kernel arguments]

for (int i = 0; i < 1000; i++) // queuing the kernels takes a while
{
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                           0, NULL, NULL);
}

clFinish(queue); // waiting for the queue to complete doesn't take much time
[read buffers]

I understand that the first call to clEnqueueNDRangeKernel will initiate any deferred buffer transfers to the GPU, so the first call can carry an additional cost. However, in my tests, a loop of 10 iterations is substantially faster than a loop of 1000 iterations, which leads me to believe the data transfer is not the bottleneck.

I'm also under the impression that clEnqueueNDRangeKernel is non-blocking, in the sense that it returns without waiting for the kernel to complete, so the kernel's complexity shouldn't be the bottleneck (in my case, kernel execution shouldn't block anything until the call to clFinish()).

However, when I profiled my code, the majority of the time is spent simply executing the for loop, before the call to clFinish(), so it seems that queuing the kernels is itself what takes the most time here.
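Schematically, my measurement looks like this (now() stands in for whatever wall-clock timer is used):

    double t0 = now();
    for (int i = 0; i < 1000; i++)
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                               0, NULL, NULL);
    double t1 = now(); // (t1 - t0): the enqueue loop, which dominates
    clFinish(queue);
    double t2 = now(); // (t2 - t1): waiting for completion, which is small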

My question: is there a way to tell the GPU to re-run a previously queued kernel N times, rather than having to queue it manually N times? In my situation, no kernel arguments need to change between iterations; the kernel just needs to be re-run. Can repeated calls to it be made more efficient?

OpenCL 2.x supports dynamic parallelism, which lets one work-item launch new kernels. If the kernel launches don't need any GPU-CPU data transfer in between, you can have a single work-item launch all 1000 kernels and wait for each one to finish. Use events to make the child kernels run one after another.
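A minimal device-side sketch of that idea (OpenCL C 2.0; solve_step and its arguments are placeholders for your solver, and the host must create the default on-device queue with CL_QUEUE_ON_DEVICE | CL_QUEUE_ON_DEVICE_DEFAULT):

    kernel void solve_step(global float *data, int steps_left)
    {
        // ... one substep of the physics solve on `data` ...

        // One work-item re-enqueues the whole NDRange for the next substep,
        // so the host calls clEnqueueNDRangeKernel only once.
        if (get_global_id(0) == 0 && steps_left > 1) {
            // CLK_ENQUEUE_FLAGS_WAIT_KERNEL: the child doesn't start until
            // every work-item of this (parent) launch has finished.
            enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL,
                           ndrange_1D(get_global_size(0)),
                           ^{ solve_step(data, steps_left - 1); });
        }
    }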

In OpenCL 1.2, you could use atomics and a loop to do an "in-flight threads" style of kernel synchronization, but in my opinion that wouldn't be faster than a new kernel launch, and it is not a portable way of synchronizing them.

If each kernel launch costs more time than each kernel run, then there is not enough work being done on the GPU per launch. Simply computing c = a + b on the GPU won't be fast, because scheduling the kernel onto the GPU pipelines takes more time than computing c = a + b itself.
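That degenerate case looks like this: one addition per work-item, so launch and scheduling overhead dominate the arithmetic:

    kernel void add(global const float *a, global const float *b,
                    global float *c)
    {
        size_t i = get_global_id(0);
        c[i] = a[i] + b[i]; // far too little work per work-item
    }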

But you can still use the following approach, built on user events (with clEnqueueWaitForEvents or event wait lists) and in-order command queues:

thread-0:
    create a user event, not yet triggered
    enqueue 1000 kernels; they don't start because they wait on the untriggered event
thread-1:
    nothing

next timestep:

thread-0:
    create a new user event for a new command queue, not yet triggered
    enqueue 1000 kernels on the new command queue so they don't start yet
thread-1:
    run the old command queue from the last timestep by triggering its user event
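A minimal host-side sketch of that gating, assuming context, the two in-order queues, kernel, and global_size are already set up (clCreateUserEvent and clSetUserEventStatus are the real APIs; error handling omitted):

    cl_int err;
    cl_event gateA = clCreateUserEvent(context, &err);

    // Fill queueA behind the closed gate: in an in-order queue only the
    // first kernel needs to wait on the event; the rest serialize behind it.
    clEnqueueNDRangeKernel(queueA, kernel, 1, NULL, &global_size, NULL,
                           1, &gateA, NULL);
    for (int i = 1; i < 1000; i++)
        clEnqueueNDRangeKernel(queueA, kernel, 1, NULL, &global_size, NULL,
                               0, NULL, NULL);
    clFlush(queueA); // hand the batch to the driver without blocking

    // Next timestep: fill queueB the same way behind gateB, then open last
    // timestep's gate so its 1000 kernels run while queueB is being filled.
    clSetUserEventStatus(gateA, CL_COMPLETE);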

so that enqueueing and running can be "overlapped", at least. If you need a higher enqueue-to-run overlap ratio:

threads 0 to N-2:
    create a new user event for a new command queue, not yet triggered
    enqueue 1000 kernels on the new command queue so they don't start yet
thread N-1:
    iterate over all command queues
        run the currently selected command queue from the last timestep by triggering its user event

Now enqueueing is N-1 times faster, and running the kernels costs only the GPU-side scheduling overhead. If that GPU-side scheduling overhead is itself significant (say, 1M work-items for 1M c = a + b calculations), then you should do more work per work-item.

It may be even better to make this a producer-consumer style kernel launcher, where 7 threads produce filled command queues waiting on their own user events and an 8th thread consumes them by triggering those events (see the sketch below). This would work even if the batches need to upload/download data to/from the GPU.
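A schematic of that scale-up, reusing the gating from the sketch above (NPRODUCERS and the threading layer are assumptions; each queue is in-order, and t is the producer thread's index):

    enum { NPRODUCERS = 7 };
    cl_command_queue queues[NPRODUCERS]; // one in-order queue per producer
    cl_event gates[NPRODUCERS];          // one user event per queue

    // Each producer thread t fills its own queue behind its closed gate:
    gates[t] = clCreateUserEvent(context, &err);
    clEnqueueNDRangeKernel(queues[t], kernel, 1, NULL, &global_size, NULL,
                           1, &gates[t], NULL); // first kernel waits
    for (int i = 1; i < 1000; i++)
        clEnqueueNDRangeKernel(queues[t], kernel, 1, NULL, &global_size, NULL,
                               0, NULL, NULL);
    clFlush(queues[t]);

    // The consumer thread releases last timestep's batches one by one:
    for (int t = 0; t < NPRODUCERS; t++)
        clSetUserEventStatus(gates[t], CL_COMPLETE);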

Even old GPUs like the HD7870 support 32+ command queues per GPU at the same time, so you can scale enqueueing performance up with a high-end CPU.

If a PCI-e bridge is causing the bottleneck (high latency from risers?), then OpenCL 2.x dynamic parallelism should be better than the CPU-side producer-consumer pattern.
