
Slow GPU performance on OpenCL kernel

I'm kind of at a loss about the performance of OpenCL on an AMD GPU (Hawaii core, i.e. a Radeon R9 390).

The operation is as follows:

  • send memory object #1 to GPU
  • execute kernel #1
  • send memory object #2 to GPU
  • execute kernel #2
  • send memory object #3 to GPU
  • execute kernel #3

The dependencies are:

  • kernel #1 on memory object #1
  • kernel #2 on memory object #2 as well as the output memory of kernel #1
  • kernel #3 on memory object #3 as well as the output memory of kernels #1 & #2

Memory transfers and kernel executions are performed in two separate command queues. The command dependencies are expressed through OpenCL events.
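To make the pattern concrete, here is a minimal sketch of how the commands are issued; the queue, buffer and kernel handles, the sizes and the work size are simplified placeholders, not the exact code (context, queues, buffers and kernels are assumed to be created already):

    #include <CL/cl.h>

    cl_event write_evt[3], kernel_evt[3];
    size_t gws = GLOBAL_WORK_SIZE;                     /* placeholder work size */

    /* transfer queue: upload the three memory objects asynchronously */
    clEnqueueWriteBuffer(transfer_q, buf1, CL_FALSE, 0, size1, host1, 0, NULL, &write_evt[0]);
    clEnqueueWriteBuffer(transfer_q, buf2, CL_FALSE, 0, size2, host2, 0, NULL, &write_evt[1]);
    clEnqueueWriteBuffer(transfer_q, buf3, CL_FALSE, 0, size3, host3, 0, NULL, &write_evt[2]);

    /* compute queue: each kernel waits on its input transfer and on the
       previous kernel's output via the event wait lists */
    clEnqueueNDRangeKernel(compute_q, k1, 1, NULL, &gws, NULL, 1, &write_evt[0], &kernel_evt[0]);

    cl_event k2_deps[] = { write_evt[1], kernel_evt[0] };
    clEnqueueNDRangeKernel(compute_q, k2, 1, NULL, &gws, NULL, 2, k2_deps, &kernel_evt[1]);

    cl_event k3_deps[] = { write_evt[2], kernel_evt[1] };
    clEnqueueNDRangeKernel(compute_q, k3, 1, NULL, &gws, NULL, 2, k3_deps, &kernel_evt[2]);

    clFlush(transfer_q);
    clFlush(compute_q);
    clWaitForEvents(1, &kernel_evt[2]);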

The whole operation is looped with the same input data, purely for performance analysis.

[CodeXL timeline]

As you can see in the timeline, the host waits a very long time in clWaitForEvents() for the GPU to finish, while the GPU is idle most of the time. You can also see the repeated operation. For convenience I also provide the list of all issued OpenCL commands.

[CodeXL commands]

My questions now are:

  1. Why is the GPU idling so much? In my head I can easily push all the "blue" items together and start the operations right away. Memory transfer runs at 6 GB/s, which is the expected rate.

  2. Why are the kernels executed so late? Why is there a gap between kernel #2 and kernel #3 execution?

  3. Why are memory transfers and kernels not executed in parallel? I use 2 command queues; with only 1 queue the performance is even worse.

Just by pushing all commands together in my head (keeping the dependencies, of course, so the 1st green must start after the 1st blue) I could triple the performance. I don't know why the GPU is so sluggish. Does anyone have some insight?


Some number crunching

  • Memory Transfer #1 is 253 µs
  • Memory Transfer #2 is 120 µs
  • Memory Transfer #3 is 143 µs, which is always too high for unknown reasons; it should be about 1/2 of #2, i.e. in the range of 70-80 µs
  • Kernel #1 is 74 µs
  • Kernel #2 is 95 µs
  • Kernel #3 is 107 µs

Since Kernel #1 is faster than Memory Transfer #2, and Kernel #2 is faster than Memory Transfer #3, the overall time should be:

  • 253 µs + 120 µs + 143 µs + 107 µs = 623 µs

but clWaitForEvents is:

  • 1758 µs, or about 3x as much

Yes, there are some losses, and I'm fine with something like 10% (60 µs), but 300% is too much.

As @DarkZeros has said, you need to hide the kernel-enqueue overhead by using multiple command queues so that they overlap in the timeline.

Why is the GPU idling so much?

Because you are using 2 command queues and they are (probably) running serially, with events that make them wait even longer.

You should use a single queue if everything is serial. You should let two queues overlap work only if you can add double-buffering or similar techniques to advance the computations.
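For example, a hypothetical double-buffering sketch: even iterations go to one queue and odd iterations to the other, so the upload for iteration i+1 can overlap the kernels of iteration i (the queue, buffer and kernel names are placeholders):

    for (int i = 0; i < N_ITER; ++i) {
        cl_command_queue q = queue[i % 2];    /* ping-pong between two queues */
        cl_mem in  = in_buf[i % 2];           /* two independent buffer sets  */
        cl_mem out = out_buf[i % 2];

        clEnqueueWriteBuffer(q, in, CL_FALSE, 0, size, host_in[i], 0, NULL, NULL);
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &in);
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &out);
        clEnqueueNDRangeKernel(q, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
        clFlush(q);                           /* push work without blocking   */
    }
    clFinish(queue[0]);
    clFinish(queue[1]);

Within each in-order queue the write and the kernel are ordered automatically, and since alternating iterations use different buffers no cross-queue events are needed.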

Why are the kernels executed so late?

The wide gaps consist of host-side latencies such as enqueueing commands, flushing commands to the device, host-side algorithms and device-side event control logic. Events themselves may get as small as 20-30 microseconds, but host-device interactions cost more than that.

If you get rid of the events and use a single queue, the driver may even apply early-compute techniques to fill those gaps before you enqueue those commands (maybe), just as CPUs do early branching (prediction).
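As a sketch, with a single in-order queue the commands already execute in submission order, so all per-command events can be dropped and only one synchronization point remains (handles are placeholders):

    clEnqueueWriteBuffer(queue, buf1, CL_FALSE, 0, size1, host1, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, k1, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clEnqueueWriteBuffer(queue, buf2, CL_FALSE, 0, size2, host2, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, k2, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clEnqueueWriteBuffer(queue, buf3, CL_FALSE, 0, size3, host3, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, k3, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clFinish(queue);    /* one synchronization point instead of per-event waits */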

Why are memory transfers and kernels not executed in parallel?

There is no such enforcement, but the driver can also check dependencies between kernels and copies, and to keep the data intact it may halt some operations until others finish (maybe).

Are you sure kernels and buffer copies are completely independent?

Another reason could be that the two queues do not have much to choose from when overlapping. If both queues carried both types of operations, they would have more options to overlap, such as kernel + kernel and copy + copy, instead of just kernel + copy.


If the program has too many small kernels, you may try OpenCL 2.0 dynamic parallelism, which lets the device enqueue kernels itself; that is faster than host-side enqueueing.
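A minimal device-side enqueue sketch in OpenCL C 2.0; the parent/child kernels here are made up for illustration, and the host has to create an on-device queue and build the program with -cl-std=CL2.0:

    kernel void child(global float *data) {
        data[get_global_id(0)] *= 2.0f;
    }

    kernel void parent(global float *data) {
        if (get_global_id(0) == 0) {
            queue_t q = get_default_queue();
            /* enqueue the child kernel from the device, no host round-trip */
            enqueue_kernel(q, CLK_ENQUEUE_FLAGS_WAIT_KERNEL,
                           ndrange_1D(1024),
                           ^{ child(data); });
        }
    }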

Maybe you can add a higher level of parallelism, such as image-level parallelism (if it is image processing you do), to keep the GPU busy. Work on 5-10 images at the same time, which should ensure independent kernel/buffer executions unless all the images are in the same buffer. If that doesn't work, you can launch 5-10 processes of the same program (process-level parallelism). But having too many contexts can run into driver limitations, so image-level parallelism should be better.

An R9 390 should be able to work with 8-16 command queues.

1758 µs

Sometimes even empty kernels make the host wait for 100-500 µs. Most probably you should enqueue 1000 cycles and wait only once at the end. If each cycle runs after a user button click, the user won't notice the 1.7 ms latency anyway.
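For instance (enqueue_one_cycle is a hypothetical helper that issues the 3 writes and 3 kernels of one iteration):

    for (int cycle = 0; cycle < 1000; ++cycle)
        enqueue_one_cycle(queue);    /* hypothetical helper: 3 writes + 3 kernels */
    clFinish(queue);                 /* pay the host-side wait once for the whole batch */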


  • Use many queues.
  • Get rid of events between queues (if any).
  • Have each queue do all kinds of work.
  • Run many iterations before a single wait for an event on the host side.
  • If OpenCL 2.0 is available, also try device-side enqueue, but that only works for kernel executions, not for copies to/from the host.
