
CUDA optimisation - kernel launch conditions

I am fairly new to CUDA and would like to find out more about optimising kernel launch conditions to speed up my code. This is quite a specific scenario but I'll try to generalise it as much as possible so anyone else with a similar question can gain from this in the future.

Assume I've got an array of 300 elements (Array A) that is sent to the kernel as an input. This array is made up of a few repeating integers, each of which has a device function specific to it. For example, every time 5 appears in Array A, the kernel performs the device function specific to 5.

I have parallelised this problem by launching 320 blocks (probably not the best number), so that each block performs the device function relevant to its element in parallel.
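A minimal sketch of that setup might look like the following. The element values (5, 7) and the device functions `funcA`/`funcB` are hypothetical stand-ins, since the question doesn't show the actual functions; the point is the per-element dispatch:

```cuda
// Illustrative device functions - placeholders for the real per-integer work.
__device__ float funcA(float x) { return x * 2.0f; }
__device__ float funcB(float x) { return x + 1.0f; }

// Each thread handles one element of A and dispatches on its value.
__global__ void dispatchKernel(const int *A, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;               // guard: extra threads do nothing

    switch (A[i]) {                   // pick the device function for this element
    case 5:  out[i] = funcA(1.0f); break;
    case 7:  out[i] = funcB(1.0f); break;
    default: out[i] = 0.0f;        break;
    }
}
```

Note that if threads within the same warp take different `switch` branches, those branches execute serially (warp divergence), which eats into the parallel speedup.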

The CPU handles the entire problem serially: it takes the elements one by one and calls each function one after the other. The GPU instead allocates an element to each block, so that all 320 blocks can access the relevant device functions and calculate simultaneously.

In theory, for a large number of elements the GPU should be faster - at least I thought so, but in my case it isn't. My assumption is that since 300 elements is a small number, the CPU will always be faster than the GPU.

This is acceptable, BUT what I want to know is how I can cut down the GPU execution time, at least by a little. Currently, the CPU takes 2.5 ms and the GPU around 12 ms.
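When comparing numbers like these it's worth making sure the GPU figure measures only the kernel, not one-time costs such as context creation or the first launch. One common way is CUDA events (the kernel name and launch parameters here are illustrative):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
dispatchKernel<<<blocks, threadsPerBlock>>>(dA, dOut, n);  // the kernel under test
cudaEventRecord(stop);
cudaEventSynchronize(stop);           // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

If host-to-device and device-to-host copies are part of what you care about, time those separately - for 300 elements the transfer overhead alone can dominate the computation.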

Question 1 - How can I choose the optimum number of blocks/threads to launch at the start? First I tried 320 blocks with 1 thread per block. Then 1 block with 320 threads. No real change in execution time. Will tweaking the number of blocks/threads improve the speed?

Question 2 - If 300 elements is too small, why is that, and roughly how many elements do I need to see the GPU outperforming the CPU?

Question 3 - What optimisation techniques should I look into?

Please let me know if any of this isn't clear and I'll expand on it.

Thanks in advance.

  1. Internally, CUDA manages threads in groups of 32 (so-called warps). If you have 1 thread per block, the device will still schedule a full warp of 32 - the other 31 threads will simply be inactive. This is potentially an occupancy issue, though you may not observe it on your device with your problem size. There is also a limit on the number of blocks a given multiprocessor (SM) can execute. AFAIR, GeForce 4x can run up to 8 blocks on one SM. Hence, if you have a device with 8 SMs and a block size of 1, you can simultaneously run only 64 threads. You can use the occupancy calculator tool to estimate a better block size - or you can use the visual profiler.
  2. This can only be decided by profiling. There are too many unknowns - e.g. what your ratio of memory accesses to actual computation is, how parallelisable your task is, etc.
  3. I would really recommend you start with the CUDA Best Practices Guide.
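The launch-configuration advice in point 1 can be sketched as follows. Instead of hand-picking 320×1 or 1×320, the CUDA runtime can suggest a block size that maximises occupancy for a specific kernel via `cudaOccupancyMaxPotentialBlockSize` (available since CUDA 6.5; `dispatchKernel` here stands in for whatever kernel you actually launch):

```cuda
int minGridSize = 0, blockSize = 0;

// Ask the runtime for an occupancy-maximising block size for this kernel.
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                   dispatchKernel, 0, 0);

int n = 300;                                    // number of elements
int gridSize = (n + blockSize - 1) / blockSize; // round up so every element is covered

dispatchKernel<<<gridSize, blockSize>>>(dA, dOut, n);
```

As a rule of thumb, the block size should be a multiple of the warp size (32); for a problem this small the whole array may fit in one or two blocks, which is exactly why the GPU has little opportunity to hide latency and the CPU wins.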

