CUDA optimisation - kernel launch conditions
I am fairly new to CUDA and would like to find out more about optimising kernel launch conditions to speed up my code. This is quite a specific scenario, but I'll try to generalise it as much as possible so that anyone else with a similar question can benefit from it in the future.
Assume I've got an array of 300 elements (Array A) that is sent to the kernel as an input. This array is made up of a few repeating integers, and each integer has a device function specific to it. For example, every time 5 appears in Array A, the kernel calls the device function specific to 5.
I have parallelised this problem by launching 320 blocks (probably not the best number) so that each block performs the device function relevant to its element in parallel.
The CPU handles the entire problem serially: it takes the array element by element and calls each function one after the other. The GPU instead assigns one element to each block, so that all 320 blocks can call their device functions and calculate simultaneously.
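The set-up described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the device functions `op_for_3`, `op_for_5` and `op_default`, the kernel `dispatch_kernel`, and the host wrapper `run_dispatch` are hypothetical names, and the arithmetic inside them is a placeholder, since the real device functions are not shown in the post.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Abort with a message if a CUDA runtime call fails.
static void check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Hypothetical per-value device functions (placeholders for the real ones).
__device__ float op_for_3(float x) { return x + 3.0f; }
__device__ float op_for_5(float x) { return x * 5.0f; }
__device__ float op_default(float x) { return x; }

// One thread handles one element of Array A and dispatches on its value.
__global__ void dispatch_kernel(const int* a, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;  // surplus threads (e.g. 320 launched for 300 elements) exit here
    float x = static_cast<float>(i);  // placeholder input: the element's index
    switch (a[i]) {
        case 3:  out[i] = op_for_3(x);   break;
        case 5:  out[i] = op_for_5(x);   break;
        default: out[i] = op_default(x); break;
    }
}

// Host wrapper: copies Array A to the device, launches, and copies the results back.
std::vector<float> run_dispatch(const std::vector<int>& a, int blocks, int threads_per_block) {
    const int n = static_cast<int>(a.size());
    std::vector<float> out(n, 0.0f);
    if (n == 0) return out;

    int* d_a = nullptr;
    float* d_out = nullptr;
    check(cudaMalloc(&d_a, n * sizeof(int)), "cudaMalloc d_a");
    check(cudaMalloc(&d_out, n * sizeof(float)), "cudaMalloc d_out");
    check(cudaMemcpy(d_a, a.data(), n * sizeof(int), cudaMemcpyHostToDevice), "copy in");
    check(cudaMemset(d_out, 0, n * sizeof(float)), "cudaMemset d_out");

    dispatch_kernel<<<blocks, threads_per_block>>>(d_a, d_out, n);
    check(cudaGetLastError(), "kernel launch");

    // This blocking copy also waits for the kernel to finish.
    check(cudaMemcpy(out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost), "copy out");
    check(cudaFree(d_a), "cudaFree d_a");
    check(cudaFree(d_out), "cudaFree d_out");
    return out;
}
```

Calling `run_dispatch(a, 320, 1)` would correspond to the 320-block launch with one thread per block; the bounds check lets the 20 surplus blocks return without touching memory.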
In theory the GPU should be faster for a large number of elements. At least I thought so, but in my case it isn't. My assumption is that since 300 elements is a small number, the CPU will always be faster than the GPU.
This is acceptable, but what I want to know is how I can cut down the GPU execution time, at least by a little. Currently, the CPU takes 2.5 ms and the GPU around 12 ms.
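One way to see how much of a figure like 12 ms is fixed overhead rather than computation is to time an empty kernel with CUDA events. The sketch below assumes nothing about the real kernel: `empty_kernel` and `time_empty_launch_ms` are hypothetical names introduced here. The warm-up launch is excluded from the measurement because the first CUDA call in a process typically also pays for one-off context initialisation, which may or may not be part of the 12 ms quoted above.

```cuda
#include <cuda_runtime.h>

// A kernel that does no work, so any measured time is launch overhead.
__global__ void empty_kernel() {}

// Returns the average time in ms per launch over `reps` launches,
// or -1.0f if `reps` is not positive or a CUDA call fails.
float time_empty_launch_ms(int blocks, int threads, int reps) {
    if (reps <= 0) return -1.0f;

    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    // Warm-up launch, not timed: absorbs one-off initialisation cost.
    empty_kernel<<<blocks, threads>>>();
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r) {
        empty_kernel<<<blocks, threads>>>();
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all timed launches have finished

    float ms = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&ms, start, stop);
    cudaError_t launch_err = cudaGetLastError();
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    if (err != cudaSuccess || launch_err != cudaSuccess) return -1.0f;
    return ms / static_cast<float>(reps);
}
```

Comparing `time_empty_launch_ms(320, 1, 100)` with the same measurement taken around the real kernel and its memory copies would indicate whether the time is going into the device functions themselves or into launch and transfer overhead.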
Question 1 - How can I choose the optimum number of blocks/threads to launch at the start? First I tried 320 blocks with 1 thread per block, then 1 block with 320 threads. There was no real change in execution time. Will tweaking the number of blocks/threads improve the speed?
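For reference, the two configurations in Question 1 assign elements to threads in the same way as long as the kernel uses the usual global index `blockIdx.x * blockDim.x + threadIdx.x`. The sketch below checks this; `mark_owned`, `blocks_for` and `covers_once` are hypothetical helper names, and the real kernel's indexing is assumed rather than known.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread increments the counter of the element it would own.
__global__ void mark_owned(int* owner_count, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&owner_count[i], 1);
}

// Blocks needed to cover n elements at a given block size (ceiling division).
int blocks_for(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

// Returns true if each of the n elements is owned by exactly one thread
// under the given launch configuration; false on a mismatch or a CUDA error.
bool covers_once(int n, int blocks, int threads_per_block) {
    if (n <= 0) return false;
    int* d_count = nullptr;
    if (cudaMalloc(&d_count, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMemset(d_count, 0, n * sizeof(int)) != cudaSuccess) {
        cudaFree(d_count);
        return false;
    }

    mark_owned<<<blocks, threads_per_block>>>(d_count, n);
    bool ok = (cudaGetLastError() == cudaSuccess);

    std::vector<int> h_count(n, 0);
    ok = ok && (cudaMemcpy(h_count.data(), d_count, n * sizeof(int),
                           cudaMemcpyDeviceToHost) == cudaSuccess);
    cudaFree(d_count);

    for (int i = 0; ok && i < n; ++i) {
        ok = (h_count[i] == 1);
    }
    return ok;
}
```

Both `covers_once(300, 320, 1)` and `covers_once(300, 1, 320)` should report full coverage, and `blocks_for(300, 32)` gives 10 blocks, which is another way of arriving at 320 threads in total.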
Question 2 - If 300 elements is too small, why is that, and roughly how many elements do I need to see the GPU outperforming the CPU?
Question 3 - What optimisation techniques should I look into?
Please let me know if any of this isn't clear and I'll expand on it.
Thanks in advance.