
OpenMP C++: Load imbalance with parallelised for loop

I'm trying to parallelise a for loop, and I'm encountering undesired behaviour. The loop calls a taxing function (which contains another for loop) and then prints the result. I've parallelised the loop using #pragma omp parallel for.
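
For reference, a minimal sketch of the setup described above (the function name, loop bound, and workload are placeholders, not taken from the actual code):

    #include <cstdio>

    // Stand-in for the "taxing" function: it contains its own for loop, and its
    // cost varies with the argument, so iterations take different amounts of time.
    double expensiveComputation(int i) {
        double sum = 0.0;
        for (long j = 0; j < (i + 1) * 100000L; ++j)
            sum += 0.5 * j;
        return sum;
    }

    int main() {
        const int n = 64;
        // Without a schedule clause, most implementations default to a static
        // schedule: each thread gets one contiguous block of iterations up front.
    #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            double result = expensiveComputation(i);
            std::printf("%d: %f\n", i, result);
        }
        return 0;
    }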

The behaviour I'm seeing is: the CPU gets fully utilised at the start, and then near the end, it suddenly drops back down to 25% utilisation. My guess is that one task gets allocated to a thread, and then as most of the tasks get completed, the system waits for the newer ones to complete. Though if that were the case, I would've seen drops to 75%, 50%, and then 25%, but no, it drops straight to 25%.

I've tried to parallelise the function itself, but it made no difference. Removing the parallelisation on the loop resulted in a behaviour where usage would spike to 100%, then drop to 25%, and then repeat like that throughout execution, which resulted in even worse performance than before. I also tried a bunch of other options for the for loop, like schedule.

How would I be able to assign unused threads to the last newly created tasks? Or is something like this not possible in OpenMP?

If your guess is correct, then you should apply schedule(dynamic) to your loop, which has the following effect:

When kind is dynamic, the iterations are distributed to threads in the team in chunks. Each thread executes a chunk of iterations, then requests another chunk, until no chunks remain to be distributed. Each chunk contains chunk_size iterations, except for the chunk that contains the sequentially last iteration, which may have fewer iterations. When no chunk_size is specified, it defaults to 1.
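
Applied to a loop like the one sketched earlier, that would look roughly like this:

    // With the default chunk_size of 1, each thread grabs one iteration at a
    // time, so no thread sits idle while unstarted iterations remain.
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; ++i) {
        double result = expensiveComputation(i);
        std::printf("%d: %f\n", i, result);
    }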

You can also experiment with increasing the chunk_size (e.g., schedule(dynamic,16)) or using schedule(guided):

When kind is guided, the iterations are assigned to threads in the team in chunks. Each thread executes a chunk of iterations, then requests another chunk, until no chunks remain to be assigned. For a chunk_size of 1, the size of each chunk is proportional to the number of unassigned iterations divided by the number of threads in the team, decreasing to 1. [...]
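
As a rough illustration of those variants, reusing the placeholder loop from above (schedule(runtime) is also worth knowing: it picks the schedule from the OMP_SCHEDULE environment variable, so you can try different schedules without recompiling):

    // Dynamic with a larger chunk: less scheduling overhead, coarser balancing.
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < n; ++i)
        std::printf("%d: %f\n", i, expensiveComputation(i));

    // Guided: chunks start large and shrink as the remaining work runs out.
    #pragma omp parallel for schedule(guided)
    for (int i = 0; i < n; ++i)
        std::printf("%d: %f\n", i, expensiveComputation(i));

    // Chosen at run time, e.g.: OMP_SCHEDULE="dynamic,16" ./a.out
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < n; ++i)
        std::printf("%d: %f\n", i, expensiveComputation(i));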

Take a look at this answer for a detailed discussion about dynamic vs guided schedules.

In general, I recommend never guessing about performance. Use a sophisticated performance analysis tool that understands OpenMP and can tell you about the actual potential for optimization in your code.
