
OpenMP C++: Load imbalance with parallelised for loop

I'm trying to parallelise a for loop, and I'm encountering undesired behaviour. The loop calls a taxing function (which contains another for loop) and then prints the result. I've parallelised the loop using #pragma omp parallel for.
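
For reference, a minimal sketch of the setup described above (the function name, loop bound, and workload are placeholders, not taken from the actual code):

    #include <cstdio>

    // Stand-in for the "taxing" function: it contains its own for loop, and its
    // cost varies with the argument, so iterations take different amounts of time.
    double expensiveComputation(int i) {
        double sum = 0.0;
        for (long j = 0; j < (i + 1) * 100000L; ++j)
            sum += 0.5 * j;
        return sum;
    }

    int main() {
        const int n = 64;
        // Without a schedule clause, most implementations default to a static
        // schedule: each thread gets one contiguous block of iterations up front.
    #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            double result = expensiveComputation(i);
            std::printf("%d: %f\n", i, result);
        }
        return 0;
    }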

The behaviour I'm seeing is: the CPU gets fully utilised at the start, and then near the end, it suddenly drops back down to 25% utilisation. My guess is that one task gets allocated to a thread, and then as most of the tasks get completed, the system waits for the newer ones to complete. Though if that were the case, I would've seen drops to 75%, 50%, and then 25%, but no, it drops straight to 25%.

I've tried to parallelise the function itself, but it made no difference. Removing the parallelisation on the loop resulted in a behaviour where usage would spike to 100%, then drop to 25%, and then repeat like that throughout execution, which resulted in even worse performance than before. I also tried a bunch of other options for the for loop, like schedule.

How would I be able to assign unused threads to the last newly created tasks? Or is something like this not possible in OpenMP?

If your guess is correct, then you should apply schedule(dynamic) to your loop, which has the following effect:

When kind is dynamic, the iterations are distributed to threads in the team in chunks. Each thread executes a chunk of iterations, then requests another chunk, until no chunks remain to be distributed. Each chunk contains chunk_size iterations, except for the chunk that contains the sequentially last iteration, which may have fewer iterations. When no chunk_size is specified, it defaults to 1.
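
Applied to a loop like the one sketched earlier, that would look roughly like this:

    // With the default chunk_size of 1, each thread grabs one iteration at a
    // time, so no thread sits idle while unstarted iterations remain.
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; ++i) {
        double result = expensiveComputation(i);
        std::printf("%d: %f\n", i, result);
    }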

You can also experiment with increasing the chunk_size (e.g., schedule(dynamic,16)) or using schedule(guided):

When kind is guided, the iterations are assigned to threads in the team in chunks. Each thread executes a chunk of iterations, then requests another chunk, until no chunks remain to be assigned. For a chunk_size of 1, the size of each chunk is proportional to the number of unassigned iterations divided by the number of threads in the team, decreasing to 1. [...]
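
As a rough illustration of those variants, reusing the placeholder loop from above (schedule(runtime) is also worth knowing: it picks the schedule from the OMP_SCHEDULE environment variable, so you can try different schedules without recompiling):

    // Dynamic with a larger chunk: less scheduling overhead, coarser balancing.
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < n; ++i)
        std::printf("%d: %f\n", i, expensiveComputation(i));

    // Guided: chunks start large and shrink as the remaining work runs out.
    #pragma omp parallel for schedule(guided)
    for (int i = 0; i < n; ++i)
        std::printf("%d: %f\n", i, expensiveComputation(i));

    // Chosen at run time, e.g.: OMP_SCHEDULE="dynamic,16" ./a.out
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < n; ++i)
        std::printf("%d: %f\n", i, expensiveComputation(i));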

Take a look at this answer for a detailed discussion about dynamic vs guided schedules.

In general, I recommend never guessing about performance. Use a sophisticated performance analysis tool that understands OpenMP and can tell you about the actual potential for optimization in your code.
