
OpenMP parallel spiking

I'm using OpenMP in Visual Studio 2010 to speed up loops.

I wrote a very simple test to see the performance increase using OpenMP. I use omp parallel on an empty loop:

int time_before = clock();

#pragma omp parallel for
for(int i = 0; i < 4; i++){

}

int time_after = clock();

std::cout << "time elapsed: " << (time_after - time_before) << " milliseconds" << std::endl;

Without the omp pragma it consistently takes 0 milliseconds to complete (as expected), and with the pragma it usually takes 0 as well. The problem is that with the omp pragma it spikes occasionally, anywhere from 10 to 32 milliseconds. Every time I tried parallelizing with OpenMP I got these random spikes, so I tried this very basic test. Are the spikes an inherent part of OpenMP, or can they be avoided?

The parallel for gives me great speed boosts on some loops, but these random spikes are too big for me to be able to use it.

That's pretty normal behavior. Sometimes your operating system is busy and needs more time to spawn new threads.

I want to complement kukis' answer: I'd also say that the reason for the spikes is the additional overhead that comes with OpenMP.

Furthermore, as you are doing performance-sensitive measurements, I hope that you compiled your code with optimizations turned on. In that case, the loop without OpenMP simply gets optimized out by the compiler, so there is no code between time_before and time_after. With OpenMP, however, at least g++ 4.8.1 (-O3) is unable to optimize the code away: the loop is still there in the assembly, and contains additional statements to manage the work-sharing. (I cannot try it with VS at the moment.)

So, the comparison is not really fair, as the version without OpenMP gets optimized out completely.

Edit: You also have to keep in mind that OpenMP doesn't re-create threads every time. Rather, it uses a thread pool. So, if you execute an omp construct before your loop, the threads will already have been created when it encounters another one:

// Dummy loop: Spawn the threads.
#pragma omp parallel for
for(int i = 0; i < 4; i++){
}

int time_before = clock();

// Do the actual measurement. OpenMP re-uses the threads.
#pragma omp parallel for
for(int i = 0; i < 4; i++){
}

int time_after = clock();

In this case, the spikes should vanish.

If "OpenMP parallel spiking", which I would call "parallel overhead", is a concern in your loop, this implies you probably don't have enough workload to parallelize. Parallelization yields a speedup only if you have a sufficient problem size. You already showed an extreme example: no work in a parallelized loop. In such a case, you will see highly fluctuating times due to parallel overhead.

The parallel overhead in OpenMP's omp parallel for includes several factors:

  • First, omp parallel for is the sum of omp parallel and omp for.
  • The overhead of spawning or awakening threads (many OpenMP implementations won't create/destroy threads on every omp parallel).
  • Regarding omp for, the overhead of (a) dispatching workloads to worker threads and (b) scheduling (especially if dynamic scheduling is used).
  • The overhead of the implicit barrier at the end of omp parallel, unless nowait is specified.

FYI, in order to measure OpenMP's parallel overhead, the following would be more effective:

double measureOverhead(int tripCount) {
  static const size_t TIMES = 10000;
  int sum = 0;

  clock_t startTime = clock();
  for (size_t k = 0; k < TIMES; ++k) {
    for (int i = 0; i < tripCount; ++i) {
      sum += i;
    }
  }
  clock_t elapsedTime = clock() - startTime;

  clock_t startTime2 = clock();
  for (size_t k = 0; k < TIMES; ++k) {
  #pragma omp parallel for private(sum) // We don't care about correctness of sum.
                                        // Otherwise, use "reduction(+: sum)".
    for (int i = 0; i < tripCount; ++i) {
      sum += i;
    }
  }
  clock_t elapsedTime2 = clock() - startTime2;

  double parallelOverhead = double(elapsedTime2 - elapsedTime)/double(TIMES);
  return parallelOverhead;
}

Try running such small code many times, then take an average. Also, put at least a minimal workload in the loops. In the above code, parallelOverhead is an approximated overhead of OpenMP's omp parallel for construct.

