
OpenMP parallel spiking

I'm using OpenMP in Visual Studio 2010 to speed up loops.

I wrote a very simple test to see the performance increase from using OpenMP. I used omp parallel for on an empty loop:

int time_before = clock();

#pragma omp parallel for
for(int i = 0; i < 4; i++){

}

int time_after = clock();

std::cout << "time elapsed: " << (time_after - time_before) << " milliseconds" << std::endl;

Without the omp pragma it consistently takes 0 milliseconds to complete (as expected), and with the pragma it usually takes 0 as well. The problem is that with the omp pragma it occasionally spikes, anywhere from 10 to 32 milliseconds. Every time I have tried parallel for with OpenMP I get these random spikes, so I tried this very basic test. Are the spikes an inherent part of OpenMP, or can they be avoided?

The parallel for gives me great speed boosts on some loops, but these random spikes are too big for me to be able to use it.

That's pretty normal behavior. Sometimes your operating system is busy and needs more time to spawn the new threads.
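To see this directly, you can time the very first parallel region against a later, identical one. Here is a minimal sketch (my own addition, using omp_get_wtime(), which has much finer resolution than clock()); the first region typically pays the thread-creation cost, later ones usually do not:

#include <iostream>
#include <omp.h>

int main() {
    // First parallel region: pays the one-time thread-spawn cost.
    double t0 = omp_get_wtime();
    #pragma omp parallel
    { }                                   // empty region; we only measure startup
    double t1 = omp_get_wtime();

    // Second, identical region: the threads already exist in the pool.
    #pragma omp parallel
    { }
    double t2 = omp_get_wtime();

    std::cout << "first region:  " << (t1 - t0) * 1000.0 << " ms" << std::endl;
    std::cout << "second region: " << (t2 - t1) * 1000.0 << " ms" << std::endl;
    return 0;
}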

I want to complement kukis' answer: I'd also say that the reason for the spikes is the additional overhead that comes with OpenMP.

Furthermore, as you are doing performance-sensitive measurements, I hope that you compiled your code with optimizations turned on. In that case, the loop without OpenMP simply gets optimized out by the compiler, so there is no code at all between time_before and time_after. With OpenMP, however, at least g++ 4.8.1 (-O3) is unable to optimize the loop away: it is still there in the assembly, and contains additional statements to manage the work-sharing. (I cannot try it with VS at the moment.)

So the comparison is not really fair, as the version without OpenMP gets optimized out completely.
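If you want the non-OpenMP version to survive optimization so the comparison is fair, give the loop a side effect the compiler cannot remove. A minimal sketch (the volatile accumulator is my own suggestion, not part of the question's code):

#include <ctime>
#include <iostream>

int main() {
    volatile int sink = 0;                // volatile: the compiler must keep every store

    clock_t before = clock();
    for (int i = 0; i < 4; ++i) {
        sink += i;                        // observable side effect; the loop stays
    }
    clock_t after = clock();

    std::cout << "time elapsed: "
              << (after - before) * 1000 / CLOCKS_PER_SEC
              << " milliseconds" << std::endl;
    return 0;
}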

Edit: You also have to keep in mind that OpenMP doesn't re-create the threads every time; rather, it uses a thread pool. So if you execute an omp construct before your loop, the threads will already be created when the program encounters the next one:

// Dummy loop: Spawn the threads.
#pragma omp parallel for
for(int i = 0; i < 4; i++){
}

int time_before = clock();

// Do the actual measurement. OpenMP re-uses the threads.
#pragma omp parallel for
for(int i = 0; i < 4; i++){
}

int time_after = clock();

In this case, the spikes should vanish.

If "OpenMP parallel spiking", which I would call "parallel overhead", is a concern in your loop, this infers you probably don't have enough workload to parallelize . Parallelization yields a speedup only if you have a sufficient problem size. You already showed an extreme example: no work in a parallelized loop. In such case, you will see highly fluctuating time due to parallel overhead.

The parallel overhead in OpenMP's omp parallel for includes several factors:

  • First, omp parallel for is the sum of omp parallel and omp for.
  • The overhead of spawning or waking threads (though many OpenMP implementations keep a thread pool instead of creating and destroying threads on every omp parallel).
  • Regarding omp for, the overhead of (a) dispatching workloads to worker threads and (b) scheduling (especially if dynamic scheduling is used).
  • The overhead of the implicit barrier at the end of omp for, unless nowait is specified; the barrier at the end of the parallel region itself cannot be removed. See the sketch after this list.
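To make the barrier point concrete, here is a minimal sketch (my own illustration): nowait goes on an omp for nested inside a parallel region, and skipping the barrier is safe here only because the two loops touch independent arrays.

#include <omp.h>

void process(int* a, int* b, int n) {
    #pragma omp parallel
    {
        // nowait removes the implicit barrier after this omp for,
        // so threads proceed to the next loop without waiting.
        #pragma omp for nowait
        for (int i = 0; i < n; ++i) {
            a[i] *= 2;
        }

        // Touches a different array than the first loop, so no
        // synchronization is needed in between.
        #pragma omp for
        for (int i = 0; i < n; ++i) {
            b[i] += 1;
        }
    }   // the barrier ending the parallel region always remains
}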

FYI, in order to measure OpenMP's parallel overhead, the following would be more effective:

#include <ctime>  // clock(), clock_t, CLOCKS_PER_SEC

// Returns the approximate overhead of one "omp parallel for" construct,
// in clock ticks.
double measureOverhead(int tripCount) {
  static const size_t TIMES = 10000;
  int sum = 0;

  // Serial baseline.
  clock_t startTime = clock();
  for (size_t k = 0; k < TIMES; ++k) {
    for (int i = 0; i < tripCount; ++i) {
      sum += i;
    }
  }
  clock_t elapsedTime = clock() - startTime;

  // The same work, wrapped in "omp parallel for".
  clock_t startTime2 = clock();
  for (size_t k = 0; k < TIMES; ++k) {
  #pragma omp parallel for private(sum) // We don't care about the correctness of sum.
                                        // Otherwise, use "reduction(+: sum)".
    for (int i = 0; i < tripCount; ++i) {
      sum += i;
    }
  }
  clock_t elapsedTime2 = clock() - startTime2;

  double parallelOverhead = double(elapsedTime2 - elapsedTime) / double(TIMES);
  return parallelOverhead;
}

Try running such small code many times, then take an average. Also, put at least a minimal workload in the loops. In the above code, parallelOverhead is an approximation of the overhead of OpenMP's omp parallel for construct.
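For example, a driver like the following (hypothetical, with arbitrary trip counts) shows how the overhead behaves as the workload grows:

#include <iostream>

int main() {
    // Probe the overhead at a few trip counts, from tiny to sizable.
    const int tripCounts[] = { 4, 100, 10000 };
    for (int i = 0; i < 3; ++i) {
        std::cout << "tripCount = " << tripCounts[i]
                  << ": overhead ~ " << measureOverhead(tripCounts[i])
                  << " clock ticks per omp parallel for" << std::endl;
    }
    return 0;
}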
