Why does my OpenMP code perform worse than the serial code?

Question

I am doing a simple Pi calculation in which I parallelize the loop that generates random numbers and increments count. The serial (non-OpenMP) code performs better than the OpenMP code. Here are some measurements I took. Both programs are provided below.

Compiled the serial code as: gcc pi.c -O3

Compiled the OpenMP code as: gcc pi_omp.c -O3 -fopenmp

What could be the problem?

# Iterations = 60000000

Serial Time = 0.893912

OpenMP 1 Threads Time = 0.876654
OpenMP 2 Threads Time = 23.8537
OpenMP 4 Threads Time = 7.72415

Serial Code:

/* Program to compute Pi using Monte Carlo methods */
/* from: http://www.dartmouth.edu/~rc/classes/soft_dev/C_simple_ex.html */

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <string.h>
#include <time.h>
#include <sys/time.h>
#define SEED 35791246

int main(int argc, char *argv[])
{
  int niter=0;
  double x,y;
  int i;
  long count=0; /* # of points in the 1st quadrant of unit circle */
  double z;
  double pi;

  printf("Enter the number of iterations used to estimate pi: ");
  scanf("%d",&niter);

  /* initialize random numbers */
  srand(SEED);
  count=0;
  struct timeval start, end;
  gettimeofday(&start, NULL);
  for ( i=0; i<niter; i++) {
    x = (double)rand()/RAND_MAX;
    y = (double)rand()/RAND_MAX;
    z = x*x+y*y;
    if (z<=1) count++;
  }
  pi=(double)count/niter*4;

  gettimeofday(&end, NULL);
  double t2 = end.tv_sec + (end.tv_usec/1000000.0);
  double t1 = start.tv_sec + (start.tv_usec/1000000.0);

  printf("Time: %lg\n", t2 - t1);

  printf("# of trials= %d , estimate of pi is %lg \n",niter,pi);
  return 0;
}

OpenMP Parallel Code:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
#include <time.h>
#include <sys/time.h>
#define SEED 35791246
/*
from: http://www.dartmouth.edu/~rc/classes/soft_dev/C_simple_ex.html
 */
#define CHUNKSIZE 500
int main(int argc, char *argv[]) {

  int chunk = CHUNKSIZE;
  int niter=0;
  double x,y;
  int i;
  long count=0; /* # of points in the 1st quadrant of unit circle */
  double z;
  double pi;

  int nthreads, tid;

  printf("Enter the number of iterations used to estimate pi: ");
  scanf("%d",&niter);

  /* initialize random numbers */
  srand(SEED);
  struct timeval start, end;

  gettimeofday(&start, NULL);
  #pragma omp parallel shared(chunk) private(tid,i,x,y,z) reduction(+:count)  
  {                                                                                                           
    /* Obtain and print thread id */
    tid = omp_get_thread_num();
    //printf("Hello World from thread = %d\n", tid);

    /* Only master thread does this */
    if (tid == 0)
    {
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
    }

    #pragma omp for schedule(dynamic,chunk)                                                                       
    for ( i=0; i<niter; i++) {                                                                              
      x = (double)rand()/RAND_MAX;                                                                          
      y = (double)rand()/RAND_MAX;                                                                          
      z = x*x+y*y;                                                                                          
      if (z<=1) count++;                                                                                    
    }                                                                                                       
  }                                                                                                           

  gettimeofday(&end, NULL);
  double t2 = end.tv_sec + (end.tv_usec/1000000.0);
  double t1 = start.tv_sec + (start.tv_usec/1000000.0);

  printf("Time: %lg\n", t2 - t1);

  pi=(double)count/niter*4;                                                                                   
  printf("# of trials= %d, threads used: %d, estimate of pi is %lg \n",niter,nthreads, pi);
  return 0;
}

Answer 1

There are several possible causes here. OpenMP takes roughly 10K to 100K cycles to start a parallel loop, so performance gains from OpenMP are non-trivial to achieve.

Beyond that, there is the additional problem that rand is not re-entrant: http://man7.org/linux/man-pages/man3/rand.3.html

So most likely rand can only be called by one thread at a time. Since your loop does little else, the OpenMP version is essentially single-threaded, with additional contention overhead every time rand is called; hence the dramatic slowdown.
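To make "not re-entrant" concrete, here is a minimal sketch contrasting a generator that keeps its state in a hidden global with one that takes its state from the caller. The names shared_next and reentrant_next are hypothetical, and the linear congruential constants are only illustrative; this is not glibc's actual implementation of rand.

```c
#include <stdint.h>

/* Hypothetical non-reentrant generator: the state is a hidden global,
   so every caller (and every thread) reads and writes the same variable.
   Used from several threads it needs a lock, or it races. */
static uint32_t global_state = 1;

uint32_t shared_next(void)
{
    global_state = global_state * 1103515245u + 12345u;
    return (global_state >> 16) & 0x7fff;
}

/* Reentrant variant: the caller owns the state and passes it in,
   so two threads with separate state variables never interfere. */
uint32_t reentrant_next(uint32_t *state)
{
    *state = *state * 1103515245u + 12345u;
    return (*state >> 16) & 0x7fff;
}
```

With the second form, each thread can keep its own state variable on its stack, which is the same idea rand_r() exposes through its seed pointer.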

Answer 2

rand() is not re-entrant. When called from several threads it will either not work properly, crash, or only be callable from one thread at a time. Libraries like glibc will often serialize or use TLS for legacy non-re-entrant functions rather than have them randomly crash when they get used in multi-threaded code.

Try the re-entrant form, rand_r():

tid = omp_get_thread_num();
unsigned int seed = tid;
...
x = (double)rand_r(&seed)/RAND_MAX;

I think you'll find it's much faster.

Notice how I set the seed to tid. You might ask, why not initialize the seed to SEED? Given the same seed, rand_r() will produce the same sequence of numbers. If every thread uses the same series of pseudo-random numbers, it defeats the point of doing more iterations! You've got to get each thread to use different numbers.
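Putting the pieces together, the loop from the question might look roughly like the sketch below. The helper name estimate_pi, the choice of SEED + tid as the per-thread seed, and schedule(static) are assumptions made for illustration, not part of the original code; the _OPENMP guard only lets the same file build without -fopenmp.

```c
#define _POSIX_C_SOURCE 200809L  /* asks the C library to declare rand_r() */
#include <assert.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define SEED 35791246

/* Estimate pi by sampling niter points in the unit square.
   estimate_pi is a hypothetical helper name, not from the question. */
double estimate_pi(long niter)
{
    long count = 0;  /* points that fall inside the quarter circle */
    assert(niter > 0);

    #pragma omp parallel reduction(+:count)
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        /* Private generator state: each thread starts from a distinct seed
           (assumption: SEED + tid; the answer above uses tid alone). */
        unsigned int seed = SEED + (unsigned int)tid;

        #pragma omp for schedule(static)
        for (long i = 0; i < niter; i++) {
            double x = (double)rand_r(&seed) / RAND_MAX;
            double y = (double)rand_r(&seed) / RAND_MAX;
            if (x * x + y * y <= 1.0)
                count++;
        }
    }
    return 4.0 * (double)count / (double)niter;
}
```

Calling estimate_pi(60000000) from main and compiling with gcc -O3 -fopenmp would reproduce the question's setup. Because every thread now owns its seed, no call touches shared generator state, and the reduction clause combines the per-thread counts at the end.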

Answer 1
1 2017-04-16 22:13:16

Answer 2
1 Accepted 2017-04-16 22:48:44