
Why does my OpenMP code perform worse than the serial version?

I am doing a simple Pi calculation where I parallelize the loop in which random numbers are generated and count is incremented. The serial (non-OpenMP) code performs better than the OpenMP code. Here are some measurements I took. Both codes are also provided below.

Compiled the serial code as: gcc pi.c -O3

Compiled the OpenMP code as: gcc pi_omp.c -O3 -fopenmp

What could be the problem?

# Iterations = 60000000

Serial Time = 0.893912

OpenMP 1 Threads Time = 0.876654
OpenMP 2 Threads Time = 23.8537
OpenMP 4 Threads Time = 7.72415

Serial Code:

/* Program to compute Pi using Monte Carlo methods */
/* from: http://www.dartmouth.edu/~rc/classes/soft_dev/C_simple_ex.html */

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <string.h>
#include <time.h>
#include <sys/time.h>
#define SEED 35791246

int main(int argc, char *argv[])
{
  int niter=0;
  double x,y;
  int i;
  long count=0; /* # of points in the 1st quadrant of unit circle */
  double z;
  double pi;

  printf("Enter the number of iterations used to estimate pi: ");
  scanf("%d",&niter);

  /* initialize random numbers */
  srand(SEED);
  count=0;
  struct timeval start, end;
  gettimeofday(&start, NULL);
  for ( i=0; i<niter; i++) {
    x = (double)rand()/RAND_MAX;
    y = (double)rand()/RAND_MAX;
    z = x*x+y*y;
    if (z<=1) count++;
  }
  pi=(double)count/niter*4;

  gettimeofday(&end, NULL);
  double t2 = end.tv_sec + (end.tv_usec/1000000.0);
  double t1 = start.tv_sec + (start.tv_usec/1000000.0);

  printf("Time: %lg\n", t2 - t1);

  printf("# of trials= %d , estimate of pi is %lg \n",niter,pi);
  return 0;
}

OpenMP Parallel Code:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
#include <time.h>
#include <sys/time.h>
#define SEED 35791246
/*
from: http://www.dartmouth.edu/~rc/classes/soft_dev/C_simple_ex.html
 */
#define CHUNKSIZE 500
int main(int argc, char *argv[]) {

  int chunk = CHUNKSIZE;
  int niter=0;
  double x,y;
  int i;
  long count=0; /* # of points in the 1st quadrant of unit circle */
  double z;
  double pi;

  int nthreads, tid;

  printf("Enter the number of iterations used to estimate pi: ");
  scanf("%d",&niter);

  /* initialize random numbers */
  srand(SEED);
  struct timeval start, end;

  gettimeofday(&start, NULL);
  #pragma omp parallel shared(chunk) private(tid,i,x,y,z) reduction(+:count)  
  {                                                                                                           
    /* Obtain and print thread id */
    tid = omp_get_thread_num();
    //printf("Hello World from thread = %d\n", tid);

    /* Only master thread does this */
    if (tid == 0)
    {
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
    }

    #pragma omp for schedule(dynamic,chunk)                                                                       
    for ( i=0; i<niter; i++) {                                                                              
      x = (double)rand()/RAND_MAX;                                                                          
      y = (double)rand()/RAND_MAX;                                                                          
      z = x*x+y*y;                                                                                          
      if (z<=1) count++;                                                                                    
    }                                                                                                       
  }                                                                                                           

  gettimeofday(&end, NULL);
  double t2 = end.tv_sec + (end.tv_usec/1000000.0);
  double t1 = start.tv_sec + (start.tv_usec/1000000.0);

  printf("Time: %lg\n", t2 - t1);

  pi=(double)count/niter*4;                                                                                   
  printf("# of trials= %d, threads used: %d, estimate of pi is %lg \n",niter,nthreads, pi);
  return 0;
}

There are several possible causes in this particular case. OpenMP takes on the order of 10K to 100K cycles to start a parallel loop, so getting a performance improvement out of it is non-trivial.

On top of that, there is the additional problem that rand is not re-entrant: http://man7.org/linux/man-pages/man3/rand.3.html

So most likely rand can only be called by one thread at a time. Since your loop does little else, your OpenMP version is essentially single-threaded, with additional contention overhead every time rand is called, hence the dramatic slowdown.

rand() is not re-entrant. It will either not work properly, crash, or only be possible to call from one thread at a time. Libraries like glibc will often serialize or use TLS for legacy non-re-entrant functions rather than have them randomly crash when they get used in multi-threaded code.

Try the re-entrant form, rand_r():

tid = omp_get_thread_num();
unsigned int seed = tid;
...
x = (double)rand_r(&seed)/RAND_MAX;

I think you'll find it's much faster.

Notice how I set the seed to the tid. You might think, why not initialize the seed to SEED? Given the same seed, rand_r() will produce the same sequence of numbers. If each thread uses the same series of pseudo-random numbers, it defeats the point of doing more iterations! You've got to get each thread to use different numbers.

