
Why is my parallel code slower than sequential code?

I have implemented parallel merge sort in C using OpenMP. The parallel version takes 3.9 seconds, which is actually slower than the sequential version of the same code (3.6 seconds). I am trying to optimise the code as far as possible but cannot improve the speedup. Can you please help out with this? Thanks.

 void partition(int arr[], int arr1[], int low, int high, int thread_count)
 {
    int tid, mid;

#pragma omp if
    if (low < high)
    {
        if (thread_count == 1)
        {
            mid = (low + high) / 2;
            partition(arr, arr1, low, mid, thread_count);
            partition(arr, arr1, mid + 1, high, thread_count);
            sort(arr, arr1, low, mid, high);
        }
        else
        {
            #pragma omp parallel num_threads(thread_count)
            {
                mid = (low + high) / 2;
                #pragma omp parallel sections
                {
                    #pragma omp section
                    {
                        partition(arr, arr1, low, mid, thread_count / 2);
                    }
                    #pragma omp section
                    {
                        partition(arr, arr1, mid + 1, high, thread_count / 2);
                    }
                }
            }
            sort(arr, arr1, low, mid, high);
        }
    }
 }

As was correctly noted, there are several mistakes in your code that prevent it from executing correctly, so I would first suggest reviewing those errors.

Anyhow, considering only how OpenMP performance scales with the number of threads, an implementation based on task directives might be a better fit, as it overcomes the limit already pointed out by a previous answer:

Since the sections directive only has two sections, I think you won't get any benefit from spawning more threads than two in the parallel clause

You can find a sketch of such an implementation below:

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <sys/time.h>

void getTime(double *t) {

  struct timeval tv;

  gettimeofday(&tv, 0);
  *t = tv.tv_sec + (tv.tv_usec * 1e-6);
}

int compare( const void * pa, const void * pb ) {

  const int a = *((const int*) pa);
  const int b = *((const int*) pb);

  return (a-b);
}

void merge(int * array, int * workspace, int low, int mid, int high) {

  int i = low;
  int j = mid + 1;
  int l = low;

  while( (l <= mid) && (j <= high) ) {
    if( array[l] <= array[j] ) {
      workspace[i] = array[l];
      l++;
    } else {
      workspace[i] = array[j];
      j++;
    }
    i++;
  }
  if (l > mid) {
    for(int k=j; k <= high; k++) {
      workspace[i]=array[k];
      i++;
    }
  } else {
    for(int k=l; k <= mid; k++) {
      workspace[i]=array[k];
      i++;
    }
  }
  for(int k=low; k <= high; k++) {
    array[k] = workspace[k];
  }
}

void mergesort_impl(int array[],int workspace[],int low,int high) {

  const int threshold = 1000000;

  if( high - low > threshold  ) {
    int mid = (low+high)/2;
    /* Recursively sort on halves */
#ifdef _OPENMP
#pragma omp task 
#endif
    mergesort_impl(array,workspace,low,mid);
#ifdef _OPENMP
#pragma omp task
#endif
    mergesort_impl(array,workspace,mid+1,high);
#ifdef _OPENMP
#pragma omp taskwait
#endif
    /* Merge the two sorted halves */
#ifdef _OPENMP
#pragma omp task
#endif
    merge(array,workspace,low,mid,high);
#ifdef _OPENMP
#pragma omp taskwait
#endif
  } else if (high - low > 0) {
    /* Coarsen the base case */
    qsort(&array[low],high-low+1,sizeof(int),compare);
  }

}

void mergesort(int array[],int workspace[],int low,int high) {
  #ifdef _OPENMP
  #pragma omp parallel
  #endif
  {
#ifdef _OPENMP
#pragma omp single nowait
#endif
    mergesort_impl(array,workspace,low,high);
  }
}

const size_t largest = 100000000;
const size_t length  = 10000000;

int main(int argc, char *argv[]) {

  int * array = NULL;
  int * workspace = NULL;

  double start,end;

  printf("Largest random number generated: %d \n",RAND_MAX);
  printf("Largest random number after truncation: %zu \n",largest);
  printf("Array size: %zu \n",length);
  /* Allocate and initialize random vector */
  array     = (int*) malloc(length*sizeof(int));
  workspace = (int*) malloc(length*sizeof(int));
  for( int ii = 0; ii < length; ii++)
    array[ii] = rand()%largest;
  /* Sort */  
  getTime(&start);
  mergesort(array,workspace,0,length-1);
  getTime(&end);
  printf("Elapsed time sorting: %g sec.\n", end-start);
  /* Check result */
  for( int ii = 1; ii < length; ii++) {
    if( array[ii] < array[ii-1] ) printf("Error:\n%d %d\n%d %d\n",ii-1,array[ii-1],ii,array[ii]);
  }
  free(array);
  free(workspace);
  return 0;
}

Notice that if you seek performance you also have to guarantee that the base case of your recursion is coarse enough to avoid substantial overhead from recursive function calls (that is what the threshold constant above is for). Other than that, I would suggest profiling your code so you get a good hint of which parts are really worth optimizing.

It took some figuring out, which is a bit embarrassing, since once you see it, the answer is so simple.

As it stands in the question, the program doesn't work correctly; instead, on some runs it randomly duplicates some numbers and loses others. This appears to be a purely parallel bug: it does not arise when running the program with thread_count == 1.

The pragma "parallel sections" is a combined parallel and sections directive, which in this case means that it starts a second parallel region inside the previous one. Parallel regions inside other parallel regions are fine, but I think most implementations don't give you extra threads when they encounter a nested parallel region.

The fix is to replace

 #pragma omp parallel sections

with

 #pragma omp sections

After this fix, the program gives correct answers, and on a two-core system, sorting a million numbers, I get the following timings.

One thread:

time taken: 0.378794

Two threads:

time taken: 0.203178

Since the sections directive only has two sections, I think you won't get any benefit from spawning more than two threads in the parallel clause, so change num_threads(thread_count) to num_threads(2).

But because at least the two implementations I tried do not spawn new threads for nested parallel regions, the program as it stands doesn't scale beyond two threads.
