Why is my parallel code slower than my sequential code?

Question

I have implemented a parallel merge sort in C using OpenMP. The parallel version takes 3.9 seconds, which is slower than the sequential version of the same code (3.6 seconds). I am trying to optimise the code as far as possible but can't improve the speedup. Can you please help out with this? Thanks.

void partition(int arr[],int arr1[],int low,int high,int thread_count)
{
    int tid,mid;

    #pragma omp if
    if(low<high)
    {
        if(thread_count==1)
        {
            mid=(low+high)/2;
            partition(arr,arr1,low,mid,thread_count);
            partition(arr,arr1,mid+1,high,thread_count);
            sort(arr,arr1,low,mid,high);
        }
        else
        {
            #pragma omp parallel num_threads(thread_count)
            {
                mid=(low+high)/2;
                #pragma omp parallel sections
                {
                    #pragma omp section
                    {
                        partition(arr,arr1,low,mid,thread_count/2);
                    }
                    #pragma omp section
                    {
                        partition(arr,arr1,mid+1,high,thread_count/2);
                    }
                }
            }
            sort(arr,arr1,low,mid,high);
        }
    }
}

Answer 1

As was correctly noted, there are several mistakes in your code that prevent it from executing correctly, so I would first suggest reviewing those errors.

Anyhow, considering only how OpenMP performance scales with the number of threads, an implementation based on task directives may fit better, as it overcomes the limit already pointed out in a previous answer:

Since the sections directive only has two sections, I think you won't get any benefit from spawning more threads than two in the parallel clause

You can find a sketch of such an implementation below:

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <sys/time.h>

void getTime(double *t) {

  struct timeval tv;

  gettimeofday(&tv, 0);
  *t = tv.tv_sec + (tv.tv_usec * 1e-6);
}

int compare( const void * pa, const void * pb ) {

  const int a = *((const int*) pa);
  const int b = *((const int*) pb);

  /* Compare without subtracting, which could overflow for large values */
  return (a > b) - (a < b);
}

void merge(int * array, int * workspace, int low, int mid, int high) {

  int i = low;
  int j = mid + 1;
  int l = low;

  while( (l <= mid) && (j <= high) ) {
    if( array[l] <= array[j] ) {
      workspace[i] = array[l];
      l++;
    } else {
      workspace[i] = array[j];
      j++;
    }
    i++;
  }
  if (l > mid) {
    for(int k=j; k <= high; k++) {
      workspace[i]=array[k];
      i++;
    }
  } else {
    for(int k=l; k <= mid; k++) {
      workspace[i]=array[k];
      i++;
    }
  }
  for(int k=low; k <= high; k++) {
    array[k] = workspace[k];
  }
}

void mergesort_impl(int array[],int workspace[],int low,int high) {

  const int threshold = 1000000;

  if( high - low > threshold  ) {
    int mid = (low+high)/2;
    /* Recursively sort on halves */
#ifdef _OPENMP
#pragma omp task 
#endif
    mergesort_impl(array,workspace,low,mid);
#ifdef _OPENMP
#pragma omp task
#endif
    mergesort_impl(array,workspace,mid+1,high);
#ifdef _OPENMP
#pragma omp taskwait
#endif
    /* Merge the two sorted halves; a task followed directly by a taskwait adds only overhead, so call merge directly */
    merge(array,workspace,low,mid,high);
  } else if (high - low > 0) {
    /* Coarsen the base case */
    qsort(&array[low],high-low+1,sizeof(int),compare);
  }

}

void mergesort(int array[],int workspace[],int low,int high) {
  #ifdef _OPENMP
  #pragma omp parallel
  #endif
  {
#ifdef _OPENMP
#pragma omp single nowait
#endif
    mergesort_impl(array,workspace,low,high);
  }
}

const size_t largest = 100000000;
const size_t length  = 10000000;

int main(int argc, char *argv[]) {

  int * array = NULL;
  int * workspace = NULL;

  double start,end;

  printf("Largest random number generated: %d \n",RAND_MAX);
  /* largest and length are size_t, so print them with %zu */
  printf("Largest random number after truncation: %zu \n",largest);
  printf("Array size: %zu \n",length);
  /* Allocate and initialize random vector */
  array     = (int*) malloc(length*sizeof(int));
  workspace = (int*) malloc(length*sizeof(int));
  for( int ii = 0; ii < length; ii++)
    array[ii] = rand()%largest;
  /* Sort */  
  getTime(&start);
  mergesort(array,workspace,0,length-1);
  getTime(&end);
  printf("Elapsed time sorting: %g sec.\n", end-start);
  /* Check result */
  for( int ii = 1; ii < length; ii++) {
    if( array[ii] < array[ii-1] ) printf("Error:\n%d %d\n%d %d\n",ii-1,array[ii-1],ii,array[ii]);
  }
  free(array);
  free(workspace);
  return 0;
}

Notice that if you are after performance, you also have to make sure the base case of your recursion is coarse enough to avoid substantial overhead from recursive function calls. Other than that, I would suggest profiling your code to get a good idea of which parts are really worth optimizing.
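To see how much the base-case size matters, one option is to time the sort for several cutoffs. What follows is a minimal sequential sketch, not part of the code above: cmp_int, sort_with_cutoff and time_cutoff are hypothetical names, and the timer uses C11 timespec_get instead of gettimeofday.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: three-way integer comparison for qsort. */
static int cmp_int(const void *pa, const void *pb) {
  int a = *(const int *)pa, b = *(const int *)pb;
  return (a > b) - (a < b);
}

/* Merge sort on array[low..high] that hands ranges of at most `cutoff`
   elements to qsort instead of recursing further. */
void sort_with_cutoff(int *array, int *workspace, int low, int high, int cutoff) {
  assert(cutoff >= 1);
  if (high - low + 1 <= cutoff) {
    if (high > low)
      qsort(&array[low], (size_t)(high - low + 1), sizeof(int), cmp_int);
    return;
  }
  int mid = low + (high - low) / 2;
  sort_with_cutoff(array, workspace, low, mid, cutoff);
  sort_with_cutoff(array, workspace, mid + 1, high, cutoff);
  /* Merge the two sorted halves through the workspace. */
  int i = low, j = mid + 1, k = low;
  while (i <= mid && j <= high)
    workspace[k++] = (array[i] <= array[j]) ? array[i++] : array[j++];
  while (i <= mid)  workspace[k++] = array[i++];
  while (j <= high) workspace[k++] = array[j++];
  memcpy(&array[low], &workspace[low], (size_t)(high - low + 1) * sizeof(int));
}

/* Wall-clock seconds spent sorting a private copy of input[0..n-1] with the
   given cutoff. Returns -1.0 if allocation fails. */
double time_cutoff(const int *input, int n, int cutoff) {
  assert(n > 0);
  int *copy = malloc((size_t)n * sizeof(int));
  int *work = malloc((size_t)n * sizeof(int));
  if (!copy || !work) { free(copy); free(work); return -1.0; }
  memcpy(copy, input, (size_t)n * sizeof(int));
  struct timespec t0, t1;
  timespec_get(&t0, TIME_UTC);
  sort_with_cutoff(copy, work, 0, n - 1, cutoff);
  timespec_get(&t1, TIME_UTC);
  free(copy);
  free(work);
  return (double)(t1.tv_sec - t0.tv_sec) + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
}
```

Calling time_cutoff(input, n, c) on the same input for a few values of c (say 16, 1024 and 1000000) gives a rough picture of where the recursion overhead stops mattering. A real profiler will tell you more.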

Answer 2

It took some figuring out, which is a bit embarrassing, since once you see it, the answer is so simple.

As it stands in the question, the program doesn't work correctly: on some runs it randomly duplicates some numbers and loses others. This appears to be a purely parallel error that doesn't arise when running the program with thread_count == 1.

The pragma "parallel sections" is a combined parallel and sections directive, which in this case means that it starts a second parallel region inside the previous one. Parallel regions inside other parallel regions are fine, but I think most implementations don't give you extra threads when they encounter a nested parallel region.
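If you want to check what your own OpenMP runtime does with a nested region, a small probe along these lines may help. It is only a sketch: inner_team_size is a hypothetical name, and without OpenMP support the function simply reports 1.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Opens a parallel region inside another one and returns the number of
   threads the inner region was given. Returns 1 when built without OpenMP. */
int inner_team_size(void) {
  int inner = 1;
#ifdef _OPENMP
#pragma omp parallel num_threads(2)
  {
#pragma omp single
    {
      /* Nested region: the runtime decides how many threads it gets. */
#pragma omp parallel num_threads(2)
      {
#pragma omp single
        inner = omp_get_num_threads();
      }
    }
  }
#endif
  assert(inner >= 1);
  return inner;
}
```

On runtimes where nested parallelism is disabled by default, I would expect this to return 1 even though two threads were requested. Setting OMP_NESTED=true (older runtimes) or OMP_MAX_ACTIVE_LEVELS=2 (newer ones) in the environment may change that, depending on the implementation.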

The fix is to replace

 #pragma omp parallel sections

with

 #pragma omp sections

After this fix, the program starts to give correct answers. On a two-core system, sorting a million numbers, I get the following timings.

One thread:

time taken: 0.378794

Two threads:

time taken: 0.203178

Since the sections directive only has two sections, I think you won't get any benefit from spawning more than two threads in the parallel clause, so change num_threads(thread_count) to num_threads(2).

But because at least the two implementations I tried are not able to spawn new threads for nested parallel regions, the program as it stands doesn't scale to more than two threads.
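Putting both changes together, the function from the question might look roughly like the sketch below. The names partition_fixed and merge_runs are hypothetical: the question does not show its sort() routine, so merge_runs is only a stand-in for it, and mid is computed once before the parallel region so that no thread writes to it inside.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the question's sort(): merges the sorted runs
   arr[low..mid] and arr[mid+1..high] through the scratch buffer tmp. */
static void merge_runs(int arr[], int tmp[], int low, int mid, int high) {
  int i = low, j = mid + 1, k = low;
  while (i <= mid && j <= high)
    tmp[k++] = (arr[i] <= arr[j]) ? arr[i++] : arr[j++];
  while (i <= mid)  tmp[k++] = arr[i++];
  while (j <= high) tmp[k++] = arr[j++];
  memcpy(&arr[low], &tmp[low], (size_t)(high - low + 1) * sizeof(int));
}

void partition_fixed(int arr[], int tmp[], int low, int high, int thread_count) {
  assert(thread_count >= 1);
  if (low >= high) return;
  int mid = low + (high - low) / 2;  /* read-only inside the parallel region */
  if (thread_count == 1) {
    partition_fixed(arr, tmp, low, mid, 1);
    partition_fixed(arr, tmp, mid + 1, high, 1);
  } else {
    /* One parallel region with two threads, and a plain sections
       directive inside it instead of a second, nested parallel region. */
#pragma omp parallel num_threads(2)
    {
#pragma omp sections
      {
#pragma omp section
        {
          partition_fixed(arr, tmp, low, mid, thread_count / 2);
        }
#pragma omp section
        {
          partition_fixed(arr, tmp, mid + 1, high, thread_count / 2);
        }
      }
    }
  }
  merge_runs(arr, tmp, low, mid, high);
}
```

The two sections work on disjoint halves of arr and tmp, so they do not interfere with each other. The recursive calls still open nested parallel regions, so whether more than two threads ever run depends on the runtime's nested-parallelism settings, as described above.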

2 answers

Answer 1
3 2012-09-16 17:28:55

Answer 2
2 (accepted) 2012-09-16 16:16:14
