
openmp slower with more than one thread, can't figure out

I have a problem: the following code runs slower with OpenMP:

chunk = nx/nthreads;
int i, j;
for(int t = 0; t < n; t++){
     #pragma omp parallel for default(shared) private(i, j) schedule(static,chunk) 
     for(i = 1; i < nx/2+1; i++){
        for(j = 1; j < nx-1; j++){
            T_c[i][j] =0.25*(T_p[i-1][j] +T_p[i+1][j]+T_p[i][j-1]+T_p[i][j+1]);
            T_c[nx-i+1][j] = T_c[i][j];
        }
    }
    copyT(T_p, T_c, nx);
}
print2file(T_c, nx, file);

The problem is that when I run with more than one thread, the computation time is much longer.

First, your parallel region is restarted on each iteration of the outer loop, thus adding a huge overhead.

Second, half of the threads would just sit there doing nothing, since your chunk size is twice as big as it should be: it is nx/nthreads while the number of iterations of the parallel loop is nx/2, hence there are (nx/2)/(nx/nthreads) = nthreads/2 chunks in total. For example, with nx = 1024 and 8 threads, chunk is 128 while the parallel loop has only 512 iterations, so only 4 of the 8 threads get a chunk to work on. Besides, what you have tried to achieve is simply to replicate the behaviour of schedule(static).

#pragma omp parallel
for (int t = 0; t < n; t++) {
   #pragma omp for schedule(static) 
   for (int i = 1; i < nx/2+1; i++) {
      for (int j = 1; j < nx-1; j++) {
         T_c[i][j] = 0.25*(T_p[i-1][j]+T_p[i+1][j]+T_p[i][j-1]+T_p[i][j+1]);
         T_c[nx-i-1][j] = T_c[i][j];
      }
   }
   #pragma omp single
   copyT(T_p, T_c, nx);
}
print2file(T_c, nx, file);

If you modify copyT to also use parallel for, then the single construct should be removed. You do not need default(shared) as this is the default. You do not need to declare the loop variable of a parallel loop private: even if this variable comes from an outer scope (and hence is implicitly shared in the region), OpenMP automatically makes it private. Simply declare all loop variables in the loop controls and it works automagically with the default sharing rules applied.
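For illustration, here is a minimal sketch of what such a copyT could look like. It assumes the matrices are passed as double ** and that copyT(dst, src, nx) copies src into dst; the signature is only inferred from the call site above, so adapt it to the real declarations. When called from inside the existing parallel region, the pragma below acts as an orphaned worksharing for and splits the copy among the threads:

// Hypothetical copyT that shares the copy work among the threads of the
// enclosing parallel region; the implicit barrier at the end of the
// worksharing loop keeps the next outer iteration from starting early.
void copyT(double **dst, double **src, int nx)
{
    #pragma omp for collapse(2) schedule(static)
    for (int i = 0; i < nx; i++)
        for (int j = 0; j < nx; j++)
            dst[i][j] = src[i][j];
}

With a copyT like this, the call in the loop above becomes a plain copyT(T_p, T_c, nx); with no single around it.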

Second and a half, there is (probably) an error in your inner loop. The second assignment statement should read:

T_c[nx-i-1][j] = T_c[i][j];

(or T_c[nx-i][j] if you do not keep a halo on the lower side); otherwise, when i equals 1, you would be accessing T_c[nx][...], which is outside the bounds of T_c.

Third, a general hint: instead of copying one array into another, use pointers to those arrays and just swap the two pointers at the end of each iteration.
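A minimal sketch of that idea, assuming T_p and T_c are declared as double ** (the pointer type is an assumption, so adapt it to the real declarations):

// The O(nx*nx) copy is replaced by an O(1) pointer swap. One thread swaps
// the shared pointers inside a single construct; the implicit barrier of
// the preceding worksharing for guarantees the computation is finished,
// and the barrier at the end of single keeps the other threads from
// starting the next iteration with the old pointers.
#pragma omp parallel
for (int t = 0; t < n; t++) {
   #pragma omp for schedule(static)
   for (int i = 1; i < nx/2+1; i++)
      for (int j = 1; j < nx-1; j++) {
         T_c[i][j] = 0.25*(T_p[i-1][j]+T_p[i+1][j]+T_p[i][j-1]+T_p[i][j+1]);
         T_c[nx-i-1][j] = T_c[i][j];
      }
   #pragma omp single
   {
      double **tmp = T_p;
      T_p = T_c;
      T_c = tmp;
   }
}

Note that after the last swap the most recent values live in T_p, so the final print2file call would then have to print T_p instead of T_c.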

I see at least three problems that could lead to bad performance in the snippet you posted:

  1. the chunk size is too small to show any gain when divided among threads.
  2. the opening and closing of a parallel region inside a loop may hurt performance.
  3. the two innermost loops appear to be independent, and you parallelize only one of them (losing the possibility to exploit a wider iteration space).

Below is a sketch of some modifications I would make to the code:

// Moving the omp parallel here, the parallel region is opened and
// closed only once instead of n times
#pragma omp parallel default(shared)
for(int t = 0; t < n; t++){
     // With collapse you parallelize over an iteration space composed of
     // (nx/2)*(nx-2) elements instead of only nx/2. As the iteration space
     // is small and the work per iteration modest, a static schedule is
     // likely the best option, since it adds the least scheduling overhead
     #pragma omp for collapse(2) schedule(static)
     for(int i = 1; i < nx/2+1; i++){
        for(int j = 1; j < nx-1; j++){
            T_c[i][j] = 0.25*(T_p[i-1][j]+T_p[i+1][j]+T_p[i][j-1]+T_p[i][j+1]);
            T_c[nx-i-1][j] = T_c[i][j]; // nx-i+1 would index out of bounds (see above)
        }
    }
    // Inside the parallel region copyT must not be executed by every thread
    // at once: let a single thread do it; the implicit barrier at the end of
    // single also keeps the next iteration from starting before the copy is done
    #pragma omp single
    copyT(T_p, T_c, nx);
}
print2file(T_c, nx, file);
