OpenMP is slower with multiple threads and I can't figure out why

Question

I have a problem: my code below runs slower when I use OpenMP:

chunk = nx/nthreads;
int i, j;
for(int t = 0; t < n; t++){
     #pragma omp parallel for default(shared) private(i, j) schedule(static,chunk) 
     for(i = 1; i < nx/2+1; i++){
        for(j = 1; j < nx-1; j++){
            T_c[i][j] =0.25*(T_p[i-1][j] +T_p[i+1][j]+T_p[i][j-1]+T_p[i][j+1]);
            T_c[nx-i+1][j] = T_c[i][j];
        }
    }
    copyT(T_p, T_c, nx);
}
print2file(T_c, nx, file);

The problem is that the computation takes longer when I run it with multiple threads.

Answer 1

First, you restart the parallel region on every iteration of the outer loop, which adds a huge overhead.

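Second, half of the threads just sit there doing nothing, because your chunk size is twice what it should be: it is nx/nthreads, while the parallel loop has nx/2 iterations, so there are only (nx/2)/(nx/nthreads) = nthreads/2 chunks. Besides, what you are trying to achieve simply replicates the behavior of schedule(static).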
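The chunk arithmetic in the second point can be checked with a small sketch. The helper names chunk_count and busy_threads are hypothetical, and the sketch assumes that schedule(static, chunk) deals chunks out to the threads round-robin, so a thread does work only if at least one chunk reaches it.

```c
/* Number of chunks produced for a loop with `iterations` iterations
 * when the chunk size is `chunk`: ceil(iterations / chunk). */
static int chunk_count(int iterations, int chunk)
{
    return (iterations + chunk - 1) / chunk;
}

/* Number of threads that receive at least one chunk when the chunks are
 * handed out round-robin to a team of `nthreads` threads. */
static int busy_threads(int iterations, int chunk, int nthreads)
{
    int chunks = chunk_count(iterations, chunk);
    return chunks < nthreads ? chunks : nthreads;
}
```

For example, with nx = 1000 and nthreads = 8 the chunk size is 125 and the loop has 500 iterations, so chunk_count(500, 125) is 4 and busy_threads(500, 125, 8) is 4: half of the team gets no work.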

#pragma omp parallel
for (int t = 0; t < n; t++) {
   #pragma omp for schedule(static) 
   for (int i = 1; i < nx/2+1; i++) {
      for (int j = 1; j < nx-1; j++) {
         T_c[i][j] = 0.25*(T_p[i-1][j]+T_p[i+1][j]+T_p[i][j-1]+T_p[i][j+1]);
         T_c[nx-i-1][j] = T_c[i][j];
      }
   }
   #pragma omp single
   copyT(T_p, T_c, nx);
}
print2file(T_c, nx, file);

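If you modify copyT to use a worksharing loop as well (an orphaned omp for, because it is now called from inside the parallel region, rather than a nested parallel for), the single construct should be removed. You do not need default(shared), as this is the default. You do not need to declare the loop variable of a parallel loop private: even if this variable comes from an outer scope (and is therefore implicitly shared in the region), OpenMP automatically makes it private. Just declare all loop variables in the loop controls, and it works automatically with the default sharing rules applied.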
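A minimal sketch of the "declare loop variables in the loop controls" advice follows. The function fill_table and its flat row-major layout are hypothetical and only serve the illustration. Note that in C only the variable of the loop associated with the omp for is made private automatically; an inner-loop variable declared outside the region would stay shared, which is why declaring j in the loop control matters.

```c
/* Fills a rows-by-cols table (row-major) with out[i][j] = i * j. */
static void fill_table(int *out, int rows, int cols)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < rows; i++) {        /* i: private, loop variable of omp for */
        for (int j = 0; j < cols; j++) {    /* j: declared inside the region, so private */
            out[i * cols + j] = i * j;
        }
    }
}
```

No default(shared) or private clause is needed here: out, rows and cols are shared by default, and both loop variables are private to each thread.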

Second and a half, there is (possibly) an error in your inner loop. The second assignment statement should read:

T_c[nx-i-1][j] = T_c[i][j];

(or T_c[nx-i][j] if you do not keep a halo on the lower side); otherwise, when i equals 1, you would be accessing T_c[nx][...], which is outside the bounds of T_c.
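The index argument can be made concrete with a small sketch. It assumes that T_c has exactly nx rows indexed 0..nx-1, with rows 0 and nx-1 acting as the halo; the helpers mirror_row and row_in_bounds are hypothetical names.

```c
/* Row that mirrors row i across the horizontal midline of a grid
 * whose rows are indexed 0..nx-1. */
static int mirror_row(int nx, int i)
{
    return nx - 1 - i;
}

/* Returns 1 if `row` is a valid row index for an nx-row grid, else 0. */
static int row_in_bounds(int nx, int row)
{
    return row >= 0 && row < nx;
}
```

Under that assumption, for nx = 8 and i = 1 the mirrored row is mirror_row(8, 1) = 6, the last interior row, whereas the original expression nx-i+1 evaluates to 8, for which row_in_bounds(8, 8) is 0.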

Third, a general hint: instead of copying one array into the other, use pointers to these arrays and swap the two pointers at the end of each iteration.
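The pointer-swap idea could look roughly like the sketch below. It is not the asker's code: the function jacobi, the flat row-major buffers, and the full-interior sweep (no mirrored half) are assumptions made to keep the example short. Both buffers are assumed to hold the same boundary values, since the boundary is never rewritten.

```c
/* Runs `steps` Jacobi sweeps on an nx-by-nx grid stored row-major.
 * `a` holds the initial state, `b` is a second buffer with the same
 * boundary values. Returns whichever buffer holds the final state. */
static double *jacobi(double *a, double *b, int nx, int steps)
{
    double *prev = a;   /* shared: state from the previous time step */
    double *curr = b;   /* shared: state being computed */

    #pragma omp parallel
    for (int t = 0; t < steps; t++) {
        #pragma omp for schedule(static)
        for (int i = 1; i < nx - 1; i++) {
            for (int j = 1; j < nx - 1; j++) {
                curr[i * nx + j] = 0.25 * (prev[(i - 1) * nx + j] +
                                           prev[(i + 1) * nx + j] +
                                           prev[i * nx + j - 1] +
                                           prev[i * nx + j + 1]);
            }
        }
        /* The implicit barrier after omp for guarantees the sweep is done. */
        #pragma omp single
        {
            /* One thread swaps the roles of the buffers; no data is copied. */
            double *tmp = prev;
            prev = curr;
            curr = tmp;
        }
        /* The implicit barrier after single publishes the swap to all threads. */
    }
    return prev;        /* after the last swap, prev points at the newest state */
}
```

The swap replaces an O(nx*nx) copy per time step with a constant-time pointer exchange. The caller has to use the returned pointer, because the newest state alternates between the two buffers.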

Answer 2

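I see at least three issues that could lead to poor performance in the snippet you posted:

The chunk size is too small to show any gain when the work is divided among the threads.
Opening and closing the parallel region inside the loop may hurt performance.
The two innermost loops look independent, yet only one of them is parallelized (losing the chance to exploit a wider iteration space).

Below are some modifications I would make to the code: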

// With omp parallel moved outside the time loop, the parallel
// region is opened and closed only once, not n times
#pragma omp parallel default(shared)
for(int t = 0; t < n; t++){
     // With collapse you parallelize over an iteration space of
     // (nx/2)*(nx-2) iterations, not only the nx/2 of the outer loop
     #pragma omp for collapse(2) schedule(static)
     for(int i = 1; i < nx/2+1; i++){
        for(int j = 1; j < nx-1; j++){
            T_c[i][j] =0.25*(T_p[i-1][j] +T_p[i+1][j]+T_p[i][j-1]+T_p[i][j+1]);
            T_c[nx-i+1][j] = T_c[i][j];
        }
    }
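    // As the iteration space is very small and the work done 
    // at each iteration is not much, static schedule will likely be the best option
    // as it is the one that adds the least overhead for scheduling

    // Assuming copyT is serial, let a single thread run it; the implicit
    // barrier after single keeps the others from starting the next step early
    #pragma omp single
    copyT(T_p, T_c, nx);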
}
print2file(T_c, nx, file);

Answer 1
2 2012-11-13 08:45:07

Answer 2
1 2012-11-13 08:15:21
