
OpenMP slower with more than one thread, can't figure out why

I have a problem: the following code runs slower with OpenMP:

chunk = nx/nthreads;
int i, j;
for(int t = 0; t < n; t++){
    #pragma omp parallel for default(shared) private(i, j) schedule(static,chunk)
    for(i = 1; i < nx/2+1; i++){
        for(j = 1; j < nx-1; j++){
            T_c[i][j] = 0.25*(T_p[i-1][j] + T_p[i+1][j] + T_p[i][j-1] + T_p[i][j+1]);
            T_c[nx-i+1][j] = T_c[i][j];
        }
    }
    copyT(T_p, T_c, nx);
}
print2file(T_c, nx, file);

The problem is that when I run with more than one thread, the computation time gets much longer.

First, your parallel region is restarted on each iteration of the outer loop, thus adding a huge overhead.

Second, half of the threads would just be sitting there doing nothing, since your chunk size is twice as big as it should be: it is nx/nthreads, while the number of iterations of the parallel loop is only nx/2, hence there are (nx/2)/(nx/nthreads) = nthreads/2 chunks in total. For example, with nx = 100 and 4 threads the chunk size is 25 but the loop has only 50 iterations, so there are just 2 chunks and 2 of the 4 threads get no work at all. Besides, what you have tried to achieve is simply to replicate the behaviour of schedule(static).

#pragma omp parallel
for (int t = 0; t < n; t++) {
   #pragma omp for schedule(static) 
   for (int i = 1; i < nx/2+1; i++) {
      for (int j = 1; j < nx-1; j++) {
         T_c[i][j] = 0.25*(T_p[i-1][j]+T_p[i+1][j]+T_p[i][j-1]+T_p[i][j+1]);
         T_c[nx-i-1][j] = T_c[i][j];
      }
   }
   #pragma omp single
   copyT(T_p, T_c, nx);
}
print2file(T_c, nx, file);

If you modify copyT to also use an OpenMP for work-sharing loop, then the single construct should be removed. You do not need default(shared) as this is the default. You do not have to declare the loop variable of a parallel loop private: even if this variable comes from an outer scope (and hence is implicitly shared in the region), OpenMP automatically makes it private. Simply declare all loop variables in the loop controls and it works automagically with the default sharing rules applied.
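For illustration, here is a minimal sketch of what such a copyT could look like. It assumes the arrays are nx-by-nx and passed as double ** with the destination first; the real signature is not shown in the question. The orphaned for directive binds to the enclosing parallel region, so the copy is split among the already-running threads, and the implicit barrier at its end keeps the time steps in sync:

// Sketch only: assumes double ** arrays of size nx-by-nx and that the
// first argument is the destination, as suggested by the call site.
void copyT(double **dst, double **src, int nx)
{
    // Orphaned work-sharing loop: binds to the enclosing parallel region
    #pragma omp for schedule(static)
    for (int i = 0; i < nx; i++)
        for (int j = 0; j < nx; j++)
            dst[i][j] = src[i][j];
}

With this version, the call inside the time loop stays a plain copyT(T_p, T_c, nx); executed by all threads, with no single around it.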

Second and a half, there is (probably) an error in your inner loop. The second assignment statement should read:

T_c[nx-i-1][j] = T_c[i][j];

(or T_c[nx-i][j] if you do not keep a halo on the lower side); otherwise, when i equals 1, you would be accessing T_c[nx][...], which is outside the bounds of T_c .

Third, a general hint: instead of copying one array into another, use pointers to those arrays and just swap the two pointers at the end of each iteration.
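For example, a minimal sketch of the swap (the names are illustrative; it assumes T_p and T_c are dynamically allocated and held through plain pointers such as double **, which is not shown in the question). Inside a parallel region the swap must be done by one thread only, e.g. under single, so that after the implicit barrier every thread sees the same pointers:

// Sketch only: exchanging the pointers replaces the O(nx*nx) element copy
#pragma omp single
{
    double **tmp = T_p;  // remember the old "previous" array
    T_p = T_c;           // the freshly computed values become "previous"
    T_c = tmp;           // reuse the old storage for the next time step
}

Note that this only works if T_p and T_c are pointers to dynamically allocated storage; statically declared 2D arrays cannot be swapped this way.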

I see at least three problems that could lead to bad performance in the snippet you posted:

  1. the chunk size is too small to show any gain when divided among threads.
  2. the opening and closing of a parallel region inside a loop may hurt performance.
  3. the two innermost loops appear to be independent, and you parallelize only one of them (losing a possibility to exploit a wider iteration space).

Below is an outline of some modifications I would make to the code:

// By moving the omp parallel here you open/close the parallel
// region only once, not n times
#pragma omp parallel default(shared)
for(int t = 0; t < n; t++){
    // With collapse(2) you parallelize over an iteration space that is
    // composed of (nx/2)*(nx-2) elements, not only nx/2.
    // As each iteration does little work, a static schedule is likely the
    // best option, since it adds the least scheduling overhead.
    #pragma omp for collapse(2) schedule(static)
    for(int i = 1; i < nx/2+1; i++){
        for(int j = 1; j < nx-1; j++){
            T_c[i][j] = 0.25*(T_p[i-1][j] + T_p[i+1][j] + T_p[i][j-1] + T_p[i][j+1]);
            T_c[nx-i-1][j] = T_c[i][j]; // mirror index fixed as explained above
        }
    }
    // Let a single thread perform the copy; the others wait at the implicit barrier
    #pragma omp single
    copyT(T_p, T_c, nx);
}
print2file(T_c, nx, file);
