
OpenMP for loop

I am new to OpenMP, and my task is to improve the code below in two different ways:

// SIZE = 400, CHUNK_SIZE = 100 and there are 4 threads
#pragma omp parallel for schedule(static, CHUNK_SIZE)
  for(int i=0; i<SIZE; i++){
      for(int k=0; k<i; k++){
          A[i][k] = 42*foo;
      }
  }

My first idea is to change the schedule from static to guided, because the work in the inner loop is unbalanced and grows steadily: with guided, the chunk size starts off large and decreases, which should better handle the load imbalance between iterations. The larger i becomes, the more work the inner loop does. At this point I am not sure whether dynamic might be better than guided.

For the second possibility I have no idea.

Just by looking at the code, you can tell that there are load-balancing problems. IMO you should first test your code with schedule(static, 1), which guarantees the minimal load imbalance between threads (they differ by at most one outer iteration). Then compare it against schedule(dynamic, 1), and verify whether the overhead of dynamic -- it uses an internal locking mechanism to hand out iterations -- is outweighed by the gain from balancing the work among the threads.

If you look carefully, you can see that the work of the inner loop grows in the shape of a triangle (N = SIZE):

 *k/i 0 1 2 3 4 5 ... N-1
 *  0 - x x x x x ... x 
 *  1 - - x x x x ... x 
 *  2 - - - x x x ... x
 *  3 - - - - x x ... x
 *  4 - - - - - x ... x
 *  5 - - - - - - ... x
 *  . - - - - - - ... x
 *  . - - - - - - ... x 
 *N-1 - - - - - - ... -

So you can build your own distribution to guarantee that the thread that performs iteration 0 also performs iteration N-1, that the thread that performs iteration 1 also performs iteration N-2, and so on. In this manner, each such pair of outer iterations amounts to exactly N-1 inner loop iterations per thread. Something like the following:

    int halfSIZE = SIZE >> 1;

    #pragma omp for schedule(static, 1) nowait
    for(int i = 0; i < halfSIZE; i++)
    {
        for(int k = 0; k < i; k++)
            A[i][k] = 42*foo;
    }

    #pragma omp for schedule(static, 1)
    for(int i = SIZE - 1; i >= halfSIZE; i--)
    {
        for(int k = 0; k < i; k++)
            A[i][k] = 42*foo;
    }

Assuming you mean to have the pair of loops contained in a single omp parallel region, this can be a reasonable way to avoid work imbalance, though I don't see that it guarantees much. You could instead put an outer loop over the number of threads and compute, for each thread, the range of i iterations that most closely balances the number of array elements it sets. That can also be a more effective way of maintaining NUMA locality, if that is important for your target.
