
Optimization with Loop Tiling and OpenMP

Below is my function, which I am trying to optimize using OpenMP and loop tiling (also known as loop blocking). However, after I apply loop tiling as shown below, out ends up with the wrong value. Could someone look over my code and point out what is wrong with it? Thank you so much.

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include "utils.h"
const long BLOCK_SIZE = 8*DIM;
int i, j, k,ii,jj,kk, dim = DIM-1;

long compute, out = 1.0, we_need, gimmie;

void work_it_par(long *old, long *new)
{
 we_need = need_func();
 gimmie = gimmie_func();

 #pragma omp parallel for private(i,j,k,ii,jj,kk,compute) \
     firstprivate(we_need, gimmie, dim, old, BLOCK_SIZE) \
     reduction(+:out) num_threads(omp_get_num_procs())
for (ii=1; ii<dim-BLOCK_SIZE; ii+=BLOCK_SIZE) {
  for (jj=1; jj<dim-BLOCK_SIZE; jj+=BLOCK_SIZE) {
    for (kk=1; kk<dim-BLOCK_SIZE; kk+=BLOCK_SIZE) {
      for (i=ii; i<ii+BLOCK_SIZE; i++) {
        for (j=jj; j<jj+BLOCK_SIZE; j++) {
          for (k=kk; k<kk+BLOCK_SIZE; k++) {
            //int temp = i*DIM*DIM+j*DIM+k;
            compute = old[i*DIM*DIM+j*DIM+k] * we_need;
            out += compute / gimmie;
          }
        }
      }

    }
  }
}

printf("AGGR:%ld\n",out);

}

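First of all, const long BLOCK_SIZE = 8*DIM; seems super fishy to me: it makes each tile eight times larger than the whole dimension, so dim-BLOCK_SIZE is negative and the outer loop never runs at all. Maybe replacing the * with a / is closer to what you wanted?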
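To make the effect of the block size concrete, here is a minimal sketch that counts how many times the outermost tile loop from the question would iterate. The function name and the value 500 used for DIM in the note below are assumptions for illustration only; the real DIM is defined in utils.h, which is not shown.

```c
#include <assert.h>

/* Hypothetical helper: counts iterations of the question's outer tile loop.
   dim_macro stands in for DIM (an assumed value, since utils.h is not shown). */
long outer_tile_iterations(long dim_macro, long block_size)
{
    assert(block_size > 0);            /* a zero step would never terminate */
    const long dim = dim_macro - 1;    /* mirrors `dim = DIM-1` in the question */
    long count = 0;
    /* Same header as `for (ii=1; ii<dim-BLOCK_SIZE; ii+=BLOCK_SIZE)` */
    for (long ii = 1; ii < dim - block_size; ii += block_size)
        count++;
    return count;
}
```

With an assumed DIM of 500, outer_tile_iterations(500, 8*500) returns 0, so nothing is ever added to out. With 500/8 it returns 8, and the eight tiles of 62 stop at index 496, leaving the last indexes below dim uncovered.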

Even then, you still have to deal with the boundaries by checking that the i, j and k indexes do not go past their limits, which matters whenever the range is not an exact multiple of the block size. I will let you figure out how to achieve that.
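One possible way to handle the boundaries, shown only as a sketch: let the tile loops run up to the real limit and clamp the end of each tile with a minimum. The names sum_plain, sum_tiled and min_long are made up for this example, and the plain sum over a[] stands in for the question's computation; it also assumes the untiled loops were meant to cover indexes 1 up to, but not including, dim.

```c
#include <assert.h>

/* Smaller of two values; used to clamp a tile's end to the loop limit. */
static long min_long(long a, long b) { return a < b ? a : b; }

/* Reference: untiled sum over the interior [1, n-1) of an n*n*n array. */
long sum_plain(const long *a, long n)
{
    long total = 0;
    for (long i = 1; i < n - 1; i++)
        for (long j = 1; j < n - 1; j++)
            for (long k = 1; k < n - 1; k++)
                total += a[i*n*n + j*n + k];
    return total;
}

/* Tiled version: the tile loops run up to the real limit, and each
   tile's end is clamped so a partial last tile never overshoots. */
long sum_tiled(const long *a, long n, long block)
{
    assert(block > 0);               /* a zero step would never terminate */
    const long hi = n - 1;           /* exclusive upper limit, like `dim` */
    long total = 0;
    for (long ii = 1; ii < hi; ii += block) {
        const long i_end = min_long(ii + block, hi);
        for (long jj = 1; jj < hi; jj += block) {
            const long j_end = min_long(jj + block, hi);
            for (long kk = 1; kk < hi; kk += block) {
                const long k_end = min_long(kk + block, hi);
                for (long i = ii; i < i_end; i++)
                    for (long j = jj; j < j_end; j++)
                        for (long k = kk; k < k_end; k++)
                            total += a[i*n*n + j*n + k];
            }
        }
    }
    return total;
}
```

For any positive block size, including one larger than n, sum_tiled should return the same value as sum_plain, which makes the untiled version a convenient check while experimenting.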

One last point about the algorithm: are you sure your loops have to start from index 1?

Finally, a few notes on OpenMP correctness:

  • Although I see nothing wrong with it, declaring firstprivate(we_need, gimmie, dim, old, BLOCK_SIZE) doesn't make much sense. These variables are only read inside the loop, so they could happily stay shared.
  • I was not sure at first whether num_threads(omp_get_num_procs()) is correct. The clause accepts an integer expression, so it should be valid, but just for "safety" I would tend to separate the function call from the directive, either by calling the function first, storing its result in a constant and using that in the directive, or by calling omp_set_num_threads() before the parallel directive.
  • Once your algorithm is fixed, you might want to consider adding a collapse clause to increase the level of parallelism you achieve here.
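The three notes above could be combined roughly as follows. This is only a sketch under assumptions: sum_tiled_par is a made-up name, scale and divisor stand in for we_need and gimmie, loop variables are declared locally instead of globally, and the omp.h parts are guarded so the file still builds without OpenMP enabled.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Smaller of two values; used to clamp a tile's end to the loop limit. */
static long min_long(long a, long b) { return a < b ? a : b; }

/* Hypothetical parallel tiled reduction over the interior [1, n-1)
   of an n*n*n array. scale and divisor play the roles of we_need and gimmie. */
long sum_tiled_par(const long *a, long n, long block, long scale, long divisor)
{
    assert(block > 0 && divisor != 0);
    const long hi = n - 1;           /* exclusive upper limit, like `dim` */
    long total = 0;
#ifdef _OPENMP
    /* Call the runtime function first, then use the constant in the clause. */
    const int nthreads = omp_get_num_procs();
#endif
    /* Read-only values (a, n, block, scale, divisor, hi) stay shared by default.
       collapse(3) merges the three tile loops into one parallel iteration space. */
    #pragma omp parallel for collapse(3) reduction(+:total) num_threads(nthreads)
    for (long ii = 1; ii < hi; ii += block)
        for (long jj = 1; jj < hi; jj += block)
            for (long kk = 1; kk < hi; kk += block) {
                const long i_end = min_long(ii + block, hi);
                const long j_end = min_long(jj + block, hi);
                const long k_end = min_long(kk + block, hi);
                for (long i = ii; i < i_end; i++)
                    for (long j = jj; j < j_end; j++)
                        for (long k = kk; k < k_end; k++)
                            total += (a[i*n*n + j*n + k] * scale) / divisor;
            }
    return total;
}
```

Declaring the loop indexes and the accumulator locally removes the need for the long private(...) list, since variables declared inside the parallel region are private automatically. With GCC or Clang this would typically be built with -fopenmp; without that flag the pragma is ignored and the function runs serially.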

Good luck with your code.
