Optimization with Loop Tiling and OpenMP

Question

Below is the function I'm trying to optimize using OpenMP and loop tiling (a.k.a. loop blocking). However, the value of out is wrong once I apply loop tiling as shown below. Can someone look over my code and point out what is wrong with it? Thank you so much.

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include "utils.h"
const long BLOCK_SIZE = 8*DIM;
int i, j, k,ii,jj,kk, dim = DIM-1;

long compute, out = 1.0, we_need, gimmie;

void work_it_par(long *old, long *new)
{
 we_need = need_func();
 gimmie = gimmie_func();

 #pragma omp parallel for private(i,j,k,ii,jj,kk, compute)      firstprivate(we_need, gimmie, dim,old,BLOCK_SIZE) reduction(+:out)   num_threads(omp_get_num_procs())
for (ii=1; ii<dim-BLOCK_SIZE; ii+=BLOCK_SIZE) {
  for (jj=1; jj<dim-BLOCK_SIZE; jj+=BLOCK_SIZE) {
    for (kk=1; kk<dim-BLOCK_SIZE; kk+=BLOCK_SIZE) {
      for (i=ii; i<ii+BLOCK_SIZE; i++) {
        for (j=jj; j<jj+BLOCK_SIZE; j++) {
          for (k=kk; k<kk+BLOCK_SIZE; k++) {
            //int temp = i*DIM*DIM+j*DIM+k;
            compute = old[i*DIM*DIM+j*DIM+k] * we_need;
            out += compute / gimmie;
          }
        }
      }

    }
  }
}

printf("AGGR:%ld\n",out);

}

Answer 1

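First of all, const long BLOCK_SIZE = 8*DIM; seems super fishy to me. With a block eight times larger than the dimension (and assuming DIM is positive), dim-BLOCK_SIZE is negative, so the outermost condition ii<dim-BLOCK_SIZE is false from the start and the loop body never runs. Maybe replacing the * with a / would be closer to what you wanted?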

Even then, you still have to handle the boundaries by checking that the i, j and k indexes do not go past their limits. I'll let you figure out how to achieve that.
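One possible way to handle those limits, as a minimal sketch rather than the asker's actual function: clamp the end of each tile to the loop's upper bound so a partial last tile is still visited and nothing is read out of range. The names tiled_interior_sum, min_long, cube, n and block are hypothetical, and the sketch assumes the intended iteration space is 1 <= i, j, k < n-1 (mirroring dim = DIM-1) and that block is positive.

```c
/* Hypothetical helper, not part of the original code. */
static long min_long(long a, long b)
{
    return a < b ? a : b;
}

/* Sums the cells with 1 <= i, j, k < n-1 of an n*n*n cube, visiting them
   tile by tile. Assumes block > 0. */
long tiled_interior_sum(const long *cube, long n, long block)
{
    const long end = n - 1;   /* exclusive upper bound, like dim = DIM-1 */
    long sum = 0;

    /* The tile loops run up to the real bound, not bound - block. */
    for (long ii = 1; ii < end; ii += block) {
        const long i_end = min_long(ii + block, end);   /* clamp last tile */
        for (long jj = 1; jj < end; jj += block) {
            const long j_end = min_long(jj + block, end);
            for (long kk = 1; kk < end; kk += block) {
                const long k_end = min_long(kk + block, end);
                for (long i = ii; i < i_end; i++)
                    for (long j = jj; j < j_end; j++)
                        for (long k = kk; k < k_end; k++)
                            sum += cube[(i * n + j) * n + k];
            }
        }
    }
    return sum;
}
```

With this clamping, the result does not depend on whether block divides the range evenly, so the tiled version can be checked against the untiled loop for any block size.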

Last point on the algorithm: are you sure your loops have to start from index 1?

Finally, a few notes on the OpenMP correctness:

Although I see nothing wrong there, declaring firstprivate(we_need, gimmie, dim,old,BLOCK_SIZE) doesn't make much sense. These could happily stay shared.
I don't really know whether num_threads(omp_get_num_procs()) is correct or not. My feeling is that it is indeed valid, but just for "safety", I would tend to separate the function call from the directive, either by calling the function first, storing its result in a constant and using that in the directive, or by calling omp_set_num_threads() before the parallel directive.
Once your algorithm is fixed, you might want to consider adding a collapse clause to increase the level of parallelism you achieve here.
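The three notes above could be combined roughly as follows. This is only a sketch under assumptions, not the asker's work_it_par: the function name tiled_sum_omp and the parameters cube, n, block, scale and divisor are hypothetical stand-ins (scale and divisor play the role of we_need and gimmie), the iteration space is assumed to be 1 <= i, j, k < n-1, and block is assumed positive. Read-only inputs are left shared, the thread count is computed before the directive, and collapse(3) merges the three tile loops into one parallel iteration space.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: end of a tile, clamped to the loop bound. */
static long clamp_end(long start, long block, long end)
{
    const long stop = start + block;
    return stop < end ? stop : end;
}

long tiled_sum_omp(const long *cube, long n, long block,
                   long scale, long divisor)
{
    const long end = n - 1;   /* exclusive upper bound, like dim = DIM-1 */
    long sum = 0;

    /* Query the processor count outside the directive. */
    int threads = 1;
#ifdef _OPENMP
    threads = omp_get_num_procs();
#endif
    (void)threads;   /* silences a warning when built without OpenMP */

    /* cube, n, block, scale, divisor and end are only read, so they stay
       shared; variables declared inside the loops are private per thread. */
    #pragma omp parallel for collapse(3) reduction(+:sum) num_threads(threads)
    for (long ii = 1; ii < end; ii += block) {
        for (long jj = 1; jj < end; jj += block) {
            for (long kk = 1; kk < end; kk += block) {
                const long i_end = clamp_end(ii, block, end);
                const long j_end = clamp_end(jj, block, end);
                const long k_end = clamp_end(kk, block, end);
                for (long i = ii; i < i_end; i++)
                    for (long j = jj; j < j_end; j++)
                        for (long k = kk; k < k_end; k++)
                            sum += cube[(i * n + j) * n + k] * scale / divisor;
            }
        }
    }
    return sum;
}
```

Declaring the loop indexes inside the for statements removes the need for a private clause. Because the sum uses integer arithmetic, the reduction gives the same result regardless of how iterations are distributed across threads.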

Good luck with your code.
