Parallelize nested loops in OpenMP and run the inner loops with more threads

Question

I have the nested loops shown below, and I want to know the best way to parallelize them so that:

The second and third loops, and likewise the fifth and sixth loops, run at the same time
The first and fourth loops run serially

If I have 24 cores and want to divide the outer loop among 16 threads and use the remaining cores for the inner loops (for example, run the second loop with 8 threads instead of just one), what should I do?

int main(void)
{
    // first_for
    for (int y = 0; y < height; y++)
    {
        // second_for
        for (int x = 0; x < width - 1; x++)
        {
            func1();
        }
        // third_for
        for (int x = 0; x < width - 1; x++)
        {
            func2();
        }
    }
    // fourth_for
    for (int x = 0; x < width; x++)
    {
        // fifth_for
        for (int y = 0; y < height - 1; y++)
        {
            func3();
        }
        // sixth_for
        for (int y = 0; y < height - 1; y++)
        {
            func4();
        }
    }
}

Answer 1

When introducing parallelism, the usual advice is that coarser granularity is better. If a parallel directive at the outermost level already scales well, why add nested parallelism on top of it?

So, based on what can run concurrently, I would write main like this:

int main(void)
{
    // first_for: rows are distributed among the threads
    #pragma omp parallel for
    for (int y = 0; y < height; y++)
    {
        // second_for and third_for, merged
        for (int x = 0; x < width - 1; x++)
        {
            func1();
            func2();
        }
    }
    // fourth_for: columns are distributed among the threads
    #pragma omp parallel for
    for (int x = 0; x < width; x++)
    {
        // fifth_for and sixth_for, merged
        for (int y = 0; y < height - 1; y++)
        {
            func3();
            func4();
        }
    }
    return 0;
}

We increase the work done per row and per column by merging the two inner loops.
We add an OpenMP directive to split each outer loop into smaller chunks, depending on the number of cores.
See whether you can interchange the loop order: depending on what you do inside and how your "image" is laid out in memory, traversing it column by column may lead to a lot of cache misses.
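
To illustrate the cache remark, here is a minimal sketch. It assumes the image is a row-major float buffer indexed as y * width + x and that the column pass has no dependency between rows; neither is stated in the question. vertical_diff is a hypothetical stand-in for func3/func4, not the asker's code.

```c
#include <stddef.h>

/* Hypothetical column pass: for every pixel, store the difference with the
   pixel directly below it. img and out are row-major width*height buffers
   (an assumption). The last row of out is left untouched. */
void vertical_diff(const float *img, float *out, int width, int height)
{
    /* y is the outer, parallel loop; x is innermost, so consecutive
       iterations touch consecutive memory addresses. */
    #pragma omp parallel for
    for (int y = 0; y < height - 1; y++) {
        for (int x = 0; x < width; x++) {
            size_t i = (size_t)y * width + x;
            out[i] = img[i + width] - img[i];
        }
    }
}
```

Even though the operation is "vertical", the loops still walk the buffer row by row, so each thread reads and writes contiguous memory.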

EDIT

You can enable nested parallelism, but it goes the wrong way: too many loops and threads accessing different chunks of memory will just decrease performance, and you will also end up with a solution designed for 24 cores that may not scale to 32 cores, 48 cores, etc. But if you insist, you have to set an environment variable or call an OpenMP function:

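 call omp_set_nested(1)
 or
 set OMP_NESTED=TRUE
 (since OpenMP 5.0 both are deprecated in favor of omp_set_max_active_levels() and OMP_MAX_ACTIVE_LEVELS)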

Afterwards, add an OpenMP schedule clause on your top-level loop to specify the chunk size you want, so that only X threads get work.

// round up so that the rows form at most X chunks
int chunkSize = (height + X - 1) / X;
#pragma omp parallel for schedule(static, chunkSize)

The OpenMP thread team will still be composed of 24 threads, but with this schedule only X of them will have work to do. Follow the same logic for the nested loops.
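
For comparison, here is a minimal sketch of a nested version that uses the num_threads clause rather than the chunk-size trick above (a different technique, named here plainly). The function process_rows, the buffers a and b, and the loop bodies are hypothetical stand-ins for func1/func2. It calls omp_set_max_active_levels() instead of omp_set_nested(), since the latter is deprecated as of OpenMP 5.0.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical sketch: a and b are row-major width*height buffers.
   outer_threads is the number of threads that share the rows. */
void process_rows(int *a, int *b, int width, int height, int outer_threads)
{
#ifdef _OPENMP
    /* Allow two active levels of parallelism (rows + inner sections). */
    omp_set_max_active_levels(2);
#endif

    /* first_for: rows are divided among outer_threads threads. */
    #pragma omp parallel for num_threads(outer_threads)
    for (int y = 0; y < height; y++) {
        /* Each outer thread forks a team of 2 for the two inner loops. */
        #pragma omp parallel sections num_threads(2)
        {
            #pragma omp section
            {
                /* second_for (stand-in for func1) */
                for (int x = 0; x < width - 1; x++)
                    a[y * width + x] = y + x;
            }
            #pragma omp section
            {
                /* third_for (stand-in for func2) */
                for (int x = 0; x < width - 1; x++)
                    b[y * width + x] = y * x;
            }
        }
    }
}
```

On a 24-core machine, calling process_rows(a, b, width, height, 12) would request up to 12 x 2 = 24 threads, although the runtime is free to provide fewer. Forking an inner team for every row has a cost, so this is worth measuring against the coarse-grained version.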

But it's not the solution I recommend!

Answer 2

In addition to what has been said, you may want to explicitly enable nested parallelism. It is possible to do so with either a library call at run-time or an environment variable (for OpenMP).

For more information, check out the Oracle documentation.
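
As a rough sketch of the two options, the helper below (enable_nested_levels is a hypothetical name) makes the run-time library call and reports what the runtime actually accepted; the comment shows what I believe is the equivalent environment setup, with ./program as a placeholder.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Environment-variable alternative (set before launching the program):
 *   OMP_MAX_ACTIVE_LEVELS=2 OMP_NUM_THREADS=12,2 ./program
 * The comma-separated list gives the team size for each nesting level. */

/* Ask the runtime to allow `levels` nested active parallel regions and
   return the value actually in effect (1 when built without OpenMP). */
int enable_nested_levels(int levels)
{
#ifdef _OPENMP
    omp_set_max_active_levels(levels);
    return omp_get_max_active_levels();
#else
    (void)levels;
    return 1;
#endif
}
```

Checking the returned value is useful because an implementation may support fewer levels than requested.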
