I have the nested loops shown below and want to know the best way to parallelize them so that the second and third for loops run at the same time, the fifth and sixth for loops also run at the same time, and the first and fourth for loops run in serial, one after the other.
If I have 24 cores and want to divide the outer for loop among 16 threads and use the remaining cores for the inner loops (for example, run the second for loop with 8 threads instead of just one), what should I do?
void main()
{
    //first_for
    for(int y=0; y< height; y++)
    {
        //second_for
        for(int x=0; x< width-1; x++)
        {
            func1();
        }
        //third_for
        for(int x=0; x< width-1; x++)
        {
            func2();
        }
    }
    //fourth_for
    for(int x=0; x<width; x++)
    {
        //fifth_for
        for(int y=0; y< height-1; y++)
        {
            func3();
        }
        //sixth_for
        for(int y=0; y< height-1; y++)
        {
            func4();
        }
    }
}
Regarding where to introduce parallelism, the usual advice is that coarser is better: if a parallel directive at a coarse level already scales well, why would you also add nested parallelism?
So, based on what can run concurrently, I would write main like this:
int main()
{
    //first_for: rows are distributed among the threads
    #pragma omp parallel for
    for(int y=0; y< height; y++)
    {
        //second_for and third_for
        for(int x=0; x< width-1; x++)
        {
            func1();
            func2();
        }
    }
    //fourth_for: columns are distributed among the threads
    #pragma omp parallel for
    for(int x=0; x<width; x++)
    {
        //fifth_for and sixth_for
        for(int y=0; y< height-1; y++)
        {
            func3();
            func4();
        }
    }
    return 0;
}
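Merging the two inner loops increases the work done per row and per column. This assumes func2 at a given position does not need the results of func1 at later positions, and likewise for func3 and func4.
The OpenMP directive then splits the outer loop into chunks that are shared among the threads, by default one thread per available core.
Also see whether you can interchange the outer and inner loops where columns are processed first: depending on what you do inside and how your "image" is laid out in memory, traversing it column by column may cause a lot of cache misses.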
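To illustrate the point about loop order, here is a minimal sketch that assumes a row-major image of floats and uses a hypothetical vertical_diff function as a stand-in for the kind of work func3 might do (the real func3 is not shown in the question). The operation looks "column-wise" because it combines a pixel with the one below it, yet it can still be traversed row by row so that the inner loop touches contiguous memory:

```c
#include <stddef.h>

/* Hypothetical stand-in for func3: difference between each pixel and the
 * pixel directly below it. Assumes a row-major image, i.e. row y starts
 * at img[y * width]. out must hold (height - 1) * width floats. */
void vertical_diff(const float *img, float *out, int height, int width)
{
    /* Each output row depends only on two input rows, so the rows can be
     * shared among the threads. The inner loop walks along x, which is
     * contiguous in memory for a row-major layout. */
    #pragma omp parallel for
    for (int y = 0; y < height - 1; y++) {
        const float *row  = img + (size_t)y * width;
        const float *next = row + width;
        float *dst = out + (size_t)y * width;
        for (int x = 0; x < width; x++) {
            dst[x] = next[x] - row[x];
        }
    }
}
```

Whether this interchange is valid for the real func3 and func4 depends on their data dependencies, so treat it as something to check rather than a drop-in replacement.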
EDIT
You can enable nested parallelism, but I think it goes the wrong way: too many loops and threads accessing different chunks of memory will just decrease performance, and you will also end up with a solution designed for 24 cores that may not scale to 32 cores, 48 cores, and so on. But if you insist, you have to either set an environment variable or call an OpenMP function:
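call omp_set_nested(1)
or
set OMP_NESTED=TRUE (the accepted values are TRUE and FALSE)
Note that OpenMP 5.0 deprecates both in favor of omp_set_max_active_levels() and OMP_MAX_ACTIVE_LEVELS.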
Then add a schedule clause to your top-level loop to specify the chunk size you want, so that only X threads get work.
int chunkSize = (height + X - 1) / X; // round up so there are at most X chunks
#pragma omp parallel for schedule(static, chunkSize)
The OpenMP thread team will still consist of 24 threads, but with this schedule only X of them will have work to do. Follow the same logic for the nested loops.
But it's not the solution I recommend!
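If you do want to experiment with it, another way to limit the team sizes is the num_threads clause rather than the chunk size. The following is only a sketch under assumptions: count_iterations_nested, outer_threads and inner_threads are hypothetical names, and the loop body merely counts iterations where func1 and func2 would be called. Keep in mind that every outer thread starts its own inner team, so the total thread count is the product of the two team sizes: 16 x 8 would give 128 threads on 24 cores, whereas something like 3 x 8 gives exactly 24.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Counts the inner-loop iterations of the first loop nest using two
 * levels of parallelism. outer_threads and inner_threads are
 * hypothetical parameters; their product is the total thread count. */
long count_iterations_nested(int height, int width,
                             int outer_threads, int inner_threads)
{
    long total = 0;
#ifdef _OPENMP
    /* Allow two active levels of parallel regions (nested parallelism). */
    omp_set_max_active_levels(2);
#endif
    #pragma omp parallel for num_threads(outer_threads) reduction(+:total)
    for (int y = 0; y < height; y++) {
        long row = 0;
        /* Each outer thread starts its own inner team for this row. */
        #pragma omp parallel for num_threads(inner_threads) reduction(+:row)
        for (int x = 0; x < width - 1; x++) {
            row += 1; /* placeholder for func1(); func2(); */
        }
        total += row;
    }
    return total;
}
```

Calling count_iterations_nested(height, width, 3, 8) on a 24-core machine would be one way to try it. Without OpenMP enabled at compile time the pragmas are ignored and the function simply runs serially, which makes it easy to compare timings against the single-level version.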
In addition to what has been said, you may want to explicitly enable nested parallelism. With OpenMP you can do so either with a library call at run time or with an environment variable.
For more information, check out the Oracle documentation.