
How does OpenMP parallelize these loops?

Assume I have these loops:

#pragma omp parallel for
for (int i = 0; i < 100; ++i)
{
    // some big code here
    #pragma omp parallel for
    for (int j = 0; j < 200; j++)
    {
        // some small code here
    }
}

Which loop runs in parallel? Which one is the best to run in parallel?

The main points here are:

1- If the i-loop runs in parallel, since there is some big code there, there is a good chance of CPU cache hits on every iteration of the loop.

2- If the j-loop runs in parallel, since there is not much code there, it probably doesn't hit the CPU cache, but I lose the chance to run the big code in parallel.

I don't know how OpenMP runs these for loops in parallel, so I don't know how to optimize them.

My code should run on Windows (Visual Studio) and on ARM Linux.

Without enabling nesting (environment variable OMP_NESTED=true), only the outer loop will run in parallel; the inner parallel for is then executed serially by each outer thread.

If you enable nesting, both loops will run in parallel, but you will probably create too many threads.
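
As a rough sketch, enabling nesting from code rather than from the environment would look like this (the run_nested wrapper is just an illustrative name; newer runtimes prefer omp_set_max_active_levels over the older omp_set_nested):

#include <omp.h>

// hypothetical wrapper, only to show where the calls go
void run_nested(void)
{
    omp_set_nested(1);   // same effect as OMP_NESTED=true
                         // (OpenMP 3.0+ runtimes prefer omp_set_max_active_levels(2))

    #pragma omp parallel for
    for (int i = 0; i < 100; ++i) {
        // some big code here

        // each outer thread opens its own inner team, so thread counts multiply
        #pragma omp parallel for
        for (int j = 0; j < 200; j++) {
            // some small code here
        }
    }
}

With, say, 8 outer threads each opening an 8-thread inner team you would already be at 64 threads, which is the oversubscription problem mentioned above.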

You could use omp parallel for on the outer loop and, for the inner loop, use tasks that each group a number of iterations, for example:

#pragma omp parallel for
for (int i = 0; i < 100; i++) {
    // big code here

    // chunk the 200 inner iterations and create one task per chunk
    int blocksize = 200 / omp_get_num_threads();   // needs #include <omp.h>
    if (blocksize < 1) blocksize = 1;
    for (int j = 0; j < 200; j += blocksize) {
        int mystart = j;
        int myend   = (j + blocksize <= 200) ? j + blocksize - 1 : 199;   // clamp the last chunk
        #pragma omp task firstprivate(mystart, myend)
        {
            // small code here, working on iterations mystart..myend
        }
    }
    #pragma omp taskwait   // wait for this i-iteration's tasks before continuing
}

If you consider using SIMD in the inner loop, then it can be written quite similarly to what you had:

#pragma omp parallel for
for (int i = 0; i<100; i++) {
    //big code here
    #pragma omp simd
    for (int j = 0; j<200; j++) {
        //small code here
    }   
}

But this last option is very specific: it basically forces the compiler to vectorize the loop.

More info on the topic: at https://software.intel.com/en-us/articles/enabling-simd-in-program-using-openmp40 you will find an example that uses #pragma omp parallel for simd. That means the loop is parallelized and each thread runs its part of the iteration space with vectorization applied. Applied to the inner loop, this would still require enabling nesting of parallel regions (OMP_NESTED) and, depending on the runtime implementation, it can create multiple teams of threads, up to one per thread of the outer loop.
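
A sketch of what that combination would look like here (assuming nesting is enabled as discussed above and an OpenMP 4.0 compiler; whether it pays off is very workload-dependent):

#pragma omp parallel for              // outer loop distributed across cores
for (int i = 0; i < 100; i++) {
    // big code here

    #pragma omp parallel for simd     // inner loop: extra threads + vectorization (needs nesting)
    for (int j = 0; j < 200; j++) {
        // small code here
    }
}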

I agree that experimentation is a great way to learn about parallel programming, and you should try multiple combinations (inner only, outer only, both, something else?) to see which is best for your code. The rest of my answer will hopefully give you a hint as to why the fastest way is fastest.
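
For those experiments, a minimal timing sketch with omp_get_wtime() might look like this (the commented placeholder stands for whichever combination you are measuring):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    double t0 = omp_get_wtime();

    // ... put the variant under test here (outer-only, inner-only, nested, simd, ...) ...

    double t1 = omp_get_wtime();
    printf("elapsed: %.3f s\n", t1 - t0);
    return 0;
}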

Nesting parallel regions can be done, but it is typically not what you want. Consider this question for a similar discussion.

When choosing which loop to parallelize, a common rule of thumb is to parallelize the outermost loop for multicore and the innermost loop for SIMD. There are of course some caveats to this. Not all loops can be parallelized, so in that case you should move on to the next loop. Additionally, locality, load balancing, and false sharing may change which loop is optimal.
