
Increasing array index in OpenMP

I am new to using OpenMP. I am trying to parallelize a nested loop, and so far I have something of this form...

#pragma omp parallel for
for (j = 0; j < m; j++) {
    some work;
    for (i = 0; i < n; i++) {
        p = b[i];
        if (p < 0 && k < m) {
            a[k] = c[i];
            k++;
        } else {
            x = c[i];
        }
    }
    some work
}

The outer loop runs in parallel, and the inner loop updates k . Each thread needs the current value of k to update a[k] correctly. The problem is that all of the threads update a[k] , but the proper ordering of k is not maintained.

Some threads will update k and a[k] , and some will not. How do I communicate the latest k between threads so that a[k] is updated properly, given that c[i] holds different values in each thread?

For example, when it runs serially, the program might set the first seven values of a to {1,3,5,7,3,9,13} and terminate with k equal to 7, but when run in parallel it produces different results, or results in a different (and therefore wrong) order.

How do I keep the same order and ensure parallelism at the same time?

Note : this answer was completely rewritten in light of OP clarifications. The original answer text is at the end.

How do I keep the same order and ensure parallelism at the same time?

Order dependency is antithetical to parallelism, as running operations in parallel inherently entails relaxing the relative order in which they are performed. Not all computations can be effectively parallelized.

Your case is not an exception. The second and each subsequent iteration of your outer loop needs to use the final value of k (among other things) computed by the previous iteration. How can it get that? Only by performing the previous iteration first. What room does that leave for concurrent operation? None. Concurrency is not the same thing as parallelism, but it is one of the main motivations for parallelism, because that's how parallelism yields improvements in elapsed time.

With no scope for concurrency, parallelism is actively counterproductive for you. Suppose you made the whole body of the outer loop a critical section, so that there was no concurrency in fact (as your present code requires) and no data races involving k . Then you would still pay the overhead for parallelism, get no speedup in return, and probably still get the wrong results because of evaluations of the outer-loop body being performed in the wrong order.

It may be that the whole thing can be rewritten to reduce or remove the data dependencies that prevent effective parallelization of the computation, or it may not. We don't have enough information to determine that, as it depends in part on the details of " some work " and on the significance of the data. Probably you would need an altogether different algorithm to produce the desired results.

> Instead of giving a[n]={0,1,2,3,.......n}, it gives me garbage values for a when I use the reduction clause. I need the total sum of K, hence the reduction clause.

There is a closed-form equation for the sum of consecutive integers, and it has especially simple form when the first integer in the list is 0 or 1. In particular, the sum of the integers from 0 to n , inclusive, is n * (n + 1) / 2 . You do not need a reduction for this.

If you wanted to use a reduction anyway, then you need to understand that it doesn't work the way you seem to think it does. What you get is a separate, private copy of the reduction variable for each thread executing the parallel construct, with the per-thread (not per-iteration) final values of those independent variables combined according to the reduction operator. Thus, if you really want to do the computation via an OpenMP reduction, then you would need to restructure the loop something like this:

#pragma omp parallel for reduction(+:k)
for (i = 0; i < 10; i++) {
    a[i] = i;
    k += i;
}

That assumes that the value of k is 0 immediately prior to the loop, as you indeed seem to be doing. If that were not a safe assumption then you would need something like

type_of_k k0 = k;
k = 0;
#pragma omp parallel for reduction(+:k)
for (i = 0; i < 10; i++) {
    a[k0 + i] = i;
    k += k0 + i;
}

Note that in either case, not only does that set up the reduction correctly, but it also breaks the data dependency between loop iterations that was previously carried by the expression k++ .

It sounds like you're essentially filling a with a filtered subset of the entries from c , and want to preserve their order. If this is the only use k has, some other methods spring to mind:

  1. Always write a[i] , but use a mark indicating unused values wherever the p < 0 predicate wasn't satisfied. This preserves order, but requires a larger a , which you can compact in a second pass.

  2. Write an a_i array storing which index each entry belonged to. This still requires an atomic capture of k (e.g. #pragma omp atomic capture around k_local = k++; ), and a second sort to restore order. And you'd need both a and a_i to be the full size again, or you might miss entries, so all in all a terrible workaround.

Even with some sequential dependencies you can do optimizations, e.g. a scan to calculate what k would be for each i can be done in O(log n) parallel steps rather than O(n) sequential ones (see parallel prefix sum, and OpenMP discussions of it on Stack Overflow). This sort of thing is what OpenMP's ordered depend is for, I believe. Anyhow, this leads to the third option:

  3. Generate a k array holding the value k will have at each iteration, so that the threads that do write store to the correct places. This requires a scan over the predicate.

It is useful to have higher-level constructs like map, scan, and reduce in mind when planning out algorithms.
