
Why should I use a reduction rather than an atomic variable?

Assume we want to count something in an OpenMP loop. Compare the reduction

int counter = 0;
#pragma omp for reduction( + : counter )
for (...) {
    ...
    counter++;
}

with the atomic increment

int counter = 0;
#pragma omp for
for (...) {
    ...
    #pragma omp atomic
    counter++;
}

The atomic access makes the intermediate result visible immediately, while a reduction variable only takes its correct value at the end of the loop. For instance, a reduction does not allow this inside the loop body:

int t = counter;
if (t % 1000 == 0) {
    printf ("%dk iterations\n", t/1000);
}

thus providing less functionality.

Why would I ever use a reduction instead of atomic access to a counter?

Short answer:

Performance

Long Answer:

Because an atomic variable comes with a price, and that price is synchronization. To ensure there are no race conditions, i.e., two threads modifying the same variable at the same moment, the threads must synchronize, which effectively means you lose parallelism: the threads are serialized.

A reduction, on the other hand, is a general operation that can be carried out in parallel using parallel reduction algorithms: each thread works on its own private copy, and only the final combining step needs synchronization.


Addendum: Getting a sense of how a parallel reduction works

Imagine a scenario where you have 4 threads and you want to reduce an 8-element array A. You can do this in 3 steps:

  • Step 0 . Threads with index i<4 compute A[i] = A[i] + A[i+4] .
  • Step 1 . Threads with index i<2 compute A[i] = A[i] + A[i+2] .
  • Step 2 . The thread with index i<1 computes A[0] = A[0] + A[1] .

At the end of this process, the result of the reduction is in the first element of A, i.e., A[0].


Performance is the key point.

Consider the following program

#include <stdio.h>
#include <omp.h>
#define N 1000000
int a[N], sum;   // global, hence zero-initialized

int main() {
  double begin, end;

  sum = 0;
  begin = omp_get_wtime();
  for (int i = 0; i < N; i++)
    sum += a[i];
  end = omp_get_wtime();
  printf("serial %g\t", end - begin);

  sum = 0;
  begin = omp_get_wtime();
#pragma omp parallel for
  for (int i = 0; i < N; i++)
#pragma omp atomic
    sum += a[i];
  end = omp_get_wtime();
  printf("atomic %g\t", end - begin);

  sum = 0;
  begin = omp_get_wtime();
#pragma omp parallel for reduction(+ : sum)
  for (int i = 0; i < N; i++)
    sum += a[i];
  end = omp_get_wtime();
  printf("reduction %g\n", end - begin);
  return 0;
}

When compiled with gcc -O3 -fopenmp and executed, it gives:

serial 0.00491182 atomic 0.0786559 reduction 0.001103

So, from these numbers, atomic is roughly 16x slower than serial, and serial roughly 4x slower than reduction (atomic is roughly 70x slower than reduction).

The 'reduction' properly exploits the parallelism: on a 4-core machine, we can get a 3-6x performance boost over "serial".

Now, "atomic" is much slower than "serial". As explained in the previous answer, the serialization of memory accesses not only disables parallelism; it also forces every memory access to go through an atomic operation. Such operations require at least 20-50 cycles on modern computers and will dramatically slow down your program if used intensively.
