Why is OpenMP reduction slower than MPI on a shared-memory architecture?

Question

I tested OpenMP and MPI parallel implementations of the inner product of two vectors (element values are computed on the fly) and found that OpenMP is slower than MPI. The MPI code I am using is as follows:

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <omp.h>
#include <mpi.h>


int main(int argc, char* argv[])
{
    double ttime = -omp_get_wtime();
    int np, my_rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    int n = 10000;
    int repeat = 10000;

    /* Split the n elements into contiguous chunks, one per rank. */
    int sublength = (int)(ceil((double)(n) / (double)(np)));
    int nstart = my_rank * sublength;
    int nend   = nstart + sublength;
    if (nend > n)
    {
        nend = n;
        sublength = nend - nstart;
    }

    double dot = 0;
    double sum = 1;

    int j, k;
    double time = -omp_get_wtime();
    for (j = 0; j < repeat; j++)
    {
        double loc_dot = 0;
        for (k = 0; k < sublength; k++)
        {
            double temp = sin((sum + nstart + k + j) / (double)(n));
            loc_dot += (temp * temp);
        }
        /* Combine the partial sums; every rank receives the total. */
        MPI_Allreduce(&loc_dot, &dot, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        sum += (dot / (double)(n));
    }
    time += omp_get_wtime();
    if (my_rank == 0)
    {
        ttime += omp_get_wtime();
        printf("np = %d sum = %f, loop time = %f sec, total time = %f \n", np, sum, time, ttime);
    }
    MPI_Finalize();
    return 0;
}

I have tried several different implementations with OpenMP. Here is a version that is not too complicated and is close to the best performance I can achieve.

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <omp.h>


int main(int argc, char* argv[])
{
    int n = 10000;
    int repeat = 10000;

    int np = 1;
    if (argc > 1)
    {
        np = atoi(argv[1]);
    }
    omp_set_num_threads(np);

    int nstart = 0;
    int sublength = n;

    double loc_dot = 0;
    double sum = 1;
    #pragma omp parallel
    {
        int j, k;

        double time = -omp_get_wtime();

        for (j = 0; j < repeat; j++)
        {
            #pragma omp for reduction(+: loc_dot)
            for (k = 0; k < sublength; k++)
            {
                double temp = sin((sum + nstart + k + j) / (double)(n));
                loc_dot += (temp * temp);
            }
            #pragma omp single
            {
                sum += (loc_dot / (double)(n));
                loc_dot = 0;
            }
        }
        time += omp_get_wtime();
        #pragma omp single nowait
        printf("sum = %f, time = %f sec, np = %d\n", sum, time, np);
    }

    return 0;
}

Here are my test results:

OMP
sum = 6992.953984, time = 0.409850 sec, np = 1
sum = 6992.953984, time = 0.270875 sec, np = 2
sum = 6992.953984, time = 0.186024 sec, np = 4
sum = 6992.953984, time = 0.144010 sec, np = 8
sum = 6992.953984, time = 0.115188 sec, np = 16
sum = 6992.953984, time = 0.195485 sec, np = 32

MPI
sum = 6992.953984, time = 0.381701 sec, np = 1
sum = 6992.953984, time = 0.243513 sec, np = 2
sum = 6992.953984, time = 0.158326 sec, np = 4
sum = 6992.953984, time = 0.102489 sec, np = 8
sum = 6992.953984, time = 0.063975 sec, np = 16
sum = 6992.953984, time = 0.044748 sec, np = 32

Can anyone tell me what I am missing? Thanks!

Update: I have written an acceptable reduce function for OpenMP. Its performance is now close to that of the MPI reduce function. The code is as follows.

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <omp.h>

double darr[2][64];
int    nreduce=0;
#pragma omp threadprivate(nreduce)


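/* All-reduce (sum) across the threads of the current team.
   Each thread publishes its partial sum, waits at the barrier,
   then sums every thread's contribution. The two rows of darr are
   used alternately so that a single barrier per call is enough. */
double OMP_Allreduce_dsum(double loc_dot, int tid, int np)
{
    darr[nreduce][tid] = loc_dot;
    #pragma omp barrier
    double dsum = 0;
    int i;
    for (i = 0; i < np; i++)
    {
        dsum += darr[nreduce][i];
    }
    nreduce = 1 - nreduce;
    return dsum;
}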

int main(int argc, char* argv[])
{
    int np = 1;
    if (argc > 1)
    {
        np = atoi(argv[1]);
    }
    omp_set_num_threads(np);
    double ttime = -omp_get_wtime();

    int n = 10000;
    int repeat = 10000;

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int sublength = (int)(ceil((double)(n) / (double)(np)));
        int nstart = tid * sublength;
        int nend   = nstart + sublength;
        if (nend > n)
        {
            nend = n;
            sublength = nend - nstart;
        }

        double sum = 1;
        double time = -omp_get_wtime();

        int j, k;
        for (j = 0; j < repeat; j++)
        {
            double loc_dot = 0;
            for (k = 0; k < sublength; k++)
            {
                double temp = sin((sum + nstart + k + j) / (double)(n));
                loc_dot += (temp * temp);
            }
            double dot = OMP_Allreduce_dsum(loc_dot, tid, np);
            sum += (dot / (double)(n));
        }
        time += omp_get_wtime();
        #pragma omp master
        {
            ttime += omp_get_wtime();
            printf("np = %d sum = %f, loop time = %f sec, total time = %f \n", np, sum, time, ttime);
        }
    }

    return 0;
}

Answer 1

First of all, this code is very sensitive to synchronization overheads (both software and hardware), which results in apparently strange behaviors that depend on both the OpenMP runtime implementation and low-level processor operations (e.g. cache/bus effects). Indeed, a full synchronization is required for each iteration of the j-based loop, and the 10,000 iterations complete in about 45 ms in the fastest run. This means 4.5 us/iteration. In such a short time, the partial sums spread over 32 cores need to be reduced and the result broadcast. If each core accumulates its own value in a shared atomic location, taking for example 60 ns per atomic add (a realistic overhead for atomics on scalable Xeon processors), it would take 32 * 60 ns = 1.92 us, since this process is done sequentially on x86 processors so far. This small additional time represents an overhead of 43% on the overall execution time because of the barriers! Due to contention on atomic variables, timings are often much worse. Moreover, the barriers themselves are expensive (they are often implemented using atomics in OpenMP runtimes, but in a way that can scale a bit better).
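The arithmetic above can be written down as a small back-of-the-envelope model. This is only a sketch: the struct and function names are made up for illustration, and the inputs (45 ms, 10,000 iterations, 32 cores, 60 ns per atomic add) are the figures quoted in this answer, not measurements of any specific machine.

```c
#include <assert.h>

/* Rough model of the synchronization cost per outer iteration.
   All names here are hypothetical helpers, not part of OpenMP or MPI. */
typedef struct {
    double per_iteration_us;   /* time budget of one outer iteration */
    double serialized_add_us;  /* cost of `cores` atomic adds done one after another */
    double overhead_fraction;  /* serialized adds relative to the budget */
} sync_estimate;

sync_estimate estimate_sync_overhead(double loop_time_ms, int iterations,
                                     int cores, double atomic_add_ns)
{
    assert(iterations > 0 && cores > 0);
    sync_estimate e;
    e.per_iteration_us  = loop_time_ms * 1000.0 / iterations;   /* ms -> us */
    e.serialized_add_us = cores * atomic_add_ns / 1000.0;       /* ns -> us */
    e.overhead_fraction = e.serialized_add_us / e.per_iteration_us;
    return e;
}
```

Calling estimate_sync_overhead(45.0, 10000, 32, 60.0) gives 4.5 us per iteration, 1.92 us of serialized atomic adds, and a ratio of roughly 0.43.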

The first OpenMP implementation was slow because of implicit synchronizations and complex hardware cache effects. Indeed, the omp for reduction directive performs an implicit barrier at the end of its region, as does omp single. The reduction itself can be implemented in several ways. The OpenMP runtime of ICC uses a clever tree-based atomic implementation which should scale quite well (but not perfectly). Moreover, the omp single section will cause some cache-line bouncing. Indeed, the result loc_dot will likely be stored in the cache of the last core that updated it, while the thread executing this section will likely be scheduled on another core. In this case, the processor has to move the cache line from one L2 cache to another (or load the value from the L3 cache directly, depending on the hardware state). The same thing also applies to sum (which tends to move between cores, as the thread executing the section will likely not always be scheduled on the same core). Finally, the sum variable must be broadcast to each core so that they can start a new iteration.
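To make the two implicit barriers per iteration concrete, here is a trimmed sketch of the first OpenMP version with the synchronization points marked in comments. The function name repeated_reduction is hypothetical, and sin() is replaced by a plain square so the sketch does not depend on the math library; the structure of the loop is otherwise the same as in the question.

```c
#include <assert.h>

/* Hypothetical helper: repeats the reduce-then-update step `repeat` times. */
double repeated_reduction(int n, int repeat)
{
    assert(n > 0 && repeat >= 0);
    double loc_dot = 0.0;   /* shared reduction target */
    double sum = 1.0;       /* shared, read by all threads, written by one */

    #pragma omp parallel
    {
        for (int j = 0; j < repeat; j++)
        {
            #pragma omp for reduction(+: loc_dot)
            for (int k = 0; k < n; k++)
            {
                double x = (sum + k + j) / (double)n;
                loc_dot += x * x;
            }
            /* Implicit barrier 1: all partial sums are combined into
               loc_dot before any thread continues. */

            #pragma omp single
            {
                sum += loc_dot / (double)n;
                loc_dot = 0.0;
            }
            /* Implicit barrier 2: every thread waits here until the
               new value of sum has been written. */
        }
    }
    return sum;
}
```

Both barriers are needed for correctness in this formulation, which is why this version pays for two full synchronizations per iteration, whereas the hand-written all-reduce in the update pays for one.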

The last OpenMP implementation is significantly better since every thread works on its own local data, it uses only one barrier (this synchronization is mandatory given the algorithm), and caches are better used. The accumulation part may not be ideal, as all cores will likely fetch data previously located in all other L1/L2 caches, causing an all-to-all broadcast pattern. This hardware operation scales poorly, though it should not be fully sequential either.

Note that the last OpenMP implementation suffers from false sharing. Indeed, items of darr are stored contiguously in memory and share the same cache lines. As a result, when a thread writes to darr, the associated core requests the cache line and invalidates the copies located on other cores. This causes cache-line bouncing between cores. However, on current x86 processors, cache lines are 64 bytes wide and a double variable takes 8 bytes, resulting in 8 items per cache line. This typically limits the cache-line bouncing to groups of 8 cores out of the 32. That being said, the item packing has some benefits, as only 4 cache-line fetches are required per core to perform the global accumulation. To prevent false sharing, one can allocate an (8 times) bigger array and reserve some space between items so that 1 item is stored per cache line. The best strategy on your target processor may be to use a tree-based atomic reduction like the one the ICC OpenMP runtime uses. Ideally, the sum reduction and the barrier can be merged together for better performance. This is what the MPI implementation can do internally (MPI_Allreduce).
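A minimal sketch of the padding idea follows, assuming 64-byte cache lines. The names padded_double, slots, and padded_allreduce_dsum are hypothetical; the function mirrors OMP_Allreduce_dsum from the question, except that each partial sum gets its own 64-byte slot and the 0/1 buffer toggle is passed in by the caller instead of being a threadprivate global. Whether this is actually faster than the packed array depends on the machine, since the accumulation then touches one cache line per thread instead of 4 in total.

```c
#include <assert.h>

#define CACHE_LINE_BYTES 64   /* assumption: 64-byte lines, as on current x86 */
#define MAX_THREADS 64        /* same capacity as darr in the question */

/* One partial sum per slot: 8 bytes of payload, 56 bytes of padding,
   so two threads never write to the same cache line. */
typedef struct {
    double value;
    char pad[CACHE_LINE_BYTES - sizeof(double)];
} padded_double;

/* Two buffers used alternately, as with darr[2][64]. */
static padded_double slots[2][MAX_THREADS];

/* Hypothetical padded variant of OMP_Allreduce_dsum.
   `phase` is the calling thread's private 0/1 toggle; it is flipped on return. */
double padded_allreduce_dsum(double loc_dot, int tid, int np, int *phase)
{
    assert(np >= 1 && np <= MAX_THREADS && tid >= 0 && tid < np);
    slots[*phase][tid].value = loc_dot;
    #pragma omp barrier
    double dsum = 0.0;
    for (int i = 0; i < np; i++)
    {
        dsum += slots[*phase][i].value;
    }
    *phase = 1 - *phase;
    return dsum;
}
```

Inside the parallel region, each thread would declare its own int phase = 0 and call padded_allreduce_dsum(loc_dot, tid, np, &phase) where the question's code calls OMP_Allreduce_dsum.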

Note that all implementations suffer from very frequent thread synchronization. This is a problem because context switches regularly occur on some cores due to operating-system/hardware events (network, storage device, user, system processes, etc.). One critical issue is frequency scaling on modern x86 processors: not all cores run at the same frequency, and their frequencies change over time. The slowest thread slows down all the others because of the barrier. In the worst case, some threads may wait passively, allowing some cores to sleep (C-states); these cores then take more time to wake up, further slowing down the others, depending on the platform configuration.

The takeaway is:
the more synchronized a code is, the lower its scaling and the more challenging its optimization.
