Why is OpenMP reduction slower than MPI on a shared-memory architecture?

I have been testing OpenMP and MPI parallel implementations of the inner product of two vectors (element values are computed on the fly) and found that OpenMP is slower than MPI. The MPI code I am using is as follows:

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <omp.h>
#include <mpi.h>


int main(int argc, char* argv[])
{
    double ttime = -omp_get_wtime();
    int np, my_rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    int n = 10000;
    int repeat = 10000;

    int sublength = (int)(ceil((double)(n) / (double)(np)));
    int nstart = my_rank * sublength;
    int nend   = nstart + sublength;
    if (nend > n)
    {
        nend = n;
        sublength = nend - nstart;
    }

    double dot = 0;
    double sum = 1;

    int j, k;
    double time = -omp_get_wtime();
    for (j = 0; j < repeat; j++)
    {
        double loc_dot = 0;
        for (k = 0; k < sublength; k++)
        {
            double temp = sin((sum + nstart + k + j) / (double)(n));
            loc_dot += (temp * temp);
        }
        MPI_Allreduce(&loc_dot, &dot, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        sum += (dot / (double)(n));
    }
    time += omp_get_wtime();
    if (my_rank == 0)
    {
        ttime += omp_get_wtime();
        printf("np = %d sum = %f, loop time = %f sec, total time = %f \n", np, sum, time, ttime);
    }
    MPI_Finalize();
    return 0;
}

I have tried several different OpenMP implementations. Here is the version that is not overly complicated and is close to the best performance I can achieve:

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <omp.h>


int main(int argc, char* argv[])
{
    int n = 10000;
    int repeat = 10000;

    int np = 1;
    if (argc > 1)
    {
        np = atoi(argv[1]);
    }
    omp_set_num_threads(np);

    int nstart = 0;
    int sublength = n;

    double loc_dot = 0;
    double sum = 1;
    #pragma omp parallel
    {
        int j, k;

        double time = -omp_get_wtime();

        for (j = 0; j < repeat; j++)
        {
            #pragma omp for reduction(+: loc_dot)
            for (k = 0; k < sublength; k++)
            {
                double temp = sin((sum + nstart + k + j) / (double)(n));
                loc_dot += (temp * temp);
            }
            #pragma omp single
            {
                sum += (loc_dot / (double)(n));
                loc_dot = 0;
            }
        }
        time += omp_get_wtime();
        #pragma omp single nowait
        printf("sum = %f, time = %f sec, np = %d\n", sum, time, np);
    }

    return 0;
}

Here are my test results:

OMP
sum = 6992.953984, time = 0.409850 sec, np = 1
sum = 6992.953984, time = 0.270875 sec, np = 2
sum = 6992.953984, time = 0.186024 sec, np = 4
sum = 6992.953984, time = 0.144010 sec, np = 8
sum = 6992.953984, time = 0.115188 sec, np = 16
sum = 6992.953984, time = 0.195485 sec, np = 32

MPI
sum = 6992.953984, time = 0.381701 sec, np = 1
sum = 6992.953984, time = 0.243513 sec, np = 2
sum = 6992.953984, time = 0.158326 sec, np = 4
sum = 6992.953984, time = 0.102489 sec, np = 8
sum = 6992.953984, time = 0.063975 sec, np = 16
sum = 6992.953984, time = 0.044748 sec, np = 32

Can anyone tell me what I am missing? Thanks!

Update: I have written an acceptable reduce function for OpenMP. Its performance is now close to that of the MPI reduce function. The code is as follows:

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <omp.h>

double darr[2][64];
int    nreduce = 0;
#pragma omp threadprivate(nreduce)


double OMP_Allreduce_dsum(double loc_dot, int tid, int np)
{
    darr[nreduce][tid] = loc_dot;
    #pragma omp barrier
    double dsum = 0;
    int i;
    for (i = 0; i < np; i++)
    {
        dsum += darr[nreduce][i];
    }
    nreduce = 1 - nreduce;
    return dsum;
}

int main(int argc, char* argv[])
{
    int np = 1;
    if (argc > 1)
    {
        np = atoi(argv[1]);
    }
    omp_set_num_threads(np);
    double ttime = -omp_get_wtime();

    int n = 10000;
    int repeat = 10000;

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int sublength = (int)(ceil((double)(n) / (double)(np)));
        int nstart = tid * sublength;
        int nend   = nstart + sublength;
        if (nend > n)
        {
            nend = n;
            sublength = nend - nstart;
        }

        double sum = 1;
        double time = -omp_get_wtime();

        int j, k;
        for (j = 0; j < repeat; j++)
        {
            double loc_dot = 0;
            for (k = 0; k < sublength; k++)
            {
                double temp = sin((sum + nstart + k + j) / (double)(n));
                loc_dot += (temp * temp);
            }
            double dot = OMP_Allreduce_dsum(loc_dot, tid, np);
            sum += (dot / (double)(n));
        }
        time += omp_get_wtime();
        #pragma omp master
        {
            ttime += omp_get_wtime();
            printf("np = %d sum = %f, loop time = %f sec, total time = %f \n", np, sum, time, ttime);
        }
    }

    return 0;
}

First of all, this code is very sensitive to synchronization overheads (both software and hardware), resulting in behavior that can appear strange, tied both to the OpenMP runtime implementation and to low-level processor operations (e.g. cache/bus effects). Indeed, a full synchronization is required for each iteration of the j-based loop, and the whole loop runs in about 45 ms, which means about 4.5 us per iteration. In such a short time, the partial sums spread over 32 cores need to be reduced and broadcast. If each core accumulates its own value into a shared atomic location, taking for example 60 ns per atomic add (a realistic overhead for atomics on scalable Xeon processors), it would take 32 * 60 ns = 1.92 us, since so far this process is performed sequentially on x86 processors. This small additional time alone represents an overhead of about 43% of the overall execution time, just because of the barriers! Due to contention on atomic variables, timings are often much worse. Moreover, the barriers themselves are expensive (they are often implemented using atomics in OpenMP runtimes, but in a way that can scale a bit better).
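To make this concrete, here is a minimal, standalone sketch (not the question's code) of the pattern that estimate refers to: every thread folds its partial sum into one shared variable with an atomic add, and the hardware serializes those read-modify-write operations. The 60 ns per atomic add is the estimate above, not something measured by this snippet.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    double dot = 0.0;

    #pragma omp parallel
    {
        double loc_dot = omp_get_thread_num() + 1.0;  /* stand-in for a per-thread partial sum */

        #pragma omp atomic
        dot += loc_dot;       /* serialized: the cache line holding dot ping-pongs between cores,
                                 so 32 threads cost roughly 32 * 60 ns = 1.92 us per pass          */

        #pragma omp barrier   /* in the question's loop, a barrier is needed here before any
                                 thread is allowed to read the final value of dot                  */
    }
    printf("dot = %f\n", dot);
    return 0;
}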

The first OpenMP implementation was slow because of implicit synchronizations and complex hardware cache effects. Indeed, the omp for reduction directive performs an implicit barrier at the end of its region, and so does omp single. The reduction itself can be implemented in several ways. The OpenMP runtime of ICC uses a clever tree-based atomic implementation which should scale quite well (though not perfectly). Moreover, the omp single section will cause some cache-line bouncing. Indeed, the result loc_dot will likely be stored in the cache of the last core that updated it, while the thread executing this section will likely be scheduled on another core. In this case, the processor has to move the cache line from one L2 cache to another (or load the value directly from the L3 cache, depending on the hardware state). The same thing also applies to sum (which tends to move between cores, since the thread executing the section will likely not always be scheduled on the same core). Finally, the sum variable must be broadcast to each core so that they can start a new iteration.
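To make the hidden synchronization visible, here is a runnable sketch of a single j-iteration of that first version, with reduction(+: loc_dot) spelled out as one possible lowering (a private partial sum combined with an atomic add). The actual ICC runtime uses a tree of atomics instead, so this only illustrates where the two barriers per iteration sit, not the real runtime code.

#include <stdio.h>
#include <math.h>

int main(void)
{
    const int n = 10000;
    double sum = 1.0;
    double loc_dot = 0.0;

    #pragma omp parallel
    {
        double my_dot = 0.0;                /* private copy created by reduction(+)       */

        #pragma omp for nowait              /* nowait: the combine below stands in for    */
        for (int k = 0; k < n; k++)         /* the work done at the end-of-loop barrier   */
        {
            double temp = sin((sum + k) / (double)n);
            my_dot += temp * temp;
        }

        #pragma omp atomic
        loc_dot += my_dot;                  /* one possible lowering of the reduction     */

        #pragma omp barrier                 /* barrier #1: end of "omp for reduction"     */

        #pragma omp single
        {
            sum += loc_dot / (double)n;     /* the lines holding sum/loc_dot bounce to    */
            loc_dot = 0.0;                  /* whichever core runs the single construct   */
        }                                   /* barrier #2: implicit, end of "omp single"  */
    }
    printf("sum = %f\n", sum);
    return 0;
}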

The last OpenMP implementation is significantly better since every thread works on its own local data, it uses only one barrier (this synchronization is mandatory for the algorithm), and caches are better used. The accumulation part may not be ideal, though, as every core will likely fetch data previously located in all the other L1/L2 caches, causing an all-to-all broadcast pattern. This hardware operation scales poorly, but at least it is not sequential.
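To give an idea of what a less linear combine could look like, here is a minimal sketch of a pairwise (tree-shaped) all-reduce in the spirit of the tree-based strategy mentioned below. The function and array names (tree_allreduce_dsum, buf) are ours; the per-level full barriers are the simplest correct choice but are heavier than the lighter pairwise synchronization a production runtime would use, and buf still shares cache lines unless it is padded as shown further down.

#include <stdio.h>
#include <omp.h>

#define MAXTHREADS 64

static double buf[MAXTHREADS];

double tree_allreduce_dsum(double loc, int tid, int np)
{
    buf[tid] = loc;
    #pragma omp barrier                       /* all partial sums are published           */
    for (int stride = 1; stride < np; stride *= 2)
    {
        if (tid % (2 * stride) == 0 && tid + stride < np)
            buf[tid] += buf[tid + stride];    /* each level halves the active slots       */
        #pragma omp barrier                   /* level-by-level synchronization           */
    }
    double result = buf[0];                   /* every thread reads the final value       */
    #pragma omp barrier                       /* no one may overwrite buf before that     */
    return result;
}

int main(void)
{
    int np = omp_get_max_threads();
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        double dot = tree_allreduce_dsum((double)(tid + 1), tid, np);
        #pragma omp master
        printf("dot = %f (expected %f)\n", dot, np * (np + 1) / 2.0);
    }
    return 0;
}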

Note that the last OpenMP implementation suffers from false sharing. Indeed, the items of darr are stored contiguously in memory and share the same cache lines. As a result, when a thread writes into darr, the associated core requests the cache line and invalidates the copies held by the other cores. This causes cache lines to bounce between cores. However, on current x86 processors cache lines are 64 bytes wide and a double takes 8 bytes, resulting in 8 items per cache line. Thus, the cache-line bouncing is typically spread over 8 cores rather than all 32. That being said, the packing of items has some benefit, as only 4 cache-line fetches per core are required to perform the global accumulation. To prevent false sharing, one can allocate an (8 times) bigger array and leave some space between items so that only 1 item is stored per cache line. The best strategy on your target processor may be to use a tree-based atomic reduction like the one the ICC OpenMP runtime uses. Ideally, the sum reduction and the barrier could be merged for better performance. This is what the MPI implementation can do internally (MPI_Allreduce).
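As a concrete illustration of the padding idea, here is a sketch of OMP_Allreduce_dsum with one slot per 64-byte cache line. The slot_t type and the _padded names are illustrative additions, not part of the original code, and the sketch keeps the trade-off mentioned above: writes no longer invalidate other threads' lines, but the gather now fetches np cache lines per core instead of np/8.

#include <stdio.h>
#include <omp.h>

#define CACHE_LINE 64    /* bytes per cache line on current x86 processors */

typedef struct { double val; char pad[CACHE_LINE - sizeof(double)]; } slot_t;

static slot_t darr_padded[2][64];
static int    nreduce = 0;
#pragma omp threadprivate(nreduce)

double OMP_Allreduce_dsum_padded(double loc_dot, int tid, int np)
{
    darr_padded[nreduce][tid].val = loc_dot;  /* each thread writes its own cache line   */
    #pragma omp barrier
    double dsum = 0;
    for (int i = 0; i < np; i++)
    {
        dsum += darr_padded[nreduce][i].val;  /* the gather touches np lines (np/8 before) */
    }
    nreduce = 1 - nreduce;                    /* double buffering, as in the question     */
    return dsum;
}

int main(void)
{
    int np = omp_get_max_threads();
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        double dot = OMP_Allreduce_dsum_padded((double)(tid + 1), tid, np);
        #pragma omp master
        printf("dot = %f (expected %f)\n", dot, np * (np + 1) / 2.0);
    }
    return 0;
}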

Note that all implementations suffer from the very frequent thread synchronization. This is a problem because context switches regularly occur on some cores due to operating-system/hardware events (network, storage devices, users, system processes, etc.). One critical issue is frequency scaling on any modern x86 processor: not all cores will run at the same frequency, and their frequency changes over time. The slowest thread slows down all the others because of the barrier. In the worst case, some threads may wait passively, enabling some cores to sleep (C-states) and then take more time to wake up, further slowing down the others depending on the platform configuration.

The takeaway is: the more synchronized a code is, the lower its scaling and the more challenging its optimization.
