
OpenMP code far slower than serial - memory or thread overhead bottleneck?

I am trying to parallelize (OpenMP) some scientific C++ code where the bulk (>95%) of the CPU time is spent calculating a nasty (and unavoidable) O(N^2) interaction for on the order of N~200 different particles. This calculation is repeated for 1e10 time steps. I have tried various different configurations with OpenMP, each slower than the serial code by some margin (at least an order of magnitude), with poor scaling as additional cores are added.

Below is a sketch of the pertinent code, with a representative dummy data hierarchy Tree->Branch->Leaf. Each Leaf object stores its own position and velocities for the current and previous three time steps, amongst other things. Each Branch then stores a collection of Leaf objects and each Tree stores a collection of Branch objects. This data structure works very well for the complex but less CPU-intensive calculations that must also be performed at each time step (and that have taken months to perfect).

#include <omp.h>

#pragma omp parallel num_threads(16) // also tried 2, 4 etc - little difference - hoping that placing this line here spawns the thread pool at the onset rather than at every step
{
while(i < t){
    #pragma omp master
    {
       /* do other calculations on single core, output etc.  */
       Tree.PreProcessing();
       /* PreProcessing can drastically change data for certain conditions, but only at 3 or 4 of the 1e10 time steps */
       Tree.Output();
    }
    #pragma omp barrier
    #pragma omp for schedule(static) nowait
    for(int k=0; k < size; k++){
         /* do O(N^2) calc that requires position of all other leaves */
         Tree.CalculateInteraction(Branch[k]);
    }
    /* return to single core to finish time step */
    #pragma omp master
    {
        /* iterate forwards */
        Tree.PropagatePositions();
        i++;
    }
    #pragma omp barrier
} // end while over time steps
} // end parallel region

Very briefly, the CPU-hog function does this:

void Tree::CalculateInteraction(Leaf* A){
    // for all branches B in tree {
    //     for all leaves Q in B {
               if( /* condition between A and Q */ ){ /* skip */ }
               else{
                   // find displacement D of A and Q
                   // find displacement L of A and "A-1"
                   // take the cross product of the two displacements
                   // add the cross product to the velocity of leaf A
                   for(int j(0); j!=3; j++){
                       A->Vel[j] += constant * (D_cross_L)[j];
                   }
               }
    //     } end loop over leaves Q
    // } end loop over branches B
}

My question is whether this crippling loss of performance is due to OpenMP thread-management overhead dominating, or whether it is a case of a data hierarchy designed with no thought to parallelism?

I should note that each step is timed to be considerably longer in parallel than in serial, so this isn't some initialisation overhead issue; the two versions have been tested on calculations that take 1 vs 10 hours, and the code will eventually be applied to serial calculations that can take 30 hours (for which even a 2x speed-up would be very beneficial). Also, it may be worth knowing that I'm using g++ 5.2.0 with -fopenmp -march=native -m64 -mfpmath=sse -Ofast -funroll-loops.

I am new to OpenMP, so any tips would be greatly appreciated; please let me know if anything should be clarified.

Your problem is most likely false sharing due to your use of linked lists for the nodes. With that memory layout, you not only have the problem of a cache miss almost every time you walk the tree to another node (as mentioned by halfflat).

A more severe problem is that tree nodes accessed and modified from different threads may actually be close in memory. If they share a cache line, false sharing (or cache ping-pong) causes repeated re-syncing of cache lines shared between different threads.

The solution to both problems is to avoid linked data structures. They are almost always the cause of low efficiency. In your case, the solution is to first build a linked-list tree with minimal data (only what is needed to define the tree) and then to map that to another tree that doesn't use linked lists and may contain more data. This is what I do, and tree traversal is reasonably fast (a tree walk can never be really fast, since cache misses are unavoidable even with contiguous sister nodes, because parent-daughter access cannot be contiguous at the same time). A significant speed-up (factor > 2) can be obtained for the tree building if you add the particles to the new tree in the order of the old tree (this avoids cache misses).
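A minimal sketch of what such a mapping might look like (the node layout below is an assumption for illustration, not the answerer's actual code): nodes live in one contiguous std::vector and refer to each other by index rather than by pointer, so sister nodes built in tree order end up adjacent in memory.

#include <cstddef>
#include <vector>

// Hypothetical flattened tree node: links are indices into one vector,
// and the payload is stored inline rather than behind a pointer.
struct FlatNode {
    static constexpr std::size_t npos = static_cast<std::size_t>(-1);
    std::size_t parent      = npos;   // index of the parent node
    std::size_t firstChild  = npos;   // index of the first child, npos for a leaf
    std::size_t nextSibling = npos;   // index of the next sibling
    double payload[3] = {0, 0, 0};    // per-node data stored inline
};

struct FlatTree {
    std::vector<FlatNode> nodes;      // all nodes contiguous in memory

    // Visit the children of node i without chasing heap pointers.
    template <typename Visit>
    void forEachChild(std::size_t i, Visit&& visit) const {
        for (std::size_t c = nodes[i].firstChild; c != FlatNode::npos;
             c = nodes[c].nextSibling)
            visit(nodes[c]);
    }
};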

Thanks for providing the link to the original source! I've been able to compile and get some stats on two platforms: a Xeon E5-2670 with icpc 15.0 and g++ 4.9.0, and a Core i7-4770 with g++ 4.8.4.

On the Xeon, both icpc and g++ produced code that scaled with the number of threads. I ran a shortened (3e-7 second) simulation derived from the run.in file in the distribution:

Xeon E5-2670 / icpc 15.0
threads   time   ipc
---------------------
1         17.5   2.17
2         13.0   1.53
4          6.81  1.53
8          3.81  1.52

Xeon E5-2670 / g++ 4.9.0
threads   time   ipc
---------------------
1         13.2   1.75
2          9.38  1.28
4          5.09  1.27
8          3.07  1.25

On the Core i7, I did see the ugly scaling behaviour you observed, with g++ 4.8.4:

Core i7-4770 / g++ 4.8.4
threads   time   ipc
---------------------
1          8.48  2.41
2         11.5   0.97
4         12.6   0.73

The first observation is that there is something platform-specific affecting the scaling.

I had a look in the point.h and velnl.cpp files, and noticed that you were using vector<double> variables to store 3-d vector data, including many temporaries. These all access the heap, and are a potential source of contention. Intel's OpenMP implementation uses thread-local heaps to avoid heap contention, and perhaps g++ 4.9 does too, while g++ 4.8.4 does not?

I forked the project (halfflat/vfmcppar on GitHub) and modified these files to use std::array<double,3> for these 3-d vectors; this restores scaling, and also gives much faster run times:

Core i7-4770 / g++ 4.8.4
std::array implementation
threads   time   ipc
---------------------
1          1.40  1.54
2          0.84  1.35
4          0.60  1.11

I haven't run these tests on a decent-length simulation, so some scaling could well be lost due to set-up and I/O overhead.
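For illustration, a minimal sketch of the kind of substitution involved (the cross-product helper and its names are assumptions, not the project's actual point.h code):

#include <array>
#include <vector>

// Before: each temporary 3-vector is a separate heap allocation, so every
// call touches the (shared) allocator.
std::vector<double> crossHeap(const std::vector<double>& a,
                              const std::vector<double>& b) {
    return { a[1] * b[2] - a[2] * b[1],
             a[2] * b[0] - a[0] * b[2],
             a[0] * b[1] - a[1] * b[0] };
}

// After: std::array<double,3> has a size fixed at compile time, so the
// temporary lives on the stack and threads no longer contend on the heap.
using Vec3 = std::array<double, 3>;
Vec3 crossStack(const Vec3& a, const Vec3& b) {
    return { a[1] * b[2] - a[2] * b[1],
             a[2] * b[0] - a[0] * b[2],
             a[0] * b[1] - a[1] * b[0] };
}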

The take-away point is that any shared resource can frustrate scalability, including the heap.

Performance measuring tools (like Linux perf) might give you some information about cache performance or contention; the first step in optimisation is measurement!

That said, my guess is that this is a data layout problem coupled with the implementation of the velocity update: at any given time, each thread is trying to load the data associated with (essentially) a random leaf, which is a recipe for cache thrashing. How large is the data associated with a leaf, and is it arranged to be adjacent in memory?

If it is indeed a cache issue (do measure!) then it may well be resolved by tiling the N^2 problem: rather than accumulating the velocity delta contributed by all other leaves in one pass, accumulate it in batches. Consider splitting the N leaves into K batches for the purpose of this calculation, where each batch of leaf data fits in (say) half your cache. Then, iterating over the K^2 pairs (A, B) of batches, perform the interaction steps, that is, compute the contribution of all the leaves in batch B to the leaves in batch A; this should be possible to do in parallel over the leaves in A without thrashing the cache.

Further gains could be made by ensuring that the leaves are arranged contiguously in memory, batch by batch.
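A rough sketch of how such tiling might look (LeafData, the batch size, and the interact() kernel below are placeholders, not the question's actual code); because the leaves live in one contiguous vector, each batch is automatically contiguous in memory:

#include <algorithm>
#include <cstddef>
#include <vector>

// Placeholder per-leaf data.
struct LeafData { double pos[3]; double vel[3]; };

// Stand-in pairwise kernel: accumulate some contribution of b into a's velocity.
inline void interact(LeafData& a, const LeafData& b) {
    for (int j = 0; j < 3; ++j)
        a.vel[j] += 1e-6 * (b.pos[j] - a.pos[j]);
}

// Accumulate the O(N^2) interaction batch against batch, so the data for one
// pair of batches stays resident in cache while it is being used.
void tiledUpdate(std::vector<LeafData>& leaves, std::size_t batchSize) {
    const std::size_t n = leaves.size();
    for (std::size_t b0 = 0; b0 < n; b0 += batchSize) {          // source batch B
        const std::size_t bEnd = std::min(b0 + batchSize, n);
        #pragma omp parallel for schedule(static)
        for (std::size_t a0 = 0; a0 < n; a0 += batchSize) {      // target batch A
            const std::size_t aEnd = std::min(a0 + batchSize, n);
            for (std::size_t a = a0; a < aEnd; ++a)
                for (std::size_t b = b0; b < bEnd; ++b)
                    if (a != b) interact(leaves[a], leaves[b]);
        }
    }
}

For a fixed source batch B, different threads update disjoint target batches A, so no two threads write to the same leaf, and only read-only position data is shared across caches.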

This may be unrelated to performance, but the code as it is written now has a strange parallelization structure.

I doubt it can produce correct results, because the while loop inside the parallel region does not have barriers (omp master has no implied barrier, and omp for nowait does not have one either).

As a result, (1) threads may start the omp for loop before the master thread finishes Tree.PreProcessing(); some threads may actually execute the omp for any number of times before the master works on the single pre-processing step; (2) the master may run Tree.PropagatePositions() before other threads finish the omp for; (3) different threads may run different time steps; (4) theoretically, the master thread may finish all steps of the while loop before some thread even enters the parallel region, and thus some iterations of the omp for loop may never be executed at all.

Or am I missing something?
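For reference, a minimal self-contained skeleton (with stub classes standing in for the question's real ones, so this is a sketch of one possible ordering rather than the original code) in which every phase is explicitly ordered: omp single carries an implied barrier at its end, and omitting nowait keeps the implied barrier after the worksharing loop, so no thread can run ahead into the next phase.

#include <omp.h>
#include <vector>

// Stub stand-ins for the question's real classes, so the skeleton compiles on its own.
struct Branch {};
struct Tree {
    void PreProcessing() {}
    void Output() {}
    void CalculateInteraction(Branch*) {}
    void PropagatePositions() {}
};

void run(Tree& tree, std::vector<Branch>& branches, long long t) {
    long long i = 0;                          // shared step counter
    #pragma omp parallel
    {
        while (i < t) {
            #pragma omp single
            {   // one thread does the serial work; implied barrier at the end
                tree.PreProcessing();
                tree.Output();
            }

            #pragma omp for schedule(static)
            for (int k = 0; k < static_cast<int>(branches.size()); ++k)
                tree.CalculateInteraction(&branches[k]);
            // no nowait: implied barrier here, so the update cannot start early

            #pragma omp single
            {   // again an implied barrier, so every thread sees the updated i
                tree.PropagatePositions();
                ++i;
            }
        }
    }
}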
