
OpenMP - Nested for-loop becomes faster when having parallel before outer loop. Why?

I'm currently implementing a dynamic programming algorithm for solving knapsack problems. Therefore my code has two for-loops, an outer and an inner loop.

From a logical point of view I can parallelize the inner for-loop, since the calculations there are independent of each other. The outer for-loop can not be parallelized because of its dependencies.
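
For reference, the recurrence being filled in is the standard 0/1 knapsack one; row i reads only row i-1, which is why the cells within a row are independent of each other while the rows themselves must be computed in order:

table[i][c] = max( table[i-1][c], itemWorth + table[i-1][c - itemWeight] )   // when c >= itemWeight, else table[i-1][c]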

So this was my first approach:

for(int i=1; i < itemRows; i++){
        int itemsIndex = i-1;
        int itemWeight = integerItems[itemsIndex].weight;
        int itemWorth = integerItems[itemsIndex].worth;

        #pragma omp parallel for if(weightColumns > THRESHOLD)
        for(int c=1; c < weightColumns; c++){
            if(c < itemWeight){
                table[i][c] = table[i-1][c];
            }else{
                int worthOfNotUsingItem = table[i-1][c];
                int worthOfUsingItem = itemWorth + table[i-1][c-itemWeight];
                table[i][c] = worthOfNotUsingItem < worthOfUsingItem ? worthOfUsingItem : worthOfNotUsingItem;
            }
        }
}

The code works well and the algorithm solves the problems correctly. Then I thought about optimizing it, since I was not sure how OpenMP's thread management works. I wanted to prevent unnecessary initialization of the threads during each iteration, so I put an outer parallel block around the outer loop.

Second approach:

#pragma omp parallel if(weightColumns > THRESHOLD)
{
    for(int i=1; i < itemRows; i++){
        int itemsIndex = i-1;
        int itemWeight = integerItems[itemsIndex].weight;
        int itemWorth = integerItems[itemsIndex].worth;

        #pragma omp for
        for(int c=1; c < weightColumns; c++){
            if(c < itemWeight){
                table[i][c] = table[i-1][c];
            }else{
                int worthOfNotUsingItem = table[i-1][c];
                int worthOfUsingItem = itemWorth + table[i-1][c-itemWeight];
                table[i][c] = worthOfNotUsingItem < worthOfUsingItem ? worthOfUsingItem : worthOfNotUsingItem;
            }
        }
    }
}

This has an unwanted side effect: everything inside the parallel block is now executed n times, where n is the number of available cores. I already tried to work with the single and critical pragmas to force the outer for-loop to be executed by one thread, but then I can not have the inner loop calculated by multiple threads unless I open a new parallel block (and then there would be no gain in speed). But never mind, because the good thing is: this does not affect the result. The problems are still solved correctly.
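
For illustration, here is a minimal sketch of what that single attempt might look like (my reconstruction, not code from the question). Inside the single block only one thread executes, so there is no team left to share the inner loop with; in fact, a work-sharing omp for nested directly inside single is not allowed by OpenMP at all:

#pragma omp parallel if(weightColumns > THRESHOLD)
{
    #pragma omp single
    for(int i=1; i < itemRows; i++){
        // ... same per-item setup as above ...
        // an "#pragma omp for" here would be a work-sharing construct
        // nested inside another one (single), which OpenMP forbids
        for(int c=1; c < weightColumns; c++){
            // ... same body as above ...
        }
    }
}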

NOW THE STRANGE THING: The second approach is FASTER than the first one!

How can this be? I mean, although the outer for-loop is calculated n times (in parallel) and the inner for-loop is distributed n times among the n cores, it is faster than the first approach, which calculates the outer loop only once and distributes the workload of the inner for-loop evenly.

At first I was thinking: "well yeah, it's probably because of thread management", but then I read that OpenMP pools the instantiated threads, which speaks against my assumption. Then I disabled compiler optimization (compiler flag -O0) to check whether it had something to do with that. But this did not affect the measurement.

Can anybody shed more light on this, please?

The measured times for solving a knapsack problem containing 7500 items with a max capacity of 45000 (creating a matrix of 7500x45000, which is far above the THRESHOLD variable used in the code):

  • Approach 1: ~0.88s
  • Approach 2: ~0.52s

Thanks in advance,

phineliner

EDIT:

Measurement of a more complex problem: added 2500 items to the problem (from 7500 to 10000). More complex problems can't currently be handled for memory reasons (with 4-byte ints, a 10000x45000 table already takes roughly 1.8 GB).

  • Approach 1: ~1.19s
  • Approach 2: ~0.71s

EDIT2: I was mistaken about the compiler optimization; it does not affect the measurement. At least I can not reproduce the difference I measured before. I edited the question text accordingly.

Let's first consider what your code is doing. Essentially your code is transforming a matrix (2D array) where the values of the rows depend on the previous row but the values of the columns are independent of other columns. Let me choose a simpler example of this:

for(int i=1; i<n; i++) {
    for(int j=0; j<n; j++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}

One way to parallelize this is to swap the loops like this:

Method 1:

#pragma omp parallel for
for(int j=0; j<n; j++) {
    for(int i=1; i<n; i++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}

With this method each thread runs all n-1 iterations of i in the inner loop but only n/nthreads iterations of j. This effectively processes strips of columns in parallel. However, this method is highly cache unfriendly: for a fixed j, successive values of i touch elements that are a whole row apart in memory, so almost every access misses the cache.

Another possibility is to parallelize only the inner loop.

Method 2:

for(int i=1; i<n; i++) {
    #pragma omp parallel for 
    for(int j=0; j<n; j++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}

This essentially processes the columns within a single row in parallel, but each row sequentially. The i values are only processed by the master thread.

Another way to process the columns in parallel but each row sequentially is:

Method 3:

#pragma omp parallel
for(int i=1; i<n; i++) {
    #pragma omp for
    for(int j=0; j<n; j++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}

In this method, like method 1, each thread runs over all n-1 iterations of i. However, this method has an implicit barrier after the inner loop, which causes each thread to pause until all threads have finished a row, making this method sequential for each row like method 2.

The best solution is one which processes strips of columns in parallel like method 1 but is still cache friendly. This can be achieved using the nowait clause.

Method 4:

#pragma omp parallel
for(int i=1; i<n; i++) {
    #pragma omp for nowait
    for(int j=0; j<n; j++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}

In my tests the nowait clause does not make much difference. This is probably because the load is even (which is why static scheduling is ideal in this case). If the load were less even, nowait would probably make more of a difference.
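
To make that scheduling assumption explicit, here is method 4 again with the (default) static schedule spelled out; this sketch is mine, not part of the original answer:

#pragma omp parallel
for(int i=1; i<n; i++) {
    #pragma omp for schedule(static) nowait
    for(int j=0; j<n; j++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}

This also hints at why dropping the barrier is safe here: OpenMP guarantees that work-sharing loops with the same static schedule and the same iteration count assign the same iterations to the same threads, so a thread that runs ahead past the nowait only reads elements of row i-1 that it wrote itself.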

Here are the times in seconds for n=3000 on my four-core IVB system with GCC 4.9.2:

method 1: 3.00
method 2: 0.26 
method 3: 0.21
method 4: 0.21

This test is probably memory-bandwidth bound, so I could have chosen a better case using more computation, but nevertheless the differences are significant enough. In order to remove a bias due to creating the thread pool, I ran one of the methods without timing it first.
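
A sketch of the kind of timing harness this implies (the function name run_method is mine, not from the answer):

#include <omp.h>

// hypothetical wrapper around any one of the four method variants above
void run_method(double *a, int n);

double time_method(double *a, int n) {
    run_method(a, n);               // untimed warm-up: lets the runtime create its thread pool
    double t0 = omp_get_wtime();
    run_method(a, n);               // the run that is actually measured
    return omp_get_wtime() - t0;
}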

It's clear from the timings how cache-unfriendly method 1 is. It's also clear that method 3 is faster than method 2 and that nowait has little effect in this case.

Since method 2 and method 3 both process the columns within a row in parallel but the rows sequentially, one might expect their timings to be the same. So why do they differ? Let me make some observations:

  1. Due to a thread pool, the threads are not created and destroyed for each iteration of the outer loop of method 2, so it's not clear to me what the extra overhead is. Note that OpenMP says nothing about a thread pool. This is something that each compiler implements.

  2. The only other difference between method 3 and method 2 is that in method 2 only the master thread processes i, whereas in method 3 each thread processes a private i. But this seems too trivial to me to explain the significant difference between the methods, because the implicit barrier in method 3 causes them to sync anyway, and processing i is just a matter of an increment and a conditional test.

  3. The fact that method 3 is no slower than method 4, which processes whole strips of columns in parallel, says that the extra overhead in method 2 lies entirely in leaving and entering a parallel region for each iteration of i.

So my conclusion is that explaining why method 2 is so much slower than method 3 requires looking into the implementation of the thread pool. For GCC, which uses pthreads, this could probably be explained by creating a toy model of a thread pool, but I don't have enough experience with that yet.
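
One way to probe that overhead directly, without the matrix computation, is a micro-benchmark like the following (my sketch, not part of the original answer). It compares re-entering a parallel region on every iteration, as method 2 does, with staying inside one region and paying only a barrier per iteration, as method 3 does:

#include <stdio.h>
#include <omp.h>

int main() {
    const int reps = 100000;

    #pragma omp parallel
    { } // warm-up: create the thread pool before timing anything

    // like method 2: fork/join a parallel region on every iteration
    double t0 = omp_get_wtime();
    for (int r = 0; r < reps; r++) {
        #pragma omp parallel
        { } // empty region: only the enter/leave cost is measured
    }
    double t1 = omp_get_wtime();

    // like method 3: one region, one barrier per iteration
    #pragma omp parallel
    for (int r = 0; r < reps; r++) {
        #pragma omp barrier
    }
    double t2 = omp_get_wtime();

    printf("region enter/leave: %g us per iteration\n", 1e6 * (t1 - t0) / reps);
    printf("barrier only:       %g us per iteration\n", 1e6 * (t2 - t1) / reps);
    return 0;
}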

I think the simple reason is that since you place your #pragma omp parallel at an outer scope level (second version), the overhead of calling the threads is lower.

In other terms: in the first version you trigger thread creation itemRows times (once per iteration of the first loop), whereas in the second version you trigger it only once. And I do not know why!

I have tried to reproduce this with a simple example, using 4 threads with HT enabled:

#include <iostream>
#include <vector>
#include <algorithm>
#include <omp.h>

int main()
{
    std::vector<double> v(10000);
    // fill v with 0, 1, 2, ... via a stateful lambda
    std::generate(v.begin(), v.end(), []() { static double n{0.0}; return n++; });

    double start = omp_get_wtime();

    #pragma omp parallel // version 2: one parallel region around the whole outer loop
    for (auto& el : v)
    {
        double t = el - 1.0;
        // #pragma omp parallel // version 1: a new parallel region per outer iteration
        #pragma omp for
        for (size_t i = 0; i < v.size(); i++)
        {
            // note: all threads update the same el here; good enough for
            // comparing timings, not for producing meaningful values
            el += v[i];
            el -= t;
        }
    }
    double end = omp_get_wtime();

    std::cout << "   wall time : " << end - start << std::endl;
    // for (const auto& el :  v) { std::cout << el << ";"; }

}

Comment/uncomment according to the version you want. If you compile with -std=c++11 -fopenmp -O2, you should see that version 2 is faster.

Demo on Coliru

Version 1 wall time : 0.512144

Version 2 wall time : 0.333664
