
Difference between the several ways to parallelize nested for loops in C, C++ using OpenMP

I've just started studying parallel programming with OpenMP, and there is a subtle point about nested loops. I wrote a simple matrix multiplication code and checked that the result is correct. But there are actually several ways to parallelize this for loop, which may differ in low-level details, and I want to ask about them.

At first, I wrote the code below, which multiplies two matrices A and B and assigns the result to C.

for(i = 0; i < N; i++)
{
    for(j = 0; j < N; j++)
    {
        sum = 0;
#pragma omp parallel for reduction(+:sum)
        for(k = 0; k < N; k++)
        {
            sum += A[i][k]*B[k][j];
        }
        C[i][j] = sum;
    }
}

It works, but it takes a really long time. I found out that because of the location of the parallel directive, the parallel region is constructed N² times. I noticed this from the huge increase in user time when I used the Linux time command.

Next, I tried the code below, which also worked.

#pragma omp parallel for private(i, j, k, sum)
for(i = 0; i < N; i++)
{
    for(j = 0; j < N; j++)
    {
        sum = 0;
        for(k = 0; k < N; k++)
        {
            sum += A[i][k]*B[k][j];
        }
        C[i][j] = sum;
    }
}

With the code above, the elapsed time decreased from 72.720 s for sequential execution to 5.782 s for parallel execution. That seems like a reasonable result, since I ran it on 16 cores.

But the flow of the second code is not easy for me to picture. I assumed that if we privatize all loop variables, the program treats the nested loop as one large loop of size N³. That can easily be checked by executing the code below.

#pragma omp parallel for private(i, j, k)
for(i = 0; i < N; i++)
{
    for(j = 0; j < N; j++)
    {
        for(k = 0; k < N; k++)
        {
            printf("%d, %d, %d\n", i, j, k);
        }
    }
}

The printf was executed N³ times.

But in my second matrix multiplication code, there are accesses to sum right before and after the innermost loop, and that makes it hard for me to unfold the loop in my mind. The third code I wrote unfolds easily.

To summarize, I want to know what really happens behind the scenes in my second matrix multiplication code, especially regarding the value of sum. I would also appreciate recommendations for tools to observe the flow of multithreaded programs written with OpenMP.

omp for by default only applies to the next directly nested loop. The inner loops are not affected at all. This means you can think about your second version like this:

// Example for two threads
with one thread execute
{
    // declare private variables "locally"
    int i, j, k, sum;
    for(i = 0; i < N / 2; i++) // loop range changed
    {
        for(j = 0; j < N; j++)
        {
            sum = 0;
            for(k = 0; k < N; k++)
            {
                sum += A[i][k]*B[k][j];
            }
            C[i][j] = sum;
        }
    }
}
with the other thread execute
{
    // declare private variables "locally"
    int i, j, k, sum;
    for(i = N / 2; i < N; i++) // loop range changed
    {
        for(j = 0; j < N; j++)
        {
            sum = 0;
            for(k = 0; k < N; k++)
            {
                sum += A[i][k]*B[k][j];
            }
            C[i][j] = sum;
        }
    }
}

You can simplify all reasoning about variables with OpenMP by declaring them as locally as possible. I.e., instead of the explicit private declarations, use:

#pragma omp parallel for
for(int i = 0; i < N; i++)
{
    for(int j = 0; j < N; j++)
    {
        int sum = 0;
        for(int k = 0; k < N; k++)
        {
            sum += A[i][k]*B[k][j];
        }
        C[i][j] = sum;
    }
}

This way you can see the private scope of each variable more easily.

In some cases it can be beneficial to apply parallelism to multiple loops. This is done by using collapse, i.e.

#pragma omp parallel for collapse(2)
for(int i = 0; i < N; i++)
{
    for(int j = 0; j < N; j++)
    {
        // ... same body as above ...
    }
}
You can imagine that this works via a transformation like:

#pragma omp parallel for
for (int ij = 0; ij < N * N; ij++)
{
    int i = ij / N;
    int j = ij % N;
    // ... same body as above ...
}

A collapse(3) would not work for this loop because of the sum = 0 in between the j and k loops.

One more detail:

#pragma omp parallel for

is shorthand for

#pragma omp parallel
#pragma omp for

The first creates the threads; the second shares the work of a loop among all threads reaching that point. This may not matter for understanding right now, but there are use cases for which it does. For instance, you could write:

#pragma omp parallel
for(int i = 0; i < N; i++)
{
    #pragma omp for
    for(int j = 0; j < N; j++)
    {
        // ... same body as above ...
    }
}
I hope this sheds some light on what happens there from a logical point of view.
