
OpenMP iteration for loop in parallel region

Sorry if the title's a bit unclear. I don't quite know how to word this.

I'm wondering if there's any way I can do the following:

#pragma omp parallel
{
    for (int i = 0; i < iterations; i++) {
        #pragma omp for
        for (int j = 0; j < N; j++)
            // Do something
    }
}

Ignoring things such as omitting private specifiers in the for loop, is there any way that I can fork threads outside of my outer loop so that I can just parallelize the inner loop? From my understanding (please do correct me if I'm wrong), all threads will execute the outer loop. I'm unsure about the behavior of the inner loop, but I think the for will distribute chunks to each thread that encounters it.

What I want to do is not have to fork/join iterations times but just do it once in the outer loop. Is this the right strategy?

What if there were another outer loop that shouldn't be parallelized? That is...

#pragma omp parallel
{

    for (int i = 0; i < iterations; i++) {
        for(int k = 0; k < innerIterations; k++) {
            #pragma omp for
            for (int j = 0; j < N; j++)
                // Do something

            // Do something else
        }
    }
}

It'd be great if someone could point me to an example of a large application parallelized using OpenMP, so that I could better understand the strategies to employ. I can't seem to find any.

Clarification: I'm looking for solutions that do not change loop ordering or involve blocking, caching, and general performance considerations. I want to understand how this could be done in OpenMP on the loop structure as specified. The // Do something parts may or may not have dependencies; assume that they do and that you can't move things around.

The way you handled the two for loops looks right to me, in the sense that it achieves the behavior you wanted: the outer loop is not parallelized, while the inner loop is.

To better clarify what happens, I'll add some notes to your code:

#pragma omp parallel
{
  // Here you have a certain number of threads, let's say M
  for (int i = 0; i < iterations; i++) {
        // Each thread enters this region and executes all the iterations 
        // from i = 0 to i < iterations. Note that i is a private variable.
        #pragma omp for
        for (int j = 0; j < N; j++) {
            // What happens here is shared among threads so,
            // according to the scheduling you choose, each thread
            // will execute a particular portion of your N iterations
        } // IMPLICIT BARRIER             
  }
}

The implicit barrier is a synchronization point where threads wait for each other. As a general rule of thumb, it is therefore preferable to parallelize outer loops rather than inner loops, as this creates a single synchronization point for all iterations*N iterations (instead of the iterations synchronization points you are creating above).

I'm not sure I can answer your question. I have only been using OpenMP for a few months, but when I try to answer questions like this I run some hello-world printf tests like the one below. I think that may help answer your question. Also try #pragma omp for nowait and see what happens.

Just make sure that // Do something and // Do something else don't write to the same memory address and create a race condition. Also, if you're doing a lot of reading and writing, you need to think about how to use the cache efficiently.

#include <stdio.h>
#include <omp.h>
void loop(const int iterations, const int N) {
    #pragma omp parallel
    {
        int start_thread = omp_get_thread_num();
        printf("start thread %d\n", start_thread);
        for (int i = 0; i < iterations; i++) {
            printf("\titeration %d, thread num %d\n", i, omp_get_thread_num());
            #pragma omp for
            for (int j = 0; j < N; j++) {
                printf("\t\t inner loop %d, thread num %d\n", j, omp_get_thread_num());
            }
        }
    }
}

int main() {
    loop(2,30);
}

In terms of performance, you might want to consider fusing your loops like this:

#pragma omp for
for(int n=0; n<iterations*N; n++) {
    int i = n/N;
    int j = n%N;    
    //do something as function of index i and j
}

It is difficult to answer since it really depends on the dependencies inside your code. But a general way to solve this is to invert the nesting of the loops, like this:

#pragma omp parallel
{
    #pragma omp for
    for (int j = 0; j < N; j++) {
        for (int i = 0; i < iterations; i++) {
            // Do something
        }
    }
}

Of course, this may or may not be possible, depending on what the code inside the loop is.
