Parallel for inside a while loop using OpenMP in C

I'm trying to run a parallel for inside a while loop, something like this:

while(!End){
    for(...;...;...) // the parallel for

    ...
    // serial code
}

The for loop is the only parallel section of the while loop. If I do this, I get a lot of overhead:

cycles = 0;
while(!End){ // approx. 1e9 iterations
    #pragma omp parallel for
    for(i=0;i<N;i++) // the parallel for, approx. 256 iterations
        if(time[i] == cycles){
           if (wbusy[i]){
               wbusy[i] = 0;
               wfinished[i] = 1;
           }
        }


    // serial code
    ++cycles;    

}

The iterations of the for loop are independent of each other.

There are dependencies between the serial code and the parallel code.

Normally one doesn't have to worry too much about putting parallel regions inside loops: modern OpenMP implementations are pretty efficient about reusing thread teams, and as long as there's plenty of work in the loop you're fine. But here, with an outer loop count of ~1e9 and an inner loop count of ~256, and very little work being done per iteration, the overhead is likely comparable to or larger than the work itself, and performance will suffer.

So there will be a noticeable difference between this:

cycles = 0;
while(!End){ // approx. 1e9 iterations
    #pragma omp parallel for
    for(i=0;i<N;i++) // the parallel for, approx. 256 iterations
        if(time[i] == cycles){
           if (wbusy[i]){
               wbusy[i] = 0;
               wfinished[i] = 1;
           }
        } 

    // serial code
    ++cycles;    
}

and this:

cycles = 0;
#pragma omp parallel
while(!End){ // approx. 1e9 iterations
    #pragma omp for
    for(i=0;i<N;i++) // the parallel for, approx. 256 iterations
        if(time[i] == cycles){
           if (wbusy[i]){
               wbusy[i] = 0;
               wfinished[i] = 1;
           }
        } 

    // serial code
    #pragma omp single 
    {
      ++cycles;    
    }
}
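
As a rough, self-contained sketch of the second variant: the function name `run_cycles`, its parameters, and the array name `when` (used in place of `time` to avoid clashing with the C library's `time()`) are assumptions made for illustration and are not part of the original post. The parallel region is entered once, every thread executes the while loop, and the serial part sits in a `single` block whose implicit barrier keeps `cycles` consistent across threads. Compiled without OpenMP support, the pragmas are ignored and the function runs serially with the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not from the original post: runs `ncycles` cycles over
 * `n` tasks and returns how many tasks ended up marked as finished. */
long run_cycles(long ncycles, int n, const long *when, int *wbusy, int *wfinished)
{
    assert(n >= 0 && when != NULL && wbusy != NULL && wfinished != NULL);

    long cycles = 0; /* shared; written only inside the single block */

    #pragma omp parallel
    {
        /* Every thread runs this loop; the thread team is created only once. */
        while (cycles < ncycles) {
            #pragma omp for
            for (int i = 0; i < n; i++) {
                if (when[i] == cycles && wbusy[i]) {
                    wbusy[i] = 0;
                    wfinished[i] = 1;
                }
            }
            /* Implicit barrier here: all updates for this cycle are done. */

            #pragma omp single
            {
                /* Serial code: executed by exactly one thread. */
                ++cycles;
            }
            /* Implicit barrier here: all threads see the new value of cycles. */
        }
    }

    /* Count the finished tasks after the parallel region has ended. */
    long finished = 0;
    for (int i = 0; i < n; i++)
        finished += wfinished[i];
    return finished;
}
```

Note that each cycle still costs two barriers, so with only ~256 cheap iterations per cycle this reduces the overhead rather than eliminating it.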

But really, that scan across the time array every iteration is unfortunately both (a) slow and (b) not enough work to keep multiple cores busy; it is memory-bound. With more than a couple of threads you will actually get worse performance than serial, even ignoring the overheads, just because of memory contention. Admittedly what you have posted here is just an example, not your real code, but why not preprocess the time array so that you only need to check when the next task is ready to update:

#include <stdio.h>
#include <stdlib.h>

struct tasktime_t {
    long int time;
    int task;
};

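int stime_compare(const void *a, const void *b) {
    const struct tasktime_t *ta = a;
    const struct tasktime_t *tb = b;
    /* Compare explicitly: returning a long difference as int can truncate. */
    return (ta->time > tb->time) - (ta->time < tb->time);
}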

int main(int argc, char **argv) {
    const int n=256;
    const long int niters = 100000000L;
    long int time[n];
    int wbusy[n];
    int wfinished[n];

    for (int i=0; i<n; i++) {
        time[i] = rand() % niters;
        wbusy[i] = 1;
        wfinished[i] = 0;
    }

    struct tasktime_t stimes[n];

    for (int i=0; i<n; i++) {
        stimes[i].time = time[i];
        stimes[i].task = i;
    }

    qsort(stimes, n, sizeof(struct tasktime_t), stime_compare);

    long int cycles = 0;
    int next = 0;
    while(cycles < niters){ // niters iterations (1e8 in this example)
        while ( (next < n) && (stimes[next].time == cycles) ) {
           int i = stimes[next].task;
           if (wbusy[i]){
               wbusy[i] = 0;
               wfinished[i] = 1;
           }
           next++;
        }

        ++cycles;
    }

    return 0;
}

This is ~5 times faster than the serial version of the scanning approach (and much faster than the OpenMP versions). Even if you are constantly updating the time/wbusy/wfinished arrays in the serial code, you can keep track of their completion times using a priority queue, with each update taking O(log N) time instead of every iteration's scan taking O(N) time.
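
The priority queue mentioned above could be sketched as a small binary min-heap keyed on completion time. This is only an illustration under stated assumptions: the names `minheap`, `heap_push`, `heap_pop`, `heap_min_time`, and the fixed `HEAP_CAPACITY` are hypothetical and not part of the original answer; only `struct tasktime_t` is taken from the code above.

```c
#include <assert.h>
#include <stddef.h>

#define HEAP_CAPACITY 256 /* assumed upper bound on pending tasks */

struct tasktime_t {
    long int time;
    int task;
};

/* Binary min-heap ordered by completion time (hypothetical helper type). */
struct minheap {
    struct tasktime_t items[HEAP_CAPACITY];
    size_t size;
};

static void heap_swap(struct tasktime_t *a, struct tasktime_t *b) {
    struct tasktime_t tmp = *a;
    *a = *b;
    *b = tmp;
}

/* Insert an entry in O(log N); returns 0 on success, -1 if the heap is full. */
int heap_push(struct minheap *h, long int time, int task) {
    if (h->size == HEAP_CAPACITY) return -1;
    size_t i = h->size++;
    h->items[i].time = time;
    h->items[i].task = task;
    while (i > 0) { /* sift the new entry up towards the root */
        size_t parent = (i - 1) / 2;
        if (h->items[parent].time <= h->items[i].time) break;
        heap_swap(&h->items[parent], &h->items[i]);
        i = parent;
    }
    return 0;
}

/* Time of the earliest pending entry; the heap must be non-empty. */
long int heap_min_time(const struct minheap *h) {
    assert(h->size > 0);
    return h->items[0].time;
}

/* Remove the earliest entry in O(log N); returns 0, or -1 if the heap is empty. */
int heap_pop(struct minheap *h, struct tasktime_t *out) {
    if (h->size == 0) return -1;
    *out = h->items[0];
    h->items[0] = h->items[--h->size];
    size_t i = 0;
    for (;;) { /* sift the moved entry down to restore heap order */
        size_t left = 2 * i + 1, right = left + 1, smallest = i;
        if (left < h->size && h->items[left].time < h->items[smallest].time)
            smallest = left;
        if (right < h->size && h->items[right].time < h->items[smallest].time)
            smallest = right;
        if (smallest == i) break;
        heap_swap(&h->items[i], &h->items[smallest]);
        i = smallest;
    }
    return 0;
}
```

In the cycle loop, the inner `while` would then pop entries for as long as the heap is non-empty and `heap_min_time` equals `cycles`, and the serial code would call `heap_push` whenever it schedules a task with a new completion time.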
