Parallelizing C++ code using OpenMP, calculations actually slower in parallel

I have the following code that I want to parallelize:

int ncip( int dim, double R)
{   
    int i;
    int r = (int)floor(R);
    if (dim == 1)
    {   
        return 1 + 2*r; 
    }
    int n = ncip(dim-1, R); // last coord 0

    #pragma omp parallel for
    for(i=1; i<=r; ++i)
    {   
        n += 2*ncip(dim-1, sqrt(R*R - i*i) ); // last coord +- i
    }

    return n;
}

The program execution time when run without OpenMP is 6.956 s. When I try to parallelize the for loop, the execution time is greater than 3 minutes (and that's only because I killed it myself). What am I doing wrong in parallelizing this code?

Second attempt:

int ncip( int dim, double R)
{
    int i;
    int r = (int)floor( R);
    if ( dim == 1)
    {
        return 1 + 2*r;
    }

    #pragma omp parallel
    {
        int n = ncip( dim-1, R); // last coord 0
        #pragma omp for reduction (+:n)
        for( i=1; i<=r; ++i)
        {
            n += 2*ncip( dim-1, sqrt( R*R - i*i) ); // last coord +- i
        }
    }

    return n;
}

You are doing that wrong!

(1) There are data races on variable n. If you want to parallelize code that has writes to the same memory location, you must use a reduction (on the for), atomic, or critical to avoid data hazards.
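
For point (1) in isolation, here is a minimal sketch (my code, not from the question) of what the reduction fix alone looks like; note that every recursive call still opens a parallel region, which is exactly the issue described in point (2) below:

#include <math.h>

// Sketch: reduction(+:n) gives each thread a private copy of n that is
// summed once at the end, removing the race on n -- but the recursion
// still spawns a parallel region per call, so this alone stays slow.
int ncip_reduction(int dim, double R)
{
    int r = (int)floor(R);
    if (dim == 1) return 1 + 2*r;

    int n = ncip_reduction(dim-1, R); // last coord 0

    #pragma omp parallel for reduction(+:n)
    for (int i = 1; i <= r; ++i)
        n += 2*ncip_reduction(dim-1, sqrt(R*R - (double)i*i)); // last coord +- i

    return n;
}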

(2) You probably have nested parallelism enabled, so the program creates a new parallel region every time you call the function ncip. This should be the main problem. For recursive functions I advise you to create just one parallel region and then use pragma omp task.
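
You can check whether nested parallelism is enabled from the OpenMP runtime. A quick sketch using omp_get_nested / omp_set_nested (deprecated in OpenMP 5.0 in favor of the max-active-levels API, but still widely supported):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    // 0 means inner parallel regions execute with a single thread
    printf("nested parallelism enabled: %d\n", omp_get_nested());
    omp_set_nested(0); // make sure nested regions run serially
    return 0;
}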

Do not parallelize with #pragma omp for; try #pragma omp task instead. Look at this example:

int ncip(int dim, double R){
    ...
    #pragma omp task
    ncip(XX, XX);

    #pragma omp taskwait
    ...
}

int main(int argc, char *argv[]) {
    #pragma omp parallel
    {
        #pragma omp single 
        ncip(XX, XX);
    } 
    return(0); 
}

UPDATE:

//Detailed version (without omp for and data races)
int ncip(int dim, double R){
    int n, r = (int)floor(R);

    if (dim == 1) return 1 + 2*r;

    n = ncip(dim-1, R); // last coord 0

    for(int i=1; i<=r; ++i){
        // n is a function-local variable, so it defaults to firstprivate
        // inside the task; it must be marked shared or the atomic update
        // would modify a private copy and be lost
        #pragma omp task shared(n)
        {
            int aux = 2*ncip(dim-1, sqrt(R*R - i*i) ); // last coord +- i

            #pragma omp atomic
            n += aux;
        }
    }
    #pragma omp taskwait
    return n;
}
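
Either version needs OpenMP enabled at compile time, e.g. g++ -O2 -fopenmp ncip.cpp -o ncip (the file name here is an assumption for illustration).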

PS: You will not get a speedup from this, because the overhead of creating a task is bigger than the work of a single task. The best thing you can do is rewrite this algorithm as an iterative version and then try to parallelize it.
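
Until then, one pragmatic middle ground (my sketch, not part of the answer above) is to keep the recursion entirely serial and parallelize only the outermost loop, so each iteration is a coarse chunk of work that amortizes the parallel overhead once:

#include <math.h>

// Fully serial recursion: no nested parallel regions, no tasks.
static int ncip_serial(int dim, double R)
{
    int r = (int)floor(R);
    if (dim == 1) return 1 + 2*r;
    int n = ncip_serial(dim-1, R);
    for (int i = 1; i <= r; ++i)
        n += 2*ncip_serial(dim-1, sqrt(R*R - (double)i*i));
    return n;
}

// Parallelism only at the top level; dynamic schedule because the work
// per iteration shrinks as i grows toward r.
int ncip(int dim, double R)
{
    int r = (int)floor(R);
    if (dim == 1) return 1 + 2*r;
    int n = ncip_serial(dim-1, R);
    #pragma omp parallel for reduction(+:n) schedule(dynamic)
    for (int i = 1; i <= r; ++i)
        n += 2*ncip_serial(dim-1, sqrt(R*R - (double)i*i));
    return n;
}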
