
Beginner in OpenMP - Problems in cycle

I am a beginner in OpenMP and I am trying to parallelize the following function:

void calc(double *x, int *l[N], int d[N], double *z){

    #pragma omp parallel for
    for(int i=0; i<N; i++){

        double tmp = d[i]>0 ? ((double) z[i] / d[i]) : ((double) z[i] / N);

        for(int j=0; j<d[i]; j++)
            x[l[i][j]] += tmp;

    }

}

But for N = 100000 the sequential time is about 50 seconds, and with 2 or more threads it goes up to several minutes.

Each row of the l array of pointers holds a random number of elements between 1 and 30 (the count is given by the corresponding position in the d array), and the element values vary between 0 and N. So I know I have a load-balance problem, but if I add guided or dynamic scheduling (or even auto) the times are even worse.

I also know that the problem is obviously in the accesses to the x array, because it is not accessed contiguously. Is there a way to fix this problem and get some kind of speedup in this function?

Thanks in advance!

Assuming you can afford to use some extra space, you can probably speed this up.

The basic idea would be to create a separate array of sums for each thread. When the threads are all done, add up the corresponding elements of those separate copies, and finally add each element of that result to the corresponding element of the original x.
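The idea above can be sketched roughly as follows. This is only a minimal illustration, not a tuned implementation: the function name calc_private and the explicit n parameter (standing in for the macro N) are assumptions made for the example, and it assumes x has n elements, as the question describes.

```c
#include <assert.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical variant of calc(): n replaces the macro N from the question.
 * Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int calc_private(int n, double *x, int *const *l, const int *d, const double *z)
{
    assert(n > 0);

    int nthreads = 1;
#ifdef _OPENMP
    nthreads = omp_get_max_threads();
#endif

    /* One zero-initialised accumulator of n doubles per thread. */
    double *partial = calloc((size_t)nthreads * (size_t)n, sizeof *partial);
    if (partial == NULL)
        return -1;

    #pragma omp parallel num_threads(nthreads)
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        double *mine = partial + (size_t)tid * (size_t)n;

        /* Each thread scatters only into its own copy, so no two threads
         * ever write to the same memory location. */
        #pragma omp for
        for (int i = 0; i < n; i++) {
            double tmp = d[i] > 0 ? z[i] / d[i] : z[i] / n;
            for (int j = 0; j < d[i]; j++)
                mine[l[i][j]] += tmp;
        }
        /* The implicit barrier after the loop above guarantees that all
         * partial sums are complete before they are combined. */

        /* Combine the copies: each x[k] is updated by exactly one thread. */
        #pragma omp for
        for (int k = 0; k < n; k++) {
            double sum = 0.0;
            for (int t = 0; t < nthreads; t++)
                sum += partial[(size_t)t * (size_t)n + k];
            x[k] += sum;
        }
    }

    free(partial);
    return 0;
}
```

Compiled without OpenMP support the pragmas are ignored and the function runs serially with a single copy. The extra memory is n doubles per thread, which is the space cost mentioned above.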

As long as x is fairly small, that's probably pretty reasonable. If x might be really huge, it may become impractical quickly. Given that L is apparently only about 30 elements, it sounds like x is probably limited to around 30 elements as well (that can actually be used while running this code, anyway). If that's correct, then having a separate copy for each thread shouldn't cause a major problem.
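If the compiler supports OpenMP 4.5 or later, the same per-thread-copy scheme can also be requested through a reduction over an array section, which leaves the copying and summing to the runtime. The sketch below is an assumption-laden illustration: calc_reduction and the explicit n parameter are names invented for the example, and x is assumed to have n elements.

```c
#include <assert.h>

/* Hypothetical variant of calc(): n replaces the macro N from the question.
 * reduction(+ : x[:n]) gives every thread a private, zero-initialised copy
 * of x[0..n-1] and adds all copies back into x when the loop finishes. */
void calc_reduction(int n, double *x, int *const *l, const int *d, const double *z)
{
    assert(n > 0);

    #pragma omp parallel for reduction(+ : x[:n])
    for (int i = 0; i < n; i++) {
        double tmp = d[i] > 0 ? z[i] / d[i] : z[i] / n;
        for (int j = 0; j < d[i]; j++)
            x[l[i][j]] += tmp;   /* updates this thread's private copy */
    }
}
```

The memory cost is the same as with hand-written copies: one array of n doubles per thread. Implementations commonly place these private copies on each thread's stack, so a very large x may require raising the thread stack size (for example through the OMP_STACKSIZE environment variable).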

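Notice: the technical posts on this site follow the CC BY-SA 4.0 license. If you repost this content, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.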

 