
Make a reduction with OpenMP to compute the final summed value of a matrix element

I have the following double loop, in which I compute the matrix element Fisher_M[FX][FY]. I tried to optimize it by adding the OpenMP pragma #pragma omp parallel for schedule(dynamic, num_threads), but the gain is not as good as expected.

Is there a way to do a sum reduction with OpenMP to compute the element Fisher_M[FX][FY] quickly? Or maybe this is doable with MAGMA or CUDA?

#define num_threads 8

#pragma omp parallel for schedule(dynamic, num_threads)
for(int i=0; i<CO_CL_WL.size(); i++){
    for(int j=0; j<CO_CL_WL.size(); j++){
        if( CO_CL_WL[i][j] != 0 || CO_CL_WL_D[i][j] != 0){
          Fisher_M[FX][FY] += CO_CL_WL[i][j]*CO_CL_WL_D[i][j];
        }
    }
}

Your code has a race condition at the line Fisher_M[FX][FY] += ... . A reduction can be used to solve it:

double sum=0;  //change the type as needed
#pragma omp parallel for reduction(+:sum) 
for(int i=0; i<CO_CL_WL.size(); i++){
    for(int j=0; j<CO_CL_WL.size(); j++){
        if( CO_CL_WL[i][j] != 0 || CO_CL_WL_D[i][j] != 0){
          sum += CO_CL_WL[i][j]*CO_CL_WL_D[i][j];
        }
    }
}
Fisher_M[FX][FY] += sum;

Note that this code is memory-bound, not computationally expensive, so the performance gain from parallelization may be smaller than expected (and depends on your hardware).

PS: Why do you need the condition if( CO_CL_WL[i][j] != 0 || CO_CL_WL_D[i][j] != 0)? If either of them is zero, the product is zero and the sum will not change. If you remove it, the compiler can generate much better vectorized code.
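As a rough illustration of the PS above, here is a minimal sketch of the branch-free version of the loop. It assumes the two matrices are std::vector<std::vector<double>> with identical shapes; the Matrix alias and the elementwise_product_sum helper are hypothetical names, not part of the original code. The inner #pragma omp simd is only a hint and is optional.

```cpp
#include <vector>

// Assumption: the matrices are stored as vectors of rows of double.
using Matrix = std::vector<std::vector<double>>;

// Hypothetical helper: returns the sum over all i, j of a[i][j] * b[i][j].
// Assumes a and b have identical shapes.
double elementwise_product_sum(const Matrix& a, const Matrix& b) {
    double sum = 0.0;
    const int rows = static_cast<int>(a.size());

    // Each thread accumulates a private copy of sum; the copies are
    // combined when the loop ends, so there is no race condition.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < rows; ++i) {
        const std::vector<double>& ra = a[i];
        const std::vector<double>& rb = b[i];
        const int cols = static_cast<int>(ra.size());

        double row_sum = 0.0;
        // No if() in the body: a plain multiply-add loop is easy to vectorize.
        #pragma omp simd reduction(+:row_sum)
        for (int j = 0; j < cols; ++j) {
            row_sum += ra[j] * rb[j];
        }
        sum += row_sum;
    }
    return sum;
}
```

With this helper, the update in the question would read Fisher_M[FX][FY] += elementwise_product_sum(CO_CL_WL, CO_CL_WL_D);. If the code is compiled without OpenMP support, the pragmas are ignored and the function still returns the same sum serially.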

PS2: In the schedule(dynamic, num_threads) clause, the second parameter is the chunk size, not the number of threads used. I suggest removing it in your case. If you wish to specify the number of threads, add a num_threads clause or call the omp_set_num_threads function.
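To make the distinction in PS2 concrete, the sketch below puts the thread count in a num_threads clause and keeps it separate from the scheduling. The function names sum_with_threads and set_default_thread_count, and the flat std::vector<double> input, are assumptions made for illustration only.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>  // only needed for omp_set_num_threads
#endif

// Hypothetical helper: sums v using n_threads threads (n_threads must be > 0).
double sum_with_threads(const std::vector<double>& v, int n_threads) {
    double sum = 0.0;
    const int n = static_cast<int>(v.size());

    // num_threads(...) sets the team size for this parallel region.
    // A second argument to schedule(...), if given, would be the chunk size.
    #pragma omp parallel for num_threads(n_threads) reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += v[i];
    }
    return sum;
}

// Alternative: set the default team size for later parallel regions
// that do not carry their own num_threads clause.
void set_default_thread_count(int n_threads) {
#ifdef _OPENMP
    omp_set_num_threads(n_threads);
#else
    (void)n_threads;  // no-op when compiled without OpenMP
#endif
}
```

The clause affects only the region it is attached to, while omp_set_num_threads changes the default for subsequent regions, so the clause is usually the less surprising choice inside library code.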
