Using an OpenMP reduction to compute the final sum for a matrix element

Question

I have the following double loop, in which I compute the element Fisher_M[FX][FY] of a matrix. I tried to optimize it with the OpenMP pragma #pragma omp parallel for schedule(dynamic, num_threads), but the gain is not as good as expected.

Is there a way to do a sum reduction with OpenMP to compute the element Fisher_M[FX][FY] quickly? Or is this perhaps doable with MAGMA or CUDA?

#define num_threads 8

#pragma omp parallel for schedule(dynamic, num_threads)
for(int i=0; i<CO_CL_WL.size(); i++){
    for(int j=0; j<CO_CL_WL.size(); j++){
        if( CO_CL_WL[i][j] != 0 || CO_CL_WL_D[i][j] != 0){
          Fisher_M[FX][FY] += CO_CL_WL[i][j]*CO_CL_WL_D[i][j];
        }
    }
}

Answer 1

Your code has a race condition at the line Fisher_M[FX][FY] += ... . A reduction can be used to solve it:

double sum=0;  //change the type as needed
#pragma omp parallel for reduction(+:sum) 
for(int i=0; i<CO_CL_WL.size(); i++){
    for(int j=0; j<CO_CL_WL.size(); j++){
        if( CO_CL_WL[i][j] != 0 || CO_CL_WL_D[i][j] != 0){
          sum += CO_CL_WL[i][j]*CO_CL_WL_D[i][j];
        }
    }
}
Fisher_M[FX][FY] += sum;

Note that this code is memory bound rather than computationally expensive, so the performance gain from parallelization may be smaller than expected (and depends on your hardware).
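For anyone who wants to try the reduction in isolation, here is a minimal self-contained sketch. It assumes the matrices are stored as std::vector<std::vector<double>>, which the question does not state; the alias Matrix and the helper name product_sum are made up for illustration.

```cpp
#include <cassert>
#include <vector>

// Assumption: matrices are vectors of rows, as the indexing
// CO_CL_WL[i][j] in the question suggests.
using Matrix = std::vector<std::vector<double>>;

// Hypothetical helper: sum of a[i][j] * b[i][j] over all elements.
double product_sum(const Matrix& a, const Matrix& b) {
    assert(a.size() == b.size());
    const int rows = static_cast<int>(a.size());
    double sum = 0.0;

    // Each thread accumulates into a private copy of sum; OpenMP adds the
    // private copies into the shared variable once the loop finishes.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < rows; i++) {
        assert(a[i].size() == b[i].size());
        const int cols = static_cast<int>(a[i].size());
        for (int j = 0; j < cols; j++) {
            if (a[i][j] != 0 || b[i][j] != 0) {
                sum += a[i][j] * b[i][j];
            }
        }
    }
    return sum;
}
```

The caller would then write Fisher_M[FX][FY] += product_sum(CO_CL_WL, CO_CL_WL_D); outside the parallel region. With GCC this needs the -fopenmp flag; without it the pragma is ignored and the loop simply runs serially.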

PS: Why do you need the condition if( CO_CL_WL[i][j] != 0 || CO_CL_WL_D[i][j] != 0)? If either of them is zero, the product is zero and the sum will not change. If you remove it, the compiler can generate much better vectorized code.
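A sketch of what the loop could look like without the condition, under the same assumption that the matrices are std::vector<std::vector<double>> (the names Matrix and product_sum_branchless are hypothetical). The inner loop is a plain dot product over two contiguous rows, and the omp simd hint (OpenMP 4.0 or later) is optional; many compilers vectorize such a loop on their own at higher optimization levels.

```cpp
#include <cassert>
#include <vector>

// Assumption: matrices are vectors of rows with contiguous row storage.
using Matrix = std::vector<std::vector<double>>;

// Hypothetical helper: same sum as before, but with no branch in the
// inner loop, so each row reduces to a simple dot product.
double product_sum_branchless(const Matrix& a, const Matrix& b) {
    assert(a.size() == b.size());
    const int rows = static_cast<int>(a.size());
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < rows; i++) {
        assert(a[i].size() == b[i].size());
        const double* ra = a[i].data();
        const double* rb = b[i].data();
        const int cols = static_cast<int>(a[i].size());

        double row_sum = 0.0;
        // Ask the compiler to vectorize the row-wise accumulation.
        #pragma omp simd reduction(+:row_sum)
        for (int j = 0; j < cols; j++) {
            row_sum += ra[j] * rb[j];
        }
        sum += row_sum;
    }
    return sum;
}
```

Because the summation order differs from the serial loop, the result may differ in the last few bits for general floating-point input.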

PS2: In the schedule(dynamic, num_threads) clause, the second parameter is the chunk size, not the number of threads used. I suggest removing it in your case. If you wish to specify the number of threads, add a num_threads clause or call the omp_set_num_threads function.
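To make the difference concrete, here is a sketch that sets the thread count with the num_threads clause and leaves the schedule at its default. The function name product_sum_threads and its threads parameter are invented for this example, and the matrix type is again assumed to be std::vector<std::vector<double>>.

```cpp
#include <cassert>
#include <vector>

// Assumption: matrices are vectors of rows.
using Matrix = std::vector<std::vector<double>>;

// Hypothetical helper: the thread count is requested with num_threads(...).
// Writing schedule(dynamic, 8) instead would not request 8 threads; it would
// hand out the rows in chunks of 8 iterations at a time.
double product_sum_threads(const Matrix& a, const Matrix& b, int threads) {
    assert(a.size() == b.size());
    assert(threads > 0);
    const int rows = static_cast<int>(a.size());
    double sum = 0.0;

    #pragma omp parallel for num_threads(threads) reduction(+:sum)
    for (int i = 0; i < rows; i++) {
        assert(a[i].size() == b[i].size());
        const int cols = static_cast<int>(a[i].size());
        for (int j = 0; j < cols; j++) {
            sum += a[i][j] * b[i][j];
        }
    }
    return sum;
}
```

The clause is a request: the runtime may use fewer threads than asked for, so it is worth measuring with a few different values on your hardware.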

Answer 1
3 accepted 2021-10-15 12:27:54
