Make a reduction with OpenMP to compute the final summed value of a matrix element
I have the following double loop, in which I compute the matrix element Fisher_M[FX][FY]. I tried to optimize it by adding the OpenMP pragma #pragma omp parallel for schedule(dynamic, num_threads), but the gain is not as good as expected.
Is there a way to do a sum reduction with OpenMP to compute the element Fisher_M[FX][FY] quickly? Or is this perhaps doable with MAGMA or CUDA?
#define num_threads 8
#pragma omp parallel for schedule(dynamic, num_threads)
for(int i=0; i<CO_CL_WL.size(); i++){
    for(int j=0; j<CO_CL_WL.size(); j++){
        if( CO_CL_WL[i][j] != 0 || CO_CL_WL_D[i][j] != 0){
            Fisher_M[FX][FY] += CO_CL_WL[i][j]*CO_CL_WL_D[i][j];
        }
    }
}
Your code has a race condition at the line Fisher_M[FX][FY] += ...: several threads update the same element concurrently. A reduction can be used to solve it:
double sum = 0; // change the type as needed
#pragma omp parallel for reduction(+:sum)
for(int i=0; i<CO_CL_WL.size(); i++){
    for(int j=0; j<CO_CL_WL.size(); j++){
        if( CO_CL_WL[i][j] != 0 || CO_CL_WL_D[i][j] != 0){
            sum += CO_CL_WL[i][j]*CO_CL_WL_D[i][j];
        }
    }
}
Fisher_M[FX][FY] += sum;
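For reference, here is the same reduction wrapped in a function that compiles on its own. This is only a sketch: the `Matrix` alias (a `std::vector<std::vector<double>>`) and the name `sum_of_products` are assumptions, since the question does not show how CO_CL_WL and CO_CL_WL_D are declared. The inner loop uses the length of each row, so the matrices do not have to be square.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed storage type; the question does not show the real declarations.
using Matrix = std::vector<std::vector<double>>;

// Hypothetical helper: sum of the element-wise products of a and b,
// which are assumed to have the same shape.
double sum_of_products(const Matrix& a, const Matrix& b) {
    assert(a.size() == b.size());
    const std::ptrdiff_t rows = static_cast<std::ptrdiff_t>(a.size());
    double sum = 0.0;
    // Each thread accumulates into its own private copy of sum;
    // OpenMP adds the private copies together when the loop ends.
    #pragma omp parallel for reduction(+:sum)
    for (std::ptrdiff_t i = 0; i < rows; ++i) {
        const std::size_t cols = a[i].size();
        for (std::size_t j = 0; j < cols; ++j) {
            if (a[i][j] != 0 || b[i][j] != 0) {  // condition kept from the question
                sum += a[i][j] * b[i][j];
            }
        }
    }
    return sum;
}
```

A caller would then write something like Fisher_M[FX][FY] += sum_of_products(CO_CL_WL, CO_CL_WL_D);. With GCC or Clang, OpenMP has to be enabled with -fopenmp; without it the pragma is ignored and the loop runs serially with the same result.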
Note that this code is memory-bound rather than compute-bound, so the performance gain from parallelization may be smaller than expected (and depends on your hardware).
PS: Why do you need the condition if( CO_CL_WL[i][j] != 0 || CO_CL_WL_D[i][j] != 0)? If either of them is zero, the product is zero and the sum does not change. If you remove it, the compiler can generate much better vectorized code.
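To illustrate the point about the condition, here is a minimal sketch of a branch-free inner loop over one row. The helper name `row_dot` and the use of `std::vector<double>` rows are assumptions for illustration. The `omp simd` pragma is only a hint; whether and how well the loop is vectorized depends on the compiler and its flags.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: dot product of two rows with no per-element test.
// A zero factor simply contributes a zero product to the sum.
double row_dot(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    const double* px = x.data();
    const double* py = y.data();
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(x.size());
    double acc = 0.0;
    // The reduction clause tells the compiler it may reorder the additions,
    // which is what allows the loop to be vectorized.
    #pragma omp simd reduction(+:acc)
    for (std::ptrdiff_t j = 0; j < n; ++j) {
        acc += px[j] * py[j];
    }
    return acc;
}
```

The outer loop over rows can keep its parallel for reduction(+:sum) and add row_dot(CO_CL_WL[i], CO_CL_WL_D[i]) to sum for each i, again assuming the rows are vectors of double.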
PS2: In the schedule(dynamic, num_threads) clause, the second parameter is the chunk size, not the number of threads used. I suggest removing it in your case. If you wish to specify the number of threads used, add a num_threads clause or call the omp_set_num_threads function.
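A small sketch of how the two settings are kept apart: the `num_threads(...)` clause sets the size of the thread team, while the second argument of `schedule(dynamic, ...)` is the number of iterations handed out at a time. The function name `dot_with_threads` and the parameters `thread_count` and `chunk` are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: sum of v[i] * w[i], computed by thread_count threads
// that each take chunk iterations at a time.
double dot_with_threads(const std::vector<double>& v,
                        const std::vector<double>& w,
                        int thread_count, int chunk) {
    assert(v.size() == w.size());
    assert(thread_count > 0 && chunk > 0);
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(v.size());
    double sum = 0.0;
    // num_threads(...): team size; schedule(dynamic, chunk): chunk size.
    #pragma omp parallel for num_threads(thread_count) schedule(dynamic, chunk) reduction(+:sum)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        sum += v[i] * w[i];
    }
    return sum;
}
```

For a loop like the one in the question, where every iteration costs about the same, leaving out the schedule clause is probably the simplest choice, as suggested above.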