Sequential and parallel versions give different results - Why?
I have a nested loop (L and A are fully defined inputs):
#pragma omp parallel for schedule(guided) shared(L,A) \
    reduction(+:dummy)
for (i = k + 1; i < row; i++) {
    for (n = 0; n < k; n++) {
        #pragma omp atomic
        dummy += L[i][n] * L[k][n];
        L[i][k] = (A[i][k] - dummy) / L[k][k];
    }
    dummy = 0;
}
And its sequential version:
for (i = k + 1; i < row; i++) {
    for (n = 0; n < k; n++) {
        dummy += L[i][n] * L[k][n];
        L[i][k] = (A[i][k] - dummy) / L[k][k];
    }
    dummy = 0;
}
They give different results, and the parallel version is much slower than the sequential one. What may cause the problem?
Edit:

To get rid of the problems caused by the atomic directive, I modified the code as follows:
#pragma omp parallel for schedule(guided) shared(L,A) \
    private(i)
for (i = k + 1; i < row; i++) {
    double dummyy = 0;
    for (n = 0; n < k; n++) {
        dummyy += L[i][n] * L[k][n];
        L[i][k] = (A[i][k] - dummyy) / L[k][k];
    }
}
But that didn't solve the problem either; the results are still different.
I am not very familiar with OpenMP, but it seems to me that your calculations are not order-independent. Namely, the result of the inner loop is written into L[i][k], where i and k are invariants for the inner loop. This means that the same element is overwritten k times during the inner loop, resulting in a race condition.
Moreover, dummy seems to be shared between the different threads, so there might be a race condition there too, unless your pragma parameters somehow prevent it.

Altogether, it looks to me as if the calculations in the inner loop must be performed in the same sequential order if you want the same result that the sequential execution gives. Thus only the outer loop can be parallelized.
In your parallel version you have inserted an unnecessary (and possibly harmful) atomic directive. Once you have declared dummy to be a reduction variable, OpenMP takes care of stopping the threads from interfering in the reduction. I think the main impact of the unnecessary directive is to slow your code down, a lot.
I see you have another answer addressing the wrongness of your results. But I notice that you seem to set dummy to 0 at the end of each outer loop iteration, which seems strange if you are trying to use it as some kind of accumulator, which is what the reduction clause suggests. Perhaps you want to reduce into dummy across the inner loop?
If you are having problems with reduction, read this.
The difference in results comes from the inner loop variable n, which is shared between threads since it is defined outside of the omp pragma.
Clarified: the loop variable n should be declared inside the omp pragma, since it should be thread-specific, for example for (int n = 0; ...).