openmp slower with more than one thread, can't figure out
I have a problem where the following code runs slower with OpenMP:
chunk = nx/nthreads;
int i, j;
for(int t = 0; t < n; t++){
    #pragma omp parallel for default(shared) private(i, j) schedule(static,chunk)
    for(i = 1; i < nx/2+1; i++){
        for(j = 1; j < nx-1; j++){
            T_c[i][j] = 0.25*(T_p[i-1][j] + T_p[i+1][j] + T_p[i][j-1] + T_p[i][j+1]);
            T_c[nx-i+1][j] = T_c[i][j];
        }
    }
    copyT(T_p, T_c, nx);
}
print2file(T_c, nx, file);
The problem is that when I run with more than one thread, the computation takes much longer.
First, your parallel region is restarted on each iteration of the outer loop, which adds a huge overhead. Second, half of the threads would just sit there doing nothing, since your chunk size is twice as big as it should be: it is nx/nthreads while the number of iterations of the parallel loop is only nx/2, hence there are (nx/2)/(nx/nthreads) = nthreads/2 chunks in total.
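The chunk arithmetic can be checked with a small helper (the function name and the numbers in the example are illustrative, not from the question):

```c
/* How many schedule(static, chunk) chunks the parallel loop produces
   when chunk = nx/nthreads but the loop only has nx/2 iterations. */
int num_chunks(int nx, int nthreads)
{
    int chunk = nx / nthreads;   /* chunk size used in the question */
    int iterations = nx / 2;     /* trip count of the parallel loop */
    return iterations / chunk;   /* = nthreads/2 chunks handed out */
}
```

For example, with nx = 1000 and 4 threads only 2 chunks exist, so 2 of the 4 threads receive no iterations at all.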
块。 Besides what you have tried to achieve is to replicate the behaviour of schedule(static)
. 除了你试图实现的是复制
schedule(static)
的行为。
#pragma omp parallel
for (int t = 0; t < n; t++) {
    #pragma omp for schedule(static)
    for (int i = 1; i < nx/2+1; i++) {
        for (int j = 1; j < nx-1; j++) {
            T_c[i][j] = 0.25*(T_p[i-1][j] + T_p[i+1][j] + T_p[i][j-1] + T_p[i][j+1]);
            T_c[nx-i-1][j] = T_c[i][j];
        }
    }
    #pragma omp single
    copyT(T_p, T_c, nx);
}
print2file(T_c, nx, file);
If you modify copyT to also use parallel for, then the single construct should be removed. You do not need default(shared), as this is the default. You also do not need to declare the loop variable of a parallel loop private: even if the variable comes from an outer scope (and hence is implicitly shared in the region), OpenMP automatically makes it private. Simply declare all loop variables in the loop controls and it works automagically with the default sharing rules applied.
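A minimal sketch of such a parallelised copyT (the real signature is not shown in the question, so the array type, fixed size, and argument order here are assumptions):

```c
#define NX 8  /* illustrative size; the question's nx is defined elsewhere */

/* Hypothetical sketch: an orphaned "omp for" binds to the enclosing
   parallel region, so every thread takes a share of the copy and the
   "single" construct around the call can be dropped. */
void copyT(double dst[NX][NX], double src[NX][NX], int nx)
{
    #pragma omp for collapse(2)
    for (int i = 0; i < nx; i++)
        for (int j = 0; j < nx; j++)
            dst[i][j] = src[i][j];
}
```

Called from outside a parallel region, the orphaned worksharing construct simply executes sequentially, so the function stays correct in serial code as well.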
Second and a half, there is (probably) an error in your inner loop. The second assignment statement should read:

T_c[nx-i-1][j] = T_c[i][j];

(or T_c[nx-i][j] if you do not keep a halo on the lower side); otherwise, when i equals 1, you would be accessing T_c[nx][...], which is outside the bounds of T_c.
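The bounds argument can be made concrete with a tiny helper (purely illustrative, not part of the original code):

```c
/* Corrected mirror index: interior row i (1 <= i <= nx/2) mirrors to row
   nx-i-1, which always stays inside [0, nx-1]. The original expression
   nx-i+1 evaluates to nx at i == 1, one row past the end of T_c. */
int mirror_row(int nx, int i)
{
    return nx - i - 1;
}
```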
Third, a general hint: instead of copying one array into the other, use pointers to those arrays and just swap the two pointers at the end of each iteration.
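A minimal sketch of that pointer swap (the function name is illustrative, and it assumes the grids live behind pointers rather than being statically sized 2-D arrays):

```c
/* Exchange the "previous" and "current" grid pointers in O(1), replacing
   the O(nx*nx) copyT call at the end of each time step. */
void swap_grids(double **prev, double **curr)
{
    double *tmp = *prev;
    *prev = *curr;
    *curr = tmp;
}
```

After each time step, swapping the pointers makes the freshly computed grid the input of the next step; this requires T_p and T_c to be pointer-based (for example, flattened heap-allocated arrays), not fixed-size 2-D arrays.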
I see at least three problems that could lead to bad performance in the snippet you posted; for one, opening and closing a parallel region inside a loop may hurt performance. You can find below a trace of some modifications I would make to the code:
// By moving the omp parallel here you open/close the parallel
// region only one time, not n times
#pragma omp parallel default(shared)
for(int t = 0; t < n; t++){
    // With collapse you parallelize over an iteration space of
    // (nx/2)*(nx-2) elements, not only the nx/2 outer iterations
    #pragma omp for collapse(2) schedule(static)
    for(int i = 1; i < nx/2+1; i++){
        for(int j = 1; j < nx-1; j++){
            T_c[i][j] = 0.25*(T_p[i-1][j] + T_p[i+1][j] + T_p[i][j-1] + T_p[i][j+1]);
            T_c[nx-i-1][j] = T_c[i][j]; // nx-i-1, not nx-i+1, to stay in bounds at i == 1
        }
    }
    // As the iteration space is small and the work per iteration light,
    // static schedule is likely the best option, as it adds the least
    // scheduling overhead
    #pragma omp single // only one thread should perform the copy
    copyT(T_p, T_c, nx);
}
print2file(T_c, nx, file);