Why am I getting worse performance with a private dynamic array
I want to use OpenMP to parallelize a for-loop calculation which does something like:
B = (int*)malloc(sizeof(int) * N); //N is known
for(i=0;i<500000;i++)
{
for(j=0;j<M;j++) B[j]=i+j; //M is different from N, but M <= N;
some operations on B which produce a variable L;
printf("%d\n",L);
}
I don't need to re-allocate B, as its values are redefined on each iteration. The operations only use B[0] to B[M-1]. This saves a lot of time on the allocation and initialization of B.
In order to use OpenMP, I changed the code to this:
#pragma omp parallel num_threads(32) private(i,j,B,M,L)
{
B = (int*)malloc(sizeof(int) * N); //N is known
#pragma omp parallel for
for(i=0;i<500000;i++)
{
for(j=0;j<M;j++) B[j]=i+j; //M is different from N, but M <= N;
some operations on B which produce a variable L;
printf("%d\n",L);
}
}
It runs really slowly compared to the first one, as it creates a new B array for each thread (so 500000 times). Is there a way to avoid this using OpenMP?
The main issue is that the iterations of the loop are not being assigned to threads as you wanted. Because you have added the parallel clause again to #pragma omp for, and assuming that you have nested parallelism disabled (which it is by default), each of the threads created in the outer parallel region will execute "sequentially" the code within that region, namely:
#pragma omp parallel for
for(i=0;i<500000;i++){
...
}
Therefore, each thread will execute all the 500000 iterations of the inner loop that you intended to be parallelized, consequently removing the parallelism and adding extra overhead (e.g., thread creation) to the sequential code. Nonetheless, one can easily solve this issue by merely removing the second parallel clause, namely:
#pragma omp parallel num_threads(32) private(i,j,B,M,L)
{
B = (int*)malloc(sizeof(int) * N); //N is known
#pragma omp for
for(i=0;i<500000;i++){
...
}
}
Depending upon the setup where the code will be executed (e.g., whether or not it is a NUMA architecture, whether the malloc implementation is a thread-aware memory allocator, among others), it might be advisable to profile your parallel region to check whether it pays off to move the allocation of the 2D array outside of that region. An example of what the alternative version might look like:
int total_threads = 32;
int** B = malloc(sizeof(int*) * total_threads);
for(int i = 0; i < total_threads; i++){
B[i] = malloc(N * sizeof(int));
}
#pragma omp parallel num_threads(32) private(i,j,M,L)
{
int threadID = omp_get_thread_num();
#pragma omp for
for(i=0;i<500000;i++)
{
for(j=0;j<M;j++)
B[threadID][j]=i+j; //M is different from N, but M <= N;
some operations on B which produce a variable L;
printf("%d\n",L);
}
}
// you might need to reduce all the values from all threads
// to main thread array.