Why am I getting worse performance with a private dynamic array
I want to use OpenMP to parallelize a for-loop calculation which does something like:
B = (int*)malloc(sizeof(int) * N); //N is known
for(i=0;i<500000;i++)
{
for(j=0;j<M;j++) B[j]=i+j; //M is different from N, but M <= N;
some operations on B which produce a variable L;
printf("%d\n",L);
}
I don't need to re-allocate B, as its values are redefined on each iteration. The operations only use B[0] to B[M-1]. This saves a lot of time on the allocation and initialization of B.
In order to use OpenMP, I changed the code to this:
#pragma omp parallel num_threads(32) private(i,j,B,M,L)
{
B = (int*)malloc(sizeof(int) * N); //N is known
#pragma omp parallel for
for(i=0;i<500000;i++)
{
for(j=0;j<M;j++) B[j]=i+j; //M is different from N, but M <= N;
some operations on B which produce a variable L;
printf("%d\n",L);
}
}
It runs really slowly compared to the first one, as it creates a new B array for each thread (so 500000 times). Is there a way to avoid this using OpenMP?
The main issue is that the iterations of the loop are not being assigned to threads as you wanted. Because you have added the parallel clause again to #pragma omp for, and assuming that you have nested parallelism disabled (which it is by default), each of the threads created in the outer parallel region will execute "sequentially" the code within that region, namely:
#pragma omp parallel for
for(i=0;i<500000;i++){
...
}
Therefore, each thread will execute all the 500000 iterations of the inner loop that you intended to be parallelized, consequently removing the parallelism and adding extra overhead (e.g., thread creation) to the sequential code. Nonetheless, one can easily solve this issue by merely removing the second parallel clause, namely:
#pragma omp parallel num_threads(32) private(i,j,B,M,L)
{
B = (int*)malloc(sizeof(int) * N); //N is known
#pragma omp for
for(i=0;i<500000;i++){
...
}
}
Depending upon the setup where the code will be executed (e.g., whether or not it is a NUMA architecture, whether the malloc implementation is a thread-aware memory allocator, among others), it might be advisable to profile your parallel region to check whether it pays off to move the allocation of the 2D array outside of that region. An example of what the alternative version might look like:
int total_threads = 32;
int** B = malloc(sizeof(int*) * total_threads);
for(int i = 0; i < total_threads; i++){
B[i] = malloc(N * sizeof(int));
}
#pragma omp parallel num_threads(32) private(i,j,M,L)
{
int threadID = omp_get_thread_num();
#pragma omp for
for(i=0;i<500000;i++)
{
for(j=0;j<M;j++)
B[threadID][j]=i+j; //M is different from N, but M <= N;
some operations on B which produce a variable L;
printf("%d\n",L);
}
}
// you might need to reduce all the values from all threads
// to main thread array.