
Why am I getting worse performance with a private dynamic array?

I want to use OpenMP to parallelize a for loop that does something like this:

B = (int*)malloc(sizeof(int) * N); //N is known
for(i=0;i<500000;i++)
{  
    for(j=0;j<M;j++) B[j]=i+j;  //M is different from N, but M <= N;
    /* some operations on B which produce a variable L */
    printf("%d\n",L);    
}

I don't need to re-allocate B, because its values are overwritten on each iteration as needed, and the operations only use B[0] to B[M-1]. This saves a lot of time on the allocation and initialization of B.
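For reference, a minimal compilable version of this serial loop is shown below. The values of N and M and the computation that produces L are stand-ins, since they are not specified here:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int N = 1000;                /* stand-in: N is known in the real code */
    int *B = malloc(sizeof(int) * N);
    if (B == NULL) return 1;

    for (int i = 0; i < 500000; i++)
    {
        int M = (i % N) + 1;     /* stand-in: M varies per iteration, M <= N */
        for (int j = 0; j < M; j++) B[j] = i + j;

        int L = B[M - 1];        /* stand-in for "some operations on B" */
        printf("%d\n", L);
    }

    free(B);
    return 0;
}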

To use OpenMP, I changed the code to this:

#pragma omp parallel num_threads(32) private(i,j,B,M,L)
{
  B = (int*)malloc(sizeof(int) * N); //N is known
  #pragma omp parallel for 
  for(i=0;i<500000;i++)
  {  
      for(j=0;j<M;j++) B[j]=i+j;  //M is different from N, but M <= N;
      /* some operations on B which produce a variable L */
      printf("%d\n",L);    
  }
}

It runs really slowly compared to the first one, as it creates a new B array for each thread (so 500000 times). Is there a way to avoid this using OpenMP?

The main issue is that the iterations of the loop are not being assigned to threads the way you wanted. Because you added the parallel clause again on #pragma omp for, and assuming that nested parallelism is disabled (which it is by default), each of the threads created in the outer parallel region will execute "sequentially" the code within that region, namely:

  #pragma omp parallel for 
  for(i=0;i<500000;i++){  
      ...
  }

Therefore, each thread will execute all 500000 iterations of the inner loop that you intended to be parallelized. Consequently, this removes the parallelism and adds additional overhead (e.g., thread creation) on top of the sequential code. Nonetheless, one can easily solve this issue by merely removing the second parallel clause, namely:

#pragma omp parallel num_threads(32) private(i,j,B,M,L)
{
    B = (int*)malloc(sizeof(int) * N); //N is known
    #pragma omp for 
    for(i=0;i<500000;i++){  
      ...   
    }
}
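Put together, a minimal compilable sketch of this corrected version could look as follows (N, M, and the computation of L are stand-ins, since the question leaves them unspecified). Each thread now allocates its private B exactly once, and frees it before leaving the parallel region:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    int N = 1000;   /* stand-in: N is known in the real code */
    int i, j, M, L;
    int *B;

    #pragma omp parallel num_threads(32) private(i, j, B, M, L)
    {
        B = malloc(sizeof(int) * N);   /* one allocation per thread, not per iteration */

        #pragma omp for
        for (i = 0; i < 500000; i++)
        {
            M = (i % N) + 1;           /* stand-in: M varies, M <= N */
            for (j = 0; j < M; j++) B[j] = i + j;
            L = B[M - 1];              /* stand-in for "some operations on B" */
            printf("%d\n", L);
        }

        free(B);   /* each thread frees its own private copy */
    }
    return 0;
}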

Depending upon the setup where the code will be executed (e.g., on a NUMA architecture or not, whether the malloc implementation is a thread-aware memory allocator, among others), it might be advisable to profile your parallel region to check whether it pays off to hoist the allocation out of that region, giving each thread its own row of a 2D array. An example of what that alternative version might look like:

int total_threads = 32;
int** B = malloc(sizeof(int*) * total_threads);  // one row per thread
for(int i = 0; i < total_threads; i++){
    B[i] = malloc(N * sizeof(int));
}

#pragma omp parallel num_threads(32) private(i,j,M,L)
{
  int threadID = omp_get_thread_num();
  #pragma omp for 
  for(i=0;i<500000;i++)
  {  
      for(j=0;j<M;j++) 
          B[threadID][j]=i+j;  //M is different from N, but M <= N;
      /* some operations on B[threadID] which produce a variable L */
      printf("%d\n",L);    
  }
}
// you might need to reduce all the values from all threads
// to main thread array.
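As a sketch of that final step, assuming (hypothetically, since the question does not say how the per-thread results relate) an element-wise combination, followed by releasing the rows allocated above:

/* Hypothetical: combine the per-thread rows into row 0 element-wise.
   Replace this with whatever combination your real computation needs. */
for (int t = 1; t < total_threads; t++)
    for (int j = 0; j < N; j++)
        B[0][j] += B[t][j];

/* Release the per-thread buffers once they are no longer needed. */
for (int t = 0; t < total_threads; t++)
    free(B[t]);
free(B);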
