
OpenMP parallel code slower

I have two loops that I am parallelizing:

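  /* C = A * B */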
#pragma omp parallel for
  for (i = 0; i < ni; i++)
    for (j = 0; j < nj; j++) {
      C[i][j] = 0;
      for (k = 0; k < nk; ++k)
        C[i][j] += A[i][k] * B[k][j];
    }
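  /* E = C * D */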
#pragma omp parallel for
  for (i = 0; i < ni; i++)
    for (j = 0; j < nl; j++) {
      E[i][j] = 0;
      for (k = 0; k < nj; ++k)
        E[i][j] += C[i][k] * D[k][j];
    }

Strangely, the sequential execution is much faster than the parallel version above, even when using a large number of threads. Am I doing something wrong? Note that all arrays are global. Does this make the difference?

The iterations of your parallel outer loops share the index variables (j and k) of their inner loops. This certainly makes your code slower than you probably expected it to be, i.e., your loops are not "embarrassingly" (or "delightfully") parallel, and the parallel loop iterations need to access these variables in shared memory.

What is worse is that, because of this, your code contains race conditions. As a result, it will behave nondeterministically. In other words: your implementation of parallel matrix multiplication is currently incorrect! (Go ahead and check the results of your computations. ;))
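A quick way to convince yourself is to compare the parallel result against a sequential reference. Below is a minimal, self-contained sketch (the sizes NI/NJ/NK, the initialisation and the file name are my own assumptions, not taken from your code); because j and k are shared between threads, it typically reports a large number of mismatching elements, and, being a data race, it is formally undefined behaviour:

#include <stdio.h>

#define NI 300
#define NJ 300
#define NK 300

static double A[NI][NK], B[NK][NJ], C_par[NI][NJ], C_seq[NI][NJ];

int main(void) {
  int i, j, k;

  /* Arbitrary initialisation. */
  for (i = 0; i < NI; i++) for (k = 0; k < NK; k++) A[i][k] = i + k;
  for (k = 0; k < NK; k++) for (j = 0; j < NJ; j++) B[k][j] = k - j;

  /* Racy version: j and k are declared outside the parallel loop,
     so they are shared between threads, just like in the code above. */
#pragma omp parallel for
  for (i = 0; i < NI; i++)
    for (j = 0; j < NJ; j++) {
      C_par[i][j] = 0;
      for (k = 0; k < NK; ++k)
        C_par[i][j] += A[i][k] * B[k][j];
    }

  /* Sequential reference. */
  for (i = 0; i < NI; i++)
    for (j = 0; j < NJ; j++) {
      C_seq[i][j] = 0;
      for (k = 0; k < NK; ++k)
        C_seq[i][j] += A[i][k] * B[k][j];
    }

  /* Count elements that differ from the reference. */
  long mismatches = 0;
  for (i = 0; i < NI; i++)
    for (j = 0; j < NJ; j++)
      if (C_par[i][j] != C_seq[i][j])
        mismatches++;

  printf("mismatching elements: %ld\n", mismatches);
  return 0;
}

Compile with, for example, gcc -fopenmp race_check.c and run it a few times; the count usually differs from run to run, which is exactly the nondeterminism mentioned above.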

What you want to do is make sure that all iterations of your outer loops have their own private copies of the index variables j and k. You can achieve this either by declaring these variables within the scope of the parallel loops:

int i;

#pragma omp parallel for
  for (i = 0; i < ni; i++) {
    int j1, k1;  /* explicit local copies */
    for (j1 = 0; j1 < nj; j1++) {
      C[i][j1] = 0;
      for (k1 = 0; k1 < nk; ++k1)
        C[i][j1] += A[i][k1] * B[k1][j1];
    }
  }
#pragma omp parallel for
  for (i = 0; i < ni; i++) {
    int j2, k2;  /* explicit local copies */
    for (j2 = 0; j2 < nl; j2++) {
      E[i][j2] = 0;
      for (k2 = 0; k2 < nj; ++k2)
        E[i][j2] += C[i][k2] * D[k2][j2];
    }
  }

or by declaring them as private in your loop pragmas:

int i, j, k;

#pragma omp parallel for private(j, k)
  for (i = 0; i < ni; i++)
    for (j = 0; j < nj; j++) {
      C[i][j] = 0;
      for (k = 0; k < nk; ++k)
        C[i][j] += A[i][k] * B[k][j];
    }
#pragma omp parallel for private(j, k)
  for (i = 0; i < ni; i++)
    for (j = 0; j < nl; j++) {
      E[i][j] = 0;
      for (k = 0; k < nj; ++k)
        E[i][j] += C[i][k] * D[k][j];
    }

Will these changes make your parallel implementation faster than your sequential implementation? Hard to say; it depends on your problem size. Parallelisation (in particular, parallelisation through OpenMP) comes with some overhead, and only if you spawn enough parallel work will the gain of distributing it over parallel threads outweigh the incurred overhead costs.

To find out how much work is enough for your code and your software/hardware platform, I advise experimenting by running your code with different matrix sizes. Then, if you also expect "too small" matrix sizes as inputs for your computation, you may want to make parallel processing conditional (for example, by decorating your loop pragmas with an if clause):

#pragma omp parallel for private (j, k) if(ni * nj * nk > THRESHOLD)
  for (i = 0; i < ni; i++) {
     ...
  }
#pragma omp parallel for private (j, k) if(ni * nl * nj > THRESHOLD)
  for (i = 0; i < ni; i++) {
    ...
  }
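
As a sketch of such an experiment (the THRESHOLD value, the square n-by-n matrices and the timing harness are my own assumptions, not part of your code), you can time one multiplication for a range of sizes with omp_get_wtime() and pick the cut-off where the parallel version starts to pay off:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define THRESHOLD (64L * 64L * 64L)   /* hypothetical cut-off, tune per machine */

/* One matrix product C = A * B with n x n matrices stored row-major;
   parallelised only when the amount of work exceeds THRESHOLD. */
static void matmul(int n, const double *A, const double *B, double *C) {
  int i, j, k;
#pragma omp parallel for private(j, k) if((long)n * n * n > THRESHOLD)
  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
      double sum = 0.0;
      for (k = 0; k < n; ++k)
        sum += A[i * n + k] * B[k * n + j];
      C[i * n + j] = sum;
    }
}

int main(void) {
  int sizes[] = {32, 64, 128, 256, 512};   /* sizes to experiment with */
  for (int s = 0; s < (int)(sizeof sizes / sizeof sizes[0]); s++) {
    int n = sizes[s];
    double *A = malloc(sizeof(double) * n * n);
    double *B = malloc(sizeof(double) * n * n);
    double *C = malloc(sizeof(double) * n * n);
    for (int x = 0; x < n * n; x++) { A[x] = x % 7; B[x] = x % 5; }

    double t0 = omp_get_wtime();
    matmul(n, A, B, C);
    double t1 = omp_get_wtime();
    printf("n = %4d: %.6f s\n", n, t1 - t0);

    free(A); free(B); free(C);
  }
  return 0;
}

Run it once with OMP_NUM_THREADS=1 and once with all your cores enabled; the size at which the multi-threaded run first becomes faster is a reasonable starting point for THRESHOLD.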
