
OpenMP C++ matrix multiplication

I'd like to parallelize the following code, especially this loop nest, since it is the most expensive operation.

      for (i = 0; i < d1; i++)
          for (j = 0; j < d3; j++)
              for (k = 0; k < d2; k++)
                  C[i][j] = C[i][j] + A[i][k] * B[k][j];

This is the first time I have tried parallelizing code using OpenMP. I have tried several things, but I always end up with a worse runtime than the serial version. It would be great if you could tell me whether there is something wrong with the code or the pragmas...

      #include <omp.h>
      #include <stdio.h>
      #include <stdlib.h>
      //#include <stdint.h>

      // ---------------------------------------------------------------------------
      // allocate space for empty matrix A[row][col]
      // access to matrix elements possible with:
      // - A[row][col]
      // - A[0][row*col]


      float **alloc_mat(int row, int col)
      {
          float **A1, *A2;

          A1 = (float **)calloc(row, sizeof(float *));      // pointer on rows
          A2 = (float *)calloc(row*col, sizeof(float));    // all matrix elements

          //#pragma omp parallel for
          for (int i=0; i<row; i++)
              A1[i] = A2 + i*col;

          return A1;
      }

      // ---------------------------------------------------------------------------
      // random initialisation of matrix with values [0..9]

      void init_mat(float **A, int row, int col)
      {   
          //#pragma omp parallel for
          for (int i = 0; i < row*col; i++)
              A[0][i] = (float)(rand() % 10);
      }

      // ---------------------------------------------------------------------------
      // DEBUG FUNCTION: printout of all matrix elements

      void print_mat(float **A, int row, int col, char *tag)
      {
          int i, j;

          printf("Matrix %s:\n", tag);
          for (i = 0; i < row; i++)
          {
              //#pragma omp parallel for
              for (j=0; j<col; j++) 
                  printf("%6.1f   ", A[i][j]);
              printf("\n"); 
          }
      }

      // ---------------------------------------------------------------------------

      int main(int argc, char *argv[])
      {
          float **A, **B, **C;  // matrices
          int d1, d2, d3;         // dimensions of matrices
          int i, j, k;          // loop variables


          double start, end;
          start = omp_get_wtime();

          /* print user instruction */
          if (argc != 4)
          {
              printf ("Matrix multiplication: C = A x B\n");
              printf ("Usage: %s <NumRowA> <NumColA> <NumColB>\n", argv[0]);
               return 0;
           }

           /* read user input */
           d1 = atoi(argv[1]);      // rows of A and C
           d2 = atoi(argv[2]);     // cols of A and rows of B
           d3 = atoi(argv[3]);     // cols of B and C

           printf("Matrix sizes C[%d][%d] = A[%d][%d] x B[%d][%d]\n",
                  d1, d3, d1, d2, d2, d3);

           /* prepare matrices */
           A = alloc_mat(d1, d2);
           init_mat(A, d1, d2); 
           B = alloc_mat(d2, d3);
           init_mat(B, d2, d3);
           C = alloc_mat(d1, d3);   // no initialisation of C, because it gets filled by matmult

           /* serial version of matmult */
           printf("Perform matrix multiplication...\n");



           #pragma omp parallel for collapse(3) schedule(guided)
           for (i = 0; i < d1; i++)
               for (j = 0; j < d3; j++)
                   for (k = 0; k < d2; k++)
                       C[i][j] = C[i][j] + A[i][k] * B[k][j];


           end = omp_get_wtime();


           /* test output */
           print_mat(A, d1, d2, "A"); 
           print_mat(B, d2, d3, "B"); 
           print_mat(C, d1, d3, "C"); 

           printf("This task took %f seconds\n", end-start);
           printf ("\nDone.\n");

           return 0;
       }

As @genisage suggested in the comments, the matrix is likely small enough that the overhead of starting the additional threads outweighs the time saved by computing the matrix multiplication in parallel. Consider the following plot, however, with data I obtained by running your code with and without OpenMP.

[Figure: comparison of serial and parallel matrix multiplication runtimes]

I used square matrices ranging from n=10 to n=1000. Notice how somewhere between n=50 and n=100 the parallel version becomes faster.
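One way to sidestep that crossover point is OpenMP's `if` clause, which falls back to serial execution when the problem is small. A minimal sketch on a flat row-major layout (the function name and the threshold of 64 are my own choices, and the threshold would need tuning for your machine; note that declaring the loop variables inside the `for` makes them private automatically):

```c
#include <stddef.h>

/* Multiply n x n matrices stored contiguously in row-major order.
 * The if clause disables the parallel region for small n, where
 * thread start-up costs more than the parallel computation saves. */
void matmult(const float *A, const float *B, float *C, int n)
{
    #pragma omp parallel for if(n > 64)
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;                      /* private accumulator */
            for (int k = 0; k < n; k++)
                sum += A[(size_t)i*n + k] * B[(size_t)k*n + j];
            C[(size_t)i*n + j] = sum;
        }
}
```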

There are other issues to consider, however, when trying to write fast matrix multiplication, and they mostly have to do with using the cache effectively. First, you allocate your entire matrix contiguously (which is good), but still go through two pointer indirections to access the data, which is unnecessary. Also, your matrices are stored in row-major order, which means you access the data in A and C contiguously, but not in B. Instead of storing B directly and multiplying a row of A with a column of B, you would get a faster multiplication by storing B transposed and multiplying a row of A elementwise with a row of B transposed.
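To illustrate the access-pattern argument, here is a sketch of the transposed variant on the same contiguous layout (the function name and the caller-provided scratch buffer `Bt` are my own; the key point is that after the transpose, every dot product walks two rows sequentially instead of striding down a column of B):

```c
#include <stddef.h>

/* C = A (d1 x d2) times B (d2 x d3), all row-major and contiguous.
 * Bt must hold d3*d2 floats. After the transpose, the inner loop
 * reads A and Bt with stride 1, which the cache and prefetcher
 * handle far better than stride-d3 accesses into B. */
void matmult_bt(const float *A, const float *B, float *Bt, float *C,
                int d1, int d2, int d3)
{
    for (int k = 0; k < d2; k++)        /* Bt[j][k] = B[k][j] */
        for (int j = 0; j < d3; j++)
            Bt[(size_t)j*d2 + k] = B[(size_t)k*d3 + j];

    #pragma omp parallel for
    for (int i = 0; i < d1; i++)
        for (int j = 0; j < d3; j++) {
            float sum = 0.0f;
            for (int k = 0; k < d2; k++)
                sum += A[(size_t)i*d2 + k] * Bt[(size_t)j*d2 + k];
            C[(size_t)i*d3 + j] = sum;
        }
}
```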

This optimization focuses only on A*B, however; there may be other places in your code where storing B is better than storing B transposed. In that case, doing the matrix multiplication by blocking often leads to better cache utilization.
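A minimal sketch of the blocked approach (the function name and the block size of 32 are my own assumptions; the block size should be tuned so that one tile each of A, B, and C fits in cache together):

```c
#include <stddef.h>

#define BS 32   /* block size: an assumption, tune for your cache */

/* Blocked C += A*B for n x n row-major matrices; the caller must
 * zero C first. Each (ii, kk, jj) triple works on BS x BS tiles,
 * so each tile of B is reused from cache instead of being streamed
 * from memory once per row of A. The parallel loop distributes row
 * blocks of C, so no two threads write the same element. */
void matmult_blocked(const float *A, const float *B, float *C, int n)
{
    #pragma omp parallel for
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                for (int i = ii; i < n && i < ii + BS; i++)
                    for (int k = kk; k < n && k < kk + BS; k++) {
                        float a = A[(size_t)i*n + k];
                        for (int j = jj; j < n && j < jj + BS; j++)
                            C[(size_t)i*n + j] += a * B[(size_t)k*n + j];
                    }
}
```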
