
Multiplying a large square matrix by its transpose is slower than just multiplying two large square matrices… How to fix?

Apparently, transposing a matrix and then multiplying is supposed to be faster than multiplying the two matrices directly. However, my code is not faster and I have no clue why... The normal multiplication is just the triple-nested for loop, and it takes roughly 1.12 s to multiply a 1000x1000 matrix, while this code takes about 8 times as long (so slower instead of faster)... I am lost now; any help would be appreciated! :D

A = malloc (size*size * sizeof (double));
B = malloc (size*size * sizeof (double));
C = malloc (size*size * sizeof (double));



/* initialise array elements */
for (row = 0; row < size; row++) {
  for (col = 0; col < size; col++) {
    A[size * row + col] = rand();
    B[size * row + col] = rand();
  }
}

t1 = getTime();

/* code to be measured goes here */

T = malloc (size*size * sizeof(double));

for(i = 0; i < size; ++i) {
  for(j = 0; j <= i ; ++j) {
    T[size * i + j] = B[size * j + i];
  }
}

for (j = 0; j < size; ++j) {
  for (k = 0; k < size; ++k) {
    for (m = 0; m < size; ++m) {
      C[size * j + k] = A[size * j + k] * T[size * m + k];
    }
  }
}


t2 = getTime();
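For reference, the "normal multiplying" mentioned above is presumably the standard triple-nested loop, something like the following sketch (the function name and signature are assumptions, not the asker's actual code):

```c
#include <stdlib.h>

/* Naive C = A * B for size x size row-major matrices. */
void matmul_naive(const double *A, const double *B, double *C, int size) {
    for (int row = 0; row < size; row++) {
        for (int col = 0; col < size; col++) {
            double sum = 0.0;
            /* Walk row `row` of A and column `col` of B. */
            for (int k = 0; k < size; k++)
                sum += A[size * row + k] * B[size * k + col];
            C[size * row + col] = sum;
        }
    }
}
```

Note that here B is traversed column-wise (stride `size`), which is exactly the access pattern the transpose trick is meant to avoid.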

I see a couple of problems.

  1. You are just setting the value of C[size * j + k] instead of incrementing it. Even though this is an error in the computation, it shouldn't impact performance. You also need to initialize C[size * j + k] to 0.0 before the innermost loop starts; otherwise you will be incrementing an uninitialized value, which is a serious problem that could even result in overflow.

  2. The multiplication term is wrong.

    Remember that your multiplication term needs to represent:

      C[j, k] += A[j, m] * B[m, k], which is C[j, k] += A[j, m] * T[k, m] 

    Instead of

      C[size * j + k] = A[size * j + k] * T[size * m + k]; 

    you need

      C[size * j + k] += A[size * j + m] * T[size * k + m];
      //              ^                ^    ^^^^^^^^^^^^^^
      //              |                |    Need to get T[k, m], not T[m, k]
      //              |                ^ Need to get A[j, m], not A[j, k]
      //              ^ Increment, not set.

I think the main culprit hurting performance, in addition to the term being wrong, is your use of T[size * m + k]. With that indexing there is a lot of jumping around in memory (m is the fastest-changing variable in the loop) to get to the data. When you use the correct term, T[size * k + m], the innermost loop walks memory sequentially and you should see a performance improvement.

In summary, use:

for (j = 0; j < size; ++j) {
   for (k = 0; k < size; ++k) {
      C[size * j + k] = 0.0;
      for (m = 0; m < size; ++m) {
         C[size * j + k] += A[size * j + m] * T[size * k + m];
      }
   }
}

You might be able to get a little bit more performance by using:

double* a = NULL;
double* c = NULL;
double* t = NULL;

for (j = 0; j < size; ++j) {
   a = A + (size*j);
   c = C + (size*j);
   for (k = 0; k < size; ++k) {
      t = T + size*k;
      c[k] = 0.0;
      for (m = 0; m < size; ++m) {
         c[k] += a[m] * t[m];
      }
   }
}

PS I haven't tested the code. Just giving you some ideas.

It is likely that your transpose runs slower than the multiplication in this test because the transpose is where the data is loaded from memory into cache, while the matrix multiplication runs out of cache, at least for 1000x1000 with many modern processors (three 1000x1000 double matrices total 24 MB, which fits into cache on many Intel Xeon processors).

In any case, both your transpose and multiplication are horribly inefficient. Your transpose is going to thrash the TLB, so you should use a blocking factor of 32 or so (see https://github.com/ParRes/Kernels/blob/master/SERIAL/Transpose/transpose.c for example code).
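A blocked transpose along those lines might look like the following sketch (the block size of 32 follows the suggestion above; this is not the code from the linked repository):

```c
#include <stddef.h>

#define BLOCK 32  /* tile size; keeps each tile's pages/cache lines hot */

/* T = transpose(B) for a size x size row-major matrix, tile by tile. */
void transpose_blocked(const double *B, double *T, int size) {
    for (int ib = 0; ib < size; ib += BLOCK) {
        for (int jb = 0; jb < size; jb += BLOCK) {
            /* Clamp the tile at the matrix edge. */
            int imax = (ib + BLOCK < size) ? ib + BLOCK : size;
            int jmax = (jb + BLOCK < size) ? jb + BLOCK : size;
            for (int i = ib; i < imax; i++)
                for (int j = jb; j < jmax; j++)
                    T[(size_t)size * j + i] = B[(size_t)size * i + j];
        }
    }
}
```

Each BLOCK x BLOCK tile is read and written while its pages are still resident, instead of striding across the whole matrix on every row.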

Furthermore, on x86, it is better to write contiguously (due to how cache-line locking and blocking stores work; if you use nontemporal stores carefully, this might change), whereas on some variants of PowerPC, in particular the Blue Gene variants, you want to read contiguously (because of in-order execution, nonblocking stores, and write-through cache). See https://github.com/jeffhammond/HPCInfo/blob/master/tuning/transpose/transpose.c for example code.

Finally, I don't care what you say ("I specifically have to do it this way though"), you need to use BLAS for the matrix multiplication. End of story. If your supervisor or some other coworker is telling you otherwise, they are incompetent and should not be allowed to talk about code until they have been thoroughly reeducated. Please refer them to this post if you don't feel like telling them yourself.
