
Multiplying a large square matrix by its transpose is slower than just multiplying two large square matrices… How to fix?

Apparently, transposing a matrix and then multiplying is faster than just multiplying the two matrices. However, my code right now does not do that and I have no clue why... The normal multiplication is just the triple-nested for loop, and it takes roughly 1.12 s to multiply two 1000x1000 matrices, whilst this code takes about 8 times as long (so slower instead of faster)... I am lost now; any help would be appreciated! :D

A = malloc (size*size * sizeof (double));
B = malloc (size*size * sizeof (double));
C = malloc (size*size * sizeof (double));



/* initialise array elements */
for (row = 0; row < size; row++) {
  for (col = 0; col < size; col++) {
    A[size * row + col] = rand();
    B[size * row + col] = rand();
  }
}

t1 = getTime();

/* code to be measured goes here */

T = malloc (size*size * sizeof(double));

for(i = 0; i < size; ++i) {
  for(j = 0; j <= i ; ++j) {
    T[size * i + j] = B[size * j + i];
  }
}

for (j = 0; j < size; ++j) {
  for (k = 0; k < size; ++k) {
    for (m = 0; m < size; ++m) {
      C[size * j + k] = A[size * j + k] * T[size * m + k];
    }
  }
}


t2 = getTime();

I see a couple of problems.

  1. You are just setting the value of C[size * j + k] instead of incrementing it. Even though this is an error in the computation, it shouldn't impact performance. Also, you need to initialize C[size * j + k] to 0.0 before the innermost loop starts. Otherwise, you will be incrementing an uninitialized value. That is a serious problem that could result in overflow.

  2. The multiplication term is wrong.

    Remember that your multiplication term needs to represent:

      C[j, k] += A[j, m] * B[m, k], which is C[j, k] += A[j, m] * T[k, m] 

    Instead of

      C[size * j + k] = A[size * j + k] * T[size * m + k]; 

    you need

      C[size * j + k] += A[size * j + m] * T[size * k + m];
      //              ^  ^                 ^^^^^^^^^^^^^^^^
      //              |  |                 Need to get T[k, m], not T[m, k]
      //              |  ^^^^^^^^^^^^^^^^
      //              |  Need to get A[j, m], not A[j, k]
      //              ^^^^ Increment, not set.

I think the main culprit that hurts performance, in addition to it being wrong, is your use of T[size * m + k]. When you do that, there is a lot of jumping around in memory to get to the data: m is the fastest-changing variable in the loop, and each increment of m moves the access forward by size doubles (8000 bytes for size = 1000), so every iteration of the inner loop touches a different cache line. When you use the correct term, T[size * k + m], the inner loop walks memory contiguously, and you should see a performance improvement.

In summary, use:

for (j = 0; j < size; ++j) {
   for (k = 0; k < size; ++k) {
      C[size * j + k] = 0.0;
      for (m = 0; m < size; ++m) {
         C[size * j + k] += A[size * j + m] * T[size * k + m];
      }
   }
}

You might be able to get a little bit more performance by using:

double* a = NULL;
double* c = NULL;
double* t = NULL;

for (j = 0; j < size; ++j) {
   a = A + (size*j);
   c = C + (size*j);
   for (k = 0; k < size; ++k) {
      t = T + size*k;
      c[k] = 0.0;
      for (m = 0; m < size; ++m) {
         c[k] += a[m] * t[m];
      }
   }
}

PS I haven't tested the code. Just giving you some ideas.

It is likely that your transpose runs slower than the multiplication in this test because the transpose is where the data is loaded from memory into cache, while the matrix multiplication runs out of cache, at least for 1000x1000 on many modern processors (three 1000x1000 matrices of doubles are 24 MB, which fits into the last-level cache of many Intel Xeon processors).
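One way to check this (a rough, untested sketch in the same spirit as the code above, meant to replace the measured region) is to time the two phases separately, reusing the getTime() helper from the question and the corrected multiplication loop shown earlier:

double tA = getTime();   /* assuming getTime() returns seconds as a double */

/* full transpose of B into T */
for (i = 0; i < size; ++i) {
  for (j = 0; j < size; ++j) {
    T[size * i + j] = B[size * j + i];
  }
}

double tB = getTime();

/* ... corrected triple loop from the summary above goes here ... */

double tC = getTime();
printf("transpose: %f s, multiply: %f s\n", tB - tA, tC - tB);   /* needs <stdio.h> */

If the transpose phase dominates, that points to the memory-traffic explanation above.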

In any case, both your transpose and multiplication are horribly inefficient. Your transpose is going to thrash the TLB, so you should use a blocking factor of 32 or so (see https://github.com/ParRes/Kernels/blob/master/SERIAL/Transpose/transpose.c for example code).
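As a rough illustration of the blocking idea (a sketch only; it assumes size is a multiple of the block size, while the linked kernel handles the general case properly):

#define BLOCK 32   /* blocking factor of 32 or so, as suggested above */

int ib, jb;
for (ib = 0; ib < size; ib += BLOCK) {
  for (jb = 0; jb < size; jb += BLOCK) {
    /* transpose one BLOCK x BLOCK tile; both the reads from B and the
       writes to T stay within a small, TLB-friendly set of pages */
    for (i = ib; i < ib + BLOCK; ++i) {
      for (j = jb; j < jb + BLOCK; ++j) {
        T[size * i + j] = B[size * j + i];
      }
    }
  }
}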

Furthermore, on x86, it is better to write contiguously (due to how cache-line locking and blocking stores work; if you use nontemporal stores carefully, this might change), whereas on some variants of PowerPC, in particular the Blue Gene variants, you want to read contiguously (because of in-order execution, nonblocking stores and write-through cache). See https://github.com/jeffhammond/HPCInfo/blob/master/tuning/transpose/transpose.c for example code.

Finally, I don't care what you say ("I specifically have to do it this way though"): you need to use BLAS for the matrix multiplication. End of story. If your supervisor or some other coworker is telling you otherwise, they are incompetent and should not be allowed to talk about code until they have been thoroughly reeducated. Please refer them to this post if you don't feel like telling them yourself.
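For reference, a minimal sketch of the BLAS call (assuming a CBLAS implementation such as OpenBLAS or Intel MKL is installed and linked, e.g. with -lopenblas):

#include <cblas.h>

/* C = A * B for row-major size x size matrices of doubles;
   dgemm computes C = alpha * op(A) * op(B) + beta * C */
cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
            size, size, size,   /* M, N, K       */
            1.0, A, size,       /* alpha, A, lda */
            B, size,            /* B, ldb        */
            0.0, C, size);      /* beta, C, ldc  */

A decent BLAS already blocks for cache and uses SIMD internally, so no explicit transpose is needed on your side.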
