
Speed up Matrix Multiplication by SSE2

I want to know how to speed up matrix multiplication using SSE2.

Here is my code:

int mat_mult_simd(double *a, double *b, double *c, int n)
{
   __m128d c1,c2,a1,a2,b1;

   for(int i=0; i<n/2; i++){
      for(int j=0; j<n/2; j++){
          c1 = _mm_load_pd(c+(2*j*n)+(i*2));    /* i*2, not i+2, to match the b index */
          c2 = _mm_load_pd(c+n+(2*j*n)+(i*2));
          for(int k=0; k<n; k++){
             a1 = _mm_load1_pd(a+k+(2*j*n));    /* broadcast one element of a */
             a2 = _mm_load1_pd(a+n+k+(2*j*n));
             b1 = _mm_load_pd(b+(k*n)+(i*2));   /* two adjacent elements of b */
             c1 = _mm_add_pd(c1, _mm_mul_pd(a1,b1));
             c2 = _mm_add_pd(c2, _mm_mul_pd(a2,b1));
          }
          _mm_store_pd(c+(2*j*n)+(i*2), c1);    /* _mm_store_pd, not __mm_store_pd */
          _mm_store_pd(c+n+(2*j*n)+(i*2), c2);
      }
   }
   return 0;
}

Each parameter means:

'a' = vector a (MAT_SIZE*MAT_SIZE)

'b' = vector b (MAT_SIZE*MAT_SIZE)

'c' = vector c (MAT_SIZE*MAT_SIZE)

'n' = MAT_SIZE, a constant (it is always even and >= 2)

This code gives about a 4x speedup over:

int mat_mult_default(double *a, double *b, double *c, int n)
{
 double t;
 for(int i=0; i<n; i++){
    for(int j=0; j<n; j++){
       t=0.0;
       for(int k=0; k<n; k++)
          t += a[i*n+k] * b[k*n+j];
       c[i*n+j] = t;
    }
 }
 return 0;   /* the function is declared int, so return a value */
}

But I want to speed it up more. I usually experiment with MAT_SIZE of 1000*1000 or 2000*2000. How can I speed it up? Is there another way to do the indexing? I really want to know. Thanks.

You can do a few things. The obvious one is splitting the work into several threads (one per core). You can use OpenMP (easiest), Intel TBB, or other multithreading libraries. This will provide a significant improvement on a multi-core machine.
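As a minimal sketch of that idea (the function name is illustrative, not from the question), the scalar version can be parallelized with a single OpenMP pragma; each thread writes disjoint rows of c, so no locking is needed:

```c
/* Sketch: parallelize the outer loop of the scalar multiply with
   OpenMP. Iterations over i are independent, so the rows of c
   written by different threads never overlap. */
int mat_mult_omp(const double *a, const double *b, double *c, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double t = 0.0;
            for (int k = 0; k < n; k++)
                t += a[i*n + k] * b[k*n + j];
            c[i*n + j] = t;
        }
    }
    return 0;
}
```

Compile with -fopenmp (GCC/Clang) or /openmp (MSVC); without the flag the pragma is ignored and the code still runs serially, which makes it easy to debug.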

Another thing is to look at the disassembly (via your favorite debugger) and see how the compiler handles all the multiplications you use for the indexes; some of them can be eliminated.
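One way to sketch that elimination by hand (names are illustrative): hoist the per-row products out of the inner loops and walk b with an incremented pointer instead of recomputing k*n every iteration:

```c
/* Sketch: strength reduction of the index arithmetic. a + i*n and
   c + i*n are computed once per row, and the column walk through b
   uses bp += n instead of a fresh k*n multiply each iteration. */
int mat_mult_ptr(const double *a, const double *b, double *c, int n)
{
    for (int i = 0; i < n; i++) {
        const double *arow = a + i*n;   /* row i of a, computed once */
        double *crow = c + i*n;         /* row i of c, computed once */
        for (int j = 0; j < n; j++) {
            double t = 0.0;
            const double *bp = b + j;   /* walks down column j of b */
            for (int k = 0; k < n; k++) {
                t += arow[k] * *bp;
                bp += n;                /* next row: no k*n multiply */
            }
            crow[j] = t;
        }
    }
    return 0;
}
```

A good optimizing compiler often does this on its own, which is exactly what checking the disassembly will tell you.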

Your code does 2 computations in one loop; try to do 4 or 8 to get better locality. E.g. a1 and a2 could be computed together with their neighbors, which are already in the L1 cache, and you can actually load them with a single load operation.

Make sure the various arrays are SSE-aligned (16 bytes) and change your code to use aligned reads/writes.
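A minimal sketch of such an allocation (helper name is illustrative): _mm_malloc returns memory with the alignment you request, so the aligned _mm_load_pd / _mm_store_pd intrinsics are safe on it, whereas plain malloc only guarantees 8-byte alignment for double on some ABIs:

```c
#include <emmintrin.h>
#include <stdlib.h>

/* Sketch: allocate an n x n matrix of doubles on a 16-byte boundary
   so aligned SSE2 loads and stores never fault. Free with _mm_free,
   not free. */
double *alloc_matrix_aligned(int n)
{
    return _mm_malloc((size_t)n * (size_t)n * sizeof(double), 16);
}
```

Pointers returned by alloc_matrix_aligned must be released with _mm_free; mixing them with free is undefined behavior.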

I'd leave multithreading to the end, as finding bugs in multithreaded code is harder.

Just use the right library, like the Intel Math Kernel Library or a similar highly optimized linear algebra package (OpenBLAS, AMD Core Math Library, ATLAS, ...). They are considered faster than hand-written code. They sometimes even have processor-specific optimizations for instruction sets and cache sizes, and they are written by professionals in the field. Unless you plan to publish a paper on your own optimization, go with the library.

In the latest issue of the German computer magazine c't, they claim the compiler is smart enough to use SSE or AVX by itself: just write the right loops and the auto-vectorizer will bring the best results. This is true for the latest Intel compiler; Microsoft's compiler is too dumb. In some cases, with the right compiler flags, Intel's compiler even detects that you are programming a matrix multiplication and replaces it with the right library call. Or check the documentation; it is not that hard to learn such a package.
