
Using inline assembly to speed up matrix multiplication

I have been trying to speed up matrix-matrix multiplication C <- C + alpha * A * B via register blocking, SSE2 vectorization and L1 cache blocking (note that I have specifically chosen the transpose setting op(A) = A and op(B) = B). After some effort, my code is still about 50% slower than GotoBLAS in single-thread mode.

The following is my C code for the "kernel" square matrix-matrix multiplication on the L1 cache, called "DGEBB" (general block-block operation) in Goto's work, which multiplies two NB*NB square matrices (NB restricted to be a multiple of 4). I have examined its assembly output under GCC 4.8, realizing that the compiler does not do a good job of scheduling the unrolled innermost loop, the kk-loop. What I hoped was that the compiler would optimize register allocation to attain register reuse, and schedule the computation so that multiplication, addition and memory operations are interleaved for pipelining; however, the compiler failed to do this. For this reason, I would like to replace the innermost loop with some inline assembly.

I am completely new to x86 assembly. Though I have read around GCC's extended asm for hours, I am still not sure how to do it properly. I have attached a naive version, the best I could write, while knowing it is wrong. It is modified from the compiler's original assembly output for the kk-loop. Since I know how to allocate registers using "movl", "movapd", etc., I have re-arranged the computation in the order I fancy. But it does not work yet. 1) It seems to me that the registers %eax, %ebx and %ecx are used both inside and outside the assembly, which is nasty. 2) The way I pass the input and output operands does not work. 3) Finally, I really want a version where the whole kk-loop can be inlined. Thanks to anyone who can help me out!

The C code for DGEBB (called DGEBB_SSE2_x86, as my laptop is a 32-bit x86 machine with SSE2 - SSE4.1 support):

#include <stdint.h>  /* type define of "uintptr_t" */
#include <emmintrin.h>  /* double precision computation support since SSE2 */
#include <R.h>  /* use R's error handling error() */

void DGEBB_SSE2_x86 (int *NB, double *ALPHA, double *A, double *B, double *C) {
/* check "nb", must be a multiple of 4 */
int TWO=2, FOUR=4, nb=*NB; if (nb%FOUR) error("error in DGEBB_SSE2_x86: nb is not a multiple of 4!\n");
/* check memory alignment of A, B, C, 16 Byte alignment is mandatory (as XMM registers are 128-bit in length) */
uintptr_t sixteen_bytes=0xF;
if ((uintptr_t)A & sixteen_bytes) error("error in DGEBB_SSE2_x86: A is not 16 Bytes aligned in memory!");
if ((uintptr_t)B & sixteen_bytes) error("error in DGEBB_SSE2_x86: B is not 16 Bytes aligned in memory!");
if ((uintptr_t)C & sixteen_bytes) error("error in DGEBB_SSE2_x86: C is not 16 Bytes aligned in memory!");
/* define vector variables */
__m128d C1_vec_reg=_mm_setzero_pd(), C2_vec_reg=C1_vec_reg, C3_vec_reg=C1_vec_reg, C4_vec_reg=C1_vec_reg,A1_vec_reg, A2_vec_reg, B_vec_reg, U_vec_reg;
/* define scalar variables */
int jj, kk, ii, nb2=nb+nb, nb_half=nb/TWO;
double *B1_copy, *B1, *C1, *a, *b, *c, *c0;
/* start triple loop nest */
C1=C;B1=B;  /* initial column tile of C and B */
jj=nb_half;
while (jj--) {
  c=C1;B1_copy=B1;C1+=nb2;B1+=nb2;b=B1_copy;
  for (ii=0; ii<nb; ii+=FOUR) {
    a=A+ii;b=B1_copy;
    kk=nb_half;
    while (kk--) {
      /* [kernel] amortize pointer arithmetic! */
      A1_vec_reg=_mm_load_pd(a);  /* [fetch] */
      B_vec_reg=_mm_load1_pd(b);  /* [fetch] */
      U_vec_reg=_mm_mul_pd(A1_vec_reg,B_vec_reg);C1_vec_reg=_mm_add_pd(C1_vec_reg,U_vec_reg);  /* [daxpy] */
      A2_vec_reg=_mm_load_pd(a+TWO);a+=nb;  /* [fetch] */
      U_vec_reg=_mm_mul_pd(A2_vec_reg,B_vec_reg);C2_vec_reg=_mm_add_pd(C2_vec_reg,U_vec_reg);  /* [daxpy] */
      B_vec_reg=_mm_load1_pd(b+nb);b++;  /* [fetch] */
      U_vec_reg=_mm_mul_pd(A1_vec_reg,B_vec_reg);C3_vec_reg=_mm_add_pd(C3_vec_reg,U_vec_reg);  /* [daxpy] */
      A1_vec_reg=_mm_load_pd(a);  /* [fetch] */
      U_vec_reg=_mm_mul_pd(A2_vec_reg,B_vec_reg);C4_vec_reg=_mm_add_pd(C4_vec_reg,U_vec_reg);  /* [daxpy] */
      B_vec_reg=_mm_load1_pd(b);  /* [fetch] */
      U_vec_reg=_mm_mul_pd(A1_vec_reg,B_vec_reg);C1_vec_reg=_mm_add_pd(C1_vec_reg,U_vec_reg);  /* [daxpy] */
      A2_vec_reg=_mm_load_pd(a+TWO);a+=nb;  /* [fetch] */
      U_vec_reg=_mm_mul_pd(A2_vec_reg,B_vec_reg);C2_vec_reg=_mm_add_pd(C2_vec_reg,U_vec_reg);  /* [daxpy] */
      B_vec_reg=_mm_load1_pd(b+nb);b++;  /* [fetch] */
      U_vec_reg=_mm_mul_pd(A1_vec_reg,B_vec_reg);C3_vec_reg=_mm_add_pd(C3_vec_reg,U_vec_reg);  /* [daxpy] */
      U_vec_reg=_mm_mul_pd(A2_vec_reg,B_vec_reg);C4_vec_reg=_mm_add_pd(C4_vec_reg,U_vec_reg);  /* [daxpy] */
    }  /* [end of kk-loop] */
    /* [write-back] amortize pointer arithmetic! */
    A2_vec_reg=_mm_load1_pd(ALPHA);
    U_vec_reg=_mm_load_pd(c);c0=c+nb;C1_vec_reg=_mm_mul_pd(C1_vec_reg,A2_vec_reg);  /* [fetch] */
    A1_vec_reg=U_vec_reg;C1_vec_reg=_mm_add_pd(C1_vec_reg,A1_vec_reg);U_vec_reg=_mm_load_pd(c0);  /* [fetch] */
    C3_vec_reg=_mm_mul_pd(C3_vec_reg,A2_vec_reg);_mm_store_pd(c,C1_vec_reg);c+=TWO;  /* [store] */
    A1_vec_reg=U_vec_reg;C3_vec_reg=_mm_add_pd(C3_vec_reg,A1_vec_reg);U_vec_reg=_mm_load_pd(c);  /* [fetch] */
    C2_vec_reg=_mm_mul_pd(C2_vec_reg,A2_vec_reg);_mm_store_pd(c0,C3_vec_reg);c0+=TWO;  /* [store] */
    A1_vec_reg=U_vec_reg;C2_vec_reg=_mm_add_pd(C2_vec_reg,A1_vec_reg);U_vec_reg=_mm_load_pd(c0);  /* [fetch] */
    C4_vec_reg=_mm_mul_pd(C4_vec_reg,A2_vec_reg);_mm_store_pd(c,C2_vec_reg);c+=TWO;  /* [store] */
    C4_vec_reg=_mm_add_pd(C4_vec_reg,U_vec_reg);_mm_store_pd(c0,C4_vec_reg);  /* [store] */
    C1_vec_reg=_mm_setzero_pd();C3_vec_reg=C1_vec_reg;C2_vec_reg=C1_vec_reg;C4_vec_reg=C1_vec_reg;
  }  /* [end of ii-loop] */
}  /* [end of jj-loop] */
}

My naive version of inline assembly for the kk-loop is here:

while (kk--) {
  asm("movapd %0, %%xmm3\n\t"            /* C1_vec_reg -> xmm3 */
      "movapd %1, %%xmm1\n\t"            /* C2_vec_reg -> xmm1 */
      "movapd %2, %%xmm2\n\t"            /* C3_vec_reg -> xmm2 */
      "movapd %3, %%xmm0\n\t"            /* C4_vec_reg -> xmm0 */
      "movl %4, %%eax\n\t"               /* pointer a -> %eax */
      "movl %5, %%edx\n\t"               /* pointer b -> %edx */
      "movl %6, %%ecx\n\t"               /* block size nb -> %ecx */
      "movapd (%%eax), %%xmm5\n\t"       /* A1_vec_reg -> xmm5 */
      "movsd (%%edx), %%xmm4\n\t"        /* B_vec_reg -> xmm4 */
      "unpcklpd %%xmm4, %%xmm4\n\t"
      "movapd %%xmm5, %%xmm6\n\t"        /* xmm5 -> xmm6 */
      "mulpd %%xmm4, %%xmm6\n\t"         /* xmm6 *= xmm4 */
      "addpd %%xmm6, %%xmm3\n\t"         /* xmm3 += xmm6 */
      "movapd 16(%%eax), %%xmm7\n\t"     /* A2_vec_reg -> xmm7 */
      "movapd %%xmm7, %%xmm6\n\t"        /* xmm7 -> xmm6 */
      "mulpd %%xmm4, %%xmm6\n\t"         /* xmm6 *= xmm4 */
      "addpd %%xmm6, %%xmm1\n\t"         /* xmm1 += xmm6 */
      "movsd (%%edx,%%ecx), %%xmm4\n\t"  /* B_vec_reg -> xmm4 */
      "addl $8, %%edx\n\t"               /* b++ */
      "movsd (%%edx), %%xmm4\n\t"        /* B_vec_reg -> xmm4 */
      "unpcklpd %%xmm4, %%xmm4\n\t"
      "movapd %%xmm5, %%xmm6\n\t"        /* xmm5 -> xmm6 */
      "mulpd %%xmm4, %%xmm6\n\t"         /* xmm6 *= xmm4 */
      "addpd %%xmm6, %%xmm2\n\t"         /* xmm2 += xmm6 */
      "addl %%ecx, %%eax\n\t"            /* a += nb */
      "movapd (%%eax), %%xmm5\n\t"       /* A1_vec_reg -> xmm5 */
      "movapd %%xmm7, %%xmm6\n\t"        /* xmm7 -> xmm6 */
      "mulpd %%xmm4, %%xmm6\n\t"         /* xmm6 *= xmm4 */
      "addpd %%xmm6, %%xmm0\n\t"         /* xmm0 += xmm6 */
      "movsd (%%edx), %%xmm4\n\t"        /* B_vec_reg -> xmm4 */
      "unpcklpd %%xmm4, %%xmm4\n\t"
      "movapd %%xmm5, %%xmm6\n\t"        /* xmm5 -> xmm6 */
      "mulpd %%xmm4, %%xmm6\n\t"         /* xmm6 *= xmm4 */
      "addpd %%xmm6, %%xmm3\n\t"         /* xmm3 += xmm6 */
      "movapd 16(%%eax), %%xmm7\n\t"     /* A2_vec_reg -> xmm7 */
      "movapd %%xmm7, %%xmm6\n\t"        /* xmm7 -> xmm6 */
      "mulpd %%xmm4, %%xmm6\n\t"         /* xmm6 *= xmm4 */
      "addpd %%xmm6, %%xmm1\n\t"         /* xmm1 += xmm6 */
      "movsd (%%edx,%%ecx), %%xmm4\n\t"  /* B_vec_reg -> xmm4 */
      "addl $8, %%edx\n\t"               /* b++ */
      "movsd (%%edx), %%xmm4\n\t"        /* B_vec_reg -> xmm4 */
      "unpcklpd %%xmm4, %%xmm4\n\t"
      "movapd %%xmm5, %%xmm6\n\t"        /* xmm5 -> xmm6 */
      "mulpd %%xmm4, %%xmm6\n\t"         /* xmm6 *= xmm4 */
      "addpd %%xmm6, %%xmm2\n\t"         /* xmm2 += xmm6 */
      "movapd %%xmm7, %%xmm6\n\t"        /* xmm7 -> xmm6 */
      "mulpd %%xmm4, %%xmm6\n\t"         /* xmm6 *= xmm4 */
      "addpd %%xmm6, %%xmm0\n\t"         /* xmm0 += xmm6 */
      "addl %%ecx, %%eax"                /* a += nb */
      : "+x"(C1_vec_reg), "+x"(C2_vec_reg), "+x"(C3_vec_reg), "+x"(C4_vec_reg), "+m"(a), "+m"(b)
      : "x"(C1_vec_reg), "x"(C2_vec_reg), "x"(C3_vec_reg), "x"(C4_vec_reg), "4"(a), "5"(b), "rm"(nb));
}

Here is some explanation of the code:

Unrolling the loops exposes a micro "dger" kernel for register reuse:
 (c11 c12) += (a1) * (b1 b2)
 (c21 c22)    (a2)
 (c31 c32)    (a3)
 (c41 c42)    (a4)
This can be implemented as 4 vectorized "daxpy" operations:
 (c11) += (a1) * (b1)  ,  (c31) += (a3) * (b1)  ,  (c12) += (a1) * (b2)  ,  (c32) += (a3) * (b2)  .
 (c21)    (a2)   (b1)     (c41)    (a4)   (b1)     (c22)    (a2)   (b2)     (c42)    (a4)   (b2)
The 4 micro C-vectors are held constantly in the XMM registers named C1_vec_reg, C2_vec_reg, C3_vec_reg, C4_vec_reg.
The 2 micro A-vectors are loaded into the XMM registers named A1_vec_reg, A2_vec_reg.
The 2 micro B-vectors can reuse a single XMM register named B_vec_reg.
1 additional XMM register, U_vec_reg, stores temporary values.
This scheduling exploits all 8 XMM registers of 32-bit x86 with an SIMD unit, and each XMM register is used twice after being loaded.

PS: I am an R user from the stats group. The R.h header enables the use of R's error handling function error(), which just terminates the C routine rather than the whole R process. If you do not use R, delete this include and the corresponding lines in the code.

This is an old question from the early phase of development of my HPC Cholesky factorization routine. The C code is outdated, and the assembly is naively incorrect. Later posts follow this thread.

(inline assembly in C) Assembler messages: Error: unknown pseudo-op: gives a correct implementation of the inline assembly.

How to ask GCC to completely unroll this loop (i.e., peel this loop)? gives better C code.

When writing GCC inline assembly, care needs to be paid to potential changes of the status flags. (inline assembly in C) Funny memory segmentation fault was a lesson for me.

Vectorization is key to HPC. SSE instruction MOVSD (extended: floating point scalar & vector operations on x86, x86-64) contains some discussion of Intel SSE2/3, while FMA instruction _mm256_fmadd_pd(): "132", "231" and "213"? has some information on Intel AVX's FMA instructions.

Surely all of these relate only to computational kernels. There is a lot of other work on how everything is wrapped up into a final high-performance Cholesky factorization routine. The performance of the first release of my routine is covered in Why can't my CPU maintain peak performance in HPC.

Currently I am upgrading the kernel routine for even higher performance. Possibly there will be further posts in this thread. Thanks to the Stack Overflow community, especially Z boson, Peter Cordes and Nominal animal, for answering my various questions. I learnt a lot and felt really happy in this process. [And surely, at the same time, I learnt to be a better SO member.]
