
Using inline assembly to speed up matrix multiplication

I have been trying to speed up matrix-matrix multiplication C <- C + alpha * A * B via register blocking, SSE2 vectorization and L1 cache blocking (note that I have specifically chosen the transpose setting op(A) = A and op(B) = B). After some effort, my code is still about 50% slower than GotoBLAS in single-thread mode.

The following is my C code for the "kernel" square matrix-matrix multiplication on the L1 cache, called "DGEBB" (general block-block operation) in Goto's work, which multiplies two NB*NB square matrices (NB restricted to be a multiple of 4). I have examined its assembly output under GCC 4.8, realizing that the compiler does not do a good job of scheduling the unrolled innermost loop, the kk-loop. What I hoped was that the compiler would optimize register allocation to attain register reuse, and schedule the computation so that multiplication, addition and memory operations are interleaved for pipelining; however, the compiler failed to do this. For this reason, I would like to replace the innermost loop with some inline assembly.

I am completely new to x86 assembly. Though I have read around GCC's extended asm for hours, I am still not sure how to do it properly. I have attached a naive version, the best I could write, while knowing it is wrong. It is modified from the compiler's original assembly output for the kk-loop. Since I know how to allocate registers using "movl", "movapd", etc., I have re-arranged the computation in the order I fancy. But it does not work yet. 1) It seems to me that the registers %eax, %ebx and %ecx are used both inside and outside the assembly, which is nasty. 2) The way I pass the input and output operands does not work. 3) Finally, I really want a version where the whole kk-loop can be inlined. Thanks to anyone who can help me out!

The C code for DGEBB (called DGEBB_SSE2_x86, as my laptop is a 32-bit x86 machine with SSE2 - SSE4.1 support):

#include <stdint.h>  /* type define of "uintptr_t" */
#include <emmintrin.h>  /* double precision computation support since SSE2 */
#include <R.h>  /* use R's error handling error() */

void DGEBB_SSE2_x86 (int *NB, double *ALPHA, double *A, double *B, double *C) {
/* check "nb", must be a multiple of 4 */
int TWO=2, FOUR=4, nb=*NB; if (nb%FOUR) error("error in DGEBB_SSE2_x86: nb is not a multiple of 4!\n");
/* check memory alignment of A, B, C, 16 Byte alignment is mandatory (as XMM registers are 128-bit in length) */
uintptr_t sixteen_bytes=0xF;
if ((uintptr_t)A & sixteen_bytes) error("error in DGEBB_SSE2_x86: A is not 16 Bytes aligned in memory!");
if ((uintptr_t)B & sixteen_bytes) error("error in DGEBB_SSE2_x86: B is not 16 Bytes aligned in memory!");
if ((uintptr_t)C & sixteen_bytes) error("error in DGEBB_SSE2_x86: C is not 16 Bytes aligned in memory!");
/* define vector variables */
__m128d C1_vec_reg=_mm_setzero_pd(), C2_vec_reg=C1_vec_reg, C3_vec_reg=C1_vec_reg, C4_vec_reg=C1_vec_reg,A1_vec_reg, A2_vec_reg, B_vec_reg, U_vec_reg;
/* define scalar variables */
int jj, kk, ii, nb2=nb+nb, nb_half=nb/TWO;
double *B1_copy, *B1, *C1, *a, *b, *c, *c0;
/* start triple loop nest */
C1=C;B1=B;  /* initial column tile of C and B */
jj=nb_half;
while (jj--) {
  c=C1;B1_copy=B1;C1+=nb2;B1+=nb2;b=B1_copy;
  for (ii=0; ii<nb; ii+=FOUR) {
    a=A+ii;b=B1_copy;
    kk=nb_half;
    while (kk--) {
      /* [kernel] amortize pointer arithmetic! */
      A1_vec_reg=_mm_load_pd(a);  /* [fetch] */
      B_vec_reg=_mm_load1_pd(b);  /* [fetch] */
      U_vec_reg=_mm_mul_pd(A1_vec_reg,B_vec_reg);C1_vec_reg=_mm_add_pd(C1_vec_reg,U_vec_reg);  /* [daxpy] */
      A2_vec_reg=_mm_load_pd(a+TWO);a+=nb;  /* [fetch] */
      U_vec_reg=_mm_mul_pd(A2_vec_reg,B_vec_reg);C2_vec_reg=_mm_add_pd(C2_vec_reg,U_vec_reg);  /* [daxpy] */
      B_vec_reg=_mm_load1_pd(b+nb);b++;  /* [fetch] */
      U_vec_reg=_mm_mul_pd(A1_vec_reg,B_vec_reg);C3_vec_reg=_mm_add_pd(C3_vec_reg,U_vec_reg);  /* [daxpy] */
      A1_vec_reg=_mm_load_pd(a);  /* [fetch] */
      U_vec_reg=_mm_mul_pd(A2_vec_reg,B_vec_reg);C4_vec_reg=_mm_add_pd(C4_vec_reg,U_vec_reg);  /* [daxpy] */
      B_vec_reg=_mm_load1_pd(b);  /* [fetch] */
      U_vec_reg=_mm_mul_pd(A1_vec_reg,B_vec_reg);C1_vec_reg=_mm_add_pd(C1_vec_reg,U_vec_reg);  /* [daxpy] */
      A2_vec_reg=_mm_load_pd(a+TWO);a+=nb;  /* [fetch] */
      U_vec_reg=_mm_mul_pd(A2_vec_reg,B_vec_reg);C2_vec_reg=_mm_add_pd(C2_vec_reg,U_vec_reg);  /* [daxpy] */
      B_vec_reg=_mm_load1_pd(b+nb);b++;  /* [fetch] */
      U_vec_reg=_mm_mul_pd(A1_vec_reg,B_vec_reg);C3_vec_reg=_mm_add_pd(C3_vec_reg,U_vec_reg);  /* [daxpy] */
      U_vec_reg=_mm_mul_pd(A2_vec_reg,B_vec_reg);C4_vec_reg=_mm_add_pd(C4_vec_reg,U_vec_reg);  /* [daxpy] */
    }  /* [end of kk-loop] */
    /* [write-back] amortize pointer arithmetic! */
    A2_vec_reg=_mm_load1_pd(ALPHA);
    U_vec_reg=_mm_load_pd(c);c0=c+nb;C1_vec_reg=_mm_mul_pd(C1_vec_reg,A2_vec_reg);  /* [fetch] */
    A1_vec_reg=U_vec_reg;C1_vec_reg=_mm_add_pd(C1_vec_reg,A1_vec_reg);U_vec_reg=_mm_load_pd(c0);  /* [fetch] */
    C3_vec_reg=_mm_mul_pd(C3_vec_reg,A2_vec_reg);_mm_store_pd(c,C1_vec_reg);c+=TWO;  /* [store] */
    A1_vec_reg=U_vec_reg;C3_vec_reg=_mm_add_pd(C3_vec_reg,A1_vec_reg);U_vec_reg=_mm_load_pd(c);  /* [fetch] */
    C2_vec_reg=_mm_mul_pd(C2_vec_reg,A2_vec_reg);_mm_store_pd(c0,C3_vec_reg);c0+=TWO;  /* [store] */
    A1_vec_reg=U_vec_reg;C2_vec_reg=_mm_add_pd(C2_vec_reg,A1_vec_reg);U_vec_reg=_mm_load_pd(c0);  /* [fetch] */
    C4_vec_reg=_mm_mul_pd(C4_vec_reg,A2_vec_reg);_mm_store_pd(c,C2_vec_reg);c+=TWO;  /* [store] */
    C4_vec_reg=_mm_add_pd(C4_vec_reg,U_vec_reg);_mm_store_pd(c0,C4_vec_reg);  /* [store] */
    C1_vec_reg=_mm_setzero_pd();C3_vec_reg=C1_vec_reg;C2_vec_reg=C1_vec_reg;C4_vec_reg=C1_vec_reg;
  }  /* [end of ii-loop] */
}  /* [end of jj-loop] */
}

My naive version of inline assembly for the kk-loop is here:

while (kk--) {
  asm("movapd %0, %%xmm3\n\t"            /* C1_vec_reg -> xmm3 */
      "movapd %1, %%xmm1\n\t"            /* C2_vec_reg -> xmm1 */
      "movapd %2, %%xmm2\n\t"            /* C3_vec_reg -> xmm2 */
      "movapd %3, %%xmm0\n\t"            /* C4_vec_reg -> xmm0 */
      "movl %4, %%eax\n\t"               /* pointer a -> %eax */
      "movl %5, %%edx\n\t"               /* pointer b -> %edx */
      "movl %6, %%ecx\n\t"               /* block size nb -> %ecx */
      "movapd (%%eax), %%xmm5\n\t"       /* A1_vec_reg -> xmm5 */
      "movsd (%%edx), %%xmm4\n\t"        /* B_vec_reg -> xmm4 */
      "unpcklpd %%xmm4, %%xmm4\n\t"
      "movapd %%xmm5, %%xmm6\n\t"        /* xmm5 -> xmm6 */
      "mulpd %%xmm4, %%xmm6\n\t"         /* xmm6 *= xmm4 */
      "addpd %%xmm6, %%xmm3\n\t"         /* xmm3 += xmm6 */
      "movapd 16(%%eax), %%xmm7\n\t"     /* A2_vec_reg -> xmm7 */
      "movapd %%xmm7, %%xmm6\n\t"        /* xmm7 -> xmm6 */
      "mulpd %%xmm4, %%xmm6\n\t"         /* xmm6 *= xmm4 */
      "addpd %%xmm6, %%xmm1\n\t"         /* xmm1 += xmm6 */
      "movsd (%%edx,%%ecx), %%xmm4\n\t"  /* B_vec_reg -> xmm4 */
      "addl $8, %%edx\n\t"               /* b++ */
      "movsd (%%edx), %%xmm4\n\t"        /* B_vec_reg -> xmm4 */
      "unpcklpd %%xmm4, %%xmm4\n\t"
      "movapd %%xmm5, %%xmm6\n\t"        /* xmm5 -> xmm6 */
      "mulpd %%xmm4, %%xmm6\n\t"         /* xmm6 *= xmm4 */
      "addpd %%xmm6, %%xmm2\n\t"         /* xmm2 += xmm6 */
      "addl %%ecx, %%eax\n\t"            /* a += nb */
      "movapd (%%eax), %%xmm5\n\t"       /* A1_vec_reg -> xmm5 */
      "movapd %%xmm7, %%xmm6\n\t"        /* xmm7 -> xmm6 */
      "mulpd %%xmm4, %%xmm6\n\t"         /* xmm6 *= xmm4 */
      "addpd %%xmm6, %%xmm0\n\t"         /* xmm0 += xmm6 */
      "movsd (%%edx), %%xmm4\n\t"        /* B_vec_reg -> xmm4 */
      "unpcklpd %%xmm4, %%xmm4\n\t"
      "movapd %%xmm5, %%xmm6\n\t"        /* xmm5 -> xmm6 */
      "mulpd %%xmm4, %%xmm6\n\t"         /* xmm6 *= xmm4 */
      "addpd %%xmm6, %%xmm3\n\t"         /* xmm3 += xmm6 */
      "movapd 16(%%eax), %%xmm7\n\t"     /* A2_vec_reg -> xmm7 */
      "movapd %%xmm7, %%xmm6\n\t"        /* xmm7 -> xmm6 */
      "mulpd %%xmm4, %%xmm6\n\t"         /* xmm6 *= xmm4 */
      "addpd %%xmm6, %%xmm1\n\t"         /* xmm1 += xmm6 */
      "movsd (%%edx,%%ecx), %%xmm4\n\t"  /* B_vec_reg -> xmm4 */
      "addl $8, %%edx\n\t"               /* b++ */
      "movsd (%%edx), %%xmm4\n\t"        /* B_vec_reg -> xmm4 */
      "unpcklpd %%xmm4, %%xmm4\n\t"
      "movapd %%xmm5, %%xmm6\n\t"        /* xmm5 -> xmm6 */
      "mulpd %%xmm4, %%xmm6\n\t"         /* xmm6 *= xmm4 */
      "addpd %%xmm6, %%xmm2\n\t"         /* xmm2 += xmm6 */
      "movapd %%xmm7, %%xmm6\n\t"        /* xmm7 -> xmm6 */
      "mulpd %%xmm4, %%xmm6\n\t"         /* xmm6 *= xmm4 */
      "addpd %%xmm6, %%xmm0\n\t"         /* xmm0 += xmm6 */
      "addl %%ecx, %%eax"                /* a += nb */
      : "+x"(C1_vec_reg), "+x"(C2_vec_reg), "+x"(C3_vec_reg), "+x"(C4_vec_reg), "+m"(a), "+m"(b)
      : "x"(C1_vec_reg), "x"(C2_vec_reg), "x"(C3_vec_reg), "x"(C4_vec_reg), "4"(a), "5"(b), "rm"(nb));
}

Here is some explanation of the code:

Unrolling the loops exposes a micro "dger" kernel for register reuse:
 (c11 c12) += (a1) * (b1 b2)
 (c21 c22)    (a2)
 (c31 c32)    (a3)
 (c41 c42)    (a4)
This can be implemented as 4 vectorized "daxpy" operations:
 (c11) += (a1) * (b1)  ,  (c31) += (a3) * (b1)  ,  (c12) += (a1) * (b2)  ,  (c32) += (a3) * (b2)  .
 (c21)    (a2)   (b1)     (c41)    (a4)   (b1)     (c22)    (a2)   (b2)     (c42)    (a4)   (b2)
The 4 micro C-vectors are held constantly in the XMM registers named C1_vec_reg, C2_vec_reg, C3_vec_reg, C4_vec_reg.
The 2 micro A-vectors are loaded into the XMM registers named A1_vec_reg, A2_vec_reg.
The 2 micro B-vectors can reuse a single XMM register named B_vec_reg.
1 additional XMM register, U_vec_reg, stores temporary values.
This scheduling exploits all 8 XMM registers of 32-bit x86 with an SIMD unit, and each XMM register is used twice after being loaded.

PS: I am an R user from the stats group. The R.h header enables the use of R's error handling function error(), which just terminates the C routine rather than the whole R process. If you do not use R, delete this include and the corresponding lines in the code.

This is an old question from the early phase of development of my HPC Cholesky factorization routine. The C code is outdated, and the assembly is naively incorrect. Later posts follow this thread.

(inline assembly in C) Assembler messages: Error: unknown pseudo-op: gives a correct implementation of the inline assembly.

How to ask GCC to completely unroll this loop (i.e., peel this loop)? gives better C code.

When writing GCC inline assembly, care needs to be paid to potential changes of the status flags. (inline assembly in C) Funny memory segmentation fault was a lesson for me.

Vectorization is key to HPC. SSE instruction MOVSD (extended: floating point scalar & vector operations on x86, x86-64) contains some discussion of Intel SSE2/3, while FMA instruction _mm256_fmadd_pd(): "132", "231" and "213"? has some information on Intel AVX's FMA instructions.

Surely all of these relate only to computational kernels. There is a lot of other work on how everything is wrapped up into a final high-performance Cholesky factorization routine. The performance of the first release of my routine is covered in Why can't my CPU maintain peak performance in HPC.

Currently I am upgrading the kernel routine for even higher performance. Possibly there will be further posts in this thread. Thanks to the Stack Overflow community, especially Z boson, Peter Cordes and Nominal animal, for answering my various questions. I learnt a lot and felt really happy in this process. [And surely, at the same time, I learnt to be a better SO member.]
