Understanding vectorization with SSE instructions

Question

I am trying to understand how vectorization with SSE instructions works.

Here is a code snippet where vectorization is achieved:

#include <stdlib.h>
#include <stdio.h>

#define SIZE 10000

void test1(double * restrict a, double * restrict b)
{
  int i;

  double *x = __builtin_assume_aligned(a, 16);
  double *y = __builtin_assume_aligned(b, 16);

  for (i = 0; i < SIZE; i++)
  {
    x[i] += y[i];
  }
}

and my compilation command:

gcc -std=c99 -c example1.c -O3 -S -o example1.s

Here is the generated assembly output:

 .file "example1.c"
  .text
  .p2align 4,,15
  .globl  test1
  .type test1, @function
test1:
.LFB7:
  .cfi_startproc
  xorl  %eax, %eax
  .p2align 4,,10
  .p2align 3
.L3:
  movapd  (%rdi,%rax), %xmm0
  addpd (%rsi,%rax), %xmm0
  movapd  %xmm0, (%rdi,%rax)
  addq  $16, %rax
  cmpq  $80000, %rax
  jne .L3
  rep ret
  .cfi_endproc
.LFE7:
  .size test1, .-test1
  .ident  "GCC: (Debian 4.8.2-16) 4.8.2"
  .section  .note.GNU-stack,"",@progbits

I practiced assembly many years ago, and I would like to know what the registers %rdi, %rax and %rsi represent in the code above.

I know %xmm0 is a SIMD register that can hold 2 doubles (16 bytes).

But I don't understand how the simultaneous addition is performed.

I think it all happens here:

      movapd  (%rdi,%rax), %xmm0
      addpd (%rsi,%rax), %xmm0
      movapd  %xmm0, (%rdi,%rax)
      addq  $16, %rax
      cmpq  $80000, %rax
      jne .L3
      rep ret

Does %rax represent the "x" array?

What does %rsi represent in the C code snippet?

Is the final result (for example a[0] = a[0] + b[0]) stored into %rdi?

Thanks for your help.

Answer 1

The first thing you need to know is the calling convention for 64-bit code on Unix systems. See Wikipedia's x86-64_calling_conventions and, for much more detail, read Agner Fog's calling conventions manual.

Integer and pointer parameters are passed in registers in the following order: rdi, rsi, rdx, rcx, r8, r9. So you can pass up to six integer values by register (but only four on Windows). In your case this means that:

rdi = &x[0],
rsi = &y[0].

The register rax starts at zero and is incremented by 2*sizeof(double) = 16 bytes each iteration. After each increment it is compared with sizeof(double)*10000 = 80000 to test whether the loop is finished.
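To make the role of rax concrete, here is a minimal C sketch that walks a byte offset the same way the compiled loop does. The function name add_by_byte_offset and the parameter n are hypothetical, introduced only for illustration, and the 80000 figure assumes sizeof(double) is 8 and n is 10000.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: adds y into x two doubles at a time,
   advancing a byte offset the way %rax does in the listing. */
void add_by_byte_offset(double *x, const double *y, size_t n)
{
    assert(n % 2 == 0);                     /* the loop only handles pairs */
    const size_t end = n * sizeof(double);  /* 80000 when n == 10000 */
    char *xb = (char *)x;                   /* plays the role of %rdi */
    const char *yb = (const char *)y;       /* plays the role of %rsi */

    for (size_t off = 0; off != end; off += 2 * sizeof(double)) {
        double *xp = (double *)(xb + off);
        const double *yp = (const double *)(yb + off);
        xp[0] += yp[0];                     /* low lane of addpd */
        xp[1] += yp[1];                     /* high lane of addpd */
    }
}
```

The two scalar additions in the loop body stand in for the single addpd, which performs both in one instruction.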

The use of cmp here is actually an inefficiency in GCC. Modern Intel processors can fuse cmp and jne into one micro-operation, and they can also fuse add and jne into one micro-operation, but they cannot fuse add, cmp, and jne together. It is, however, possible to remove the cmp instruction.

What GCC should have done is set:

rdi = (char *)&x[0] + 80000;
rsi = (char *)&y[0] + 80000;
rax = -80000;

and then the loop could be written like this:

movapd  (%rdi,%rax), %xmm0       # temp = x[i]
addpd (%rsi,%rax), %xmm0         # temp += y[i]
movapd  %xmm0, (%rdi,%rax)       # x[i] = temp
addq  $16, %rax                  # i += 2 (sets ZF when rax reaches 0)
jnz .L3                          # loop while rax != 0

Now the loop counts from -80000 up to 0 and does not need the cmp instruction, and the add and jnz will be fused into one micro-operation.
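The same idea can be sketched at the C level: point at the end of each array and run a negative index up to zero, so the loop test is simply "index not zero". The name add_counting_up_to_zero and the parameter n are hypothetical, and this only illustrates the loop shape; a compiler is free to generate different code from it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: indexes from the END of the arrays with a
   negative index that counts up to zero, so no separate compare
   against an upper bound is needed. */
void add_counting_up_to_zero(double *x, const double *y, ptrdiff_t n)
{
    assert(n >= 0 && n % 2 == 0);
    double *x_end = x + n;          /* like rdi adjusted past the array */
    const double *y_end = y + n;    /* like rsi adjusted past the array */

    for (ptrdiff_t i = -n; i != 0; i += 2) {
        x_end[i]     += y_end[i];
        x_end[i + 1] += y_end[i + 1];
    }
}
```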

Answer 2

  movapd  (%rdi,%rax), %xmm0       # temp = x[i]
  addpd (%rsi,%rax), %xmm0         # temp += y[i]
  movapd  %xmm0, (%rdi,%rax)       # x[i] = temp
  addq  $16, %rax                  # i += 2
  cmpq  $80000, %rax               # if (i != SIZE)
  jne .L3                          # then loop

The rax register represents the i variable, but stored as a byte offset; rdi holds &x[0] and rsi holds &y[0]. Each pass through the loop adds two doubles, hence the increment of rax by 2 * sizeof(double), or 16 bytes.
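For comparison, the three vector instructions map closely onto SSE2 intrinsics from emmintrin.h. The sketch below is not what GCC produced from the question's code; it is a hand-written equivalent under the same assumptions (16-byte aligned pointers, even element count). The name add_sse2 and the parameter n are hypothetical, and the scalar branch is only there so the file still builds on targets without SSE2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics: __m128d, _mm_load_pd, ... */
#endif

/* Hypothetical helper: x[i] += y[i], two doubles per iteration. */
void add_sse2(double *x, const double *y, size_t n)
{
    assert(n % 2 == 0);
    /* movapd requires 16-byte alignment, and so does _mm_load_pd. */
    assert((uintptr_t)x % 16 == 0 && (uintptr_t)y % 16 == 0);
#if defined(__SSE2__)
    for (size_t i = 0; i < n; i += 2) {
        __m128d t = _mm_load_pd(&x[i]);         /* movapd (%rdi,%rax), %xmm0 */
        t = _mm_add_pd(t, _mm_load_pd(&y[i]));  /* addpd  (%rsi,%rax), %xmm0 */
        _mm_store_pd(&x[i], t);                 /* movapd %xmm0, (%rdi,%rax) */
    }
#else
    for (size_t i = 0; i < n; i++)              /* scalar fallback */
        x[i] += y[i];
#endif
}
```

Arrays passed to this function need 16-byte alignment, for example by declaring them with _Alignas(16) in C11.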
