Understanding vectorization with SSE instructions
I am trying to understand how vectorization with SSE instructions works.
Here is a code snippet where vectorization is achieved:
#include <stdlib.h>
#include <stdio.h>
#define SIZE 10000
void test1(double * restrict a, double * restrict b)
{
    int i;
    double *x = __builtin_assume_aligned(a, 16);
    double *y = __builtin_assume_aligned(b, 16);
    for (i = 0; i < SIZE; i++)
    {
        x[i] += y[i];
    }
}
and my compilation command:
gcc -std=c99 -c example1.c -O3 -S -o example1.s
Here is the assembler output:
.file "example1.c"
.text
.p2align 4,,15
.globl test1
.type test1, @function
test1:
.LFB7:
.cfi_startproc
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L3:
movapd (%rdi,%rax), %xmm0
addpd (%rsi,%rax), %xmm0
movapd %xmm0, (%rdi,%rax)
addq $16, %rax
cmpq $80000, %rax
jne .L3
rep ret
.cfi_endproc
.LFE7:
.size test1, .-test1
.ident "GCC: (Debian 4.8.2-16) 4.8.2"
.section .note.GNU-stack,"",@progbits
I practiced assembly many years ago, and I would like to know what the registers %rdi, %rax and %rsi represent in the code above.
I know %xmm0 is the SIMD register where we can store 2 doubles (16 bytes).
But I don't understand how the simultaneous addition is performed.
I think it all happens here:
movapd (%rdi,%rax), %xmm0
addpd (%rsi,%rax), %xmm0
movapd %xmm0, (%rdi,%rax)
addq $16, %rax
cmpq $80000, %rax
jne .L3
rep ret
Does %rax represent the "x" array?
What does %rsi represent in the C code snippet?
Is the final result (for example a[0] = a[0] + b[0]) stored into %rdi?
Thanks for your help.
The first thing you need to know is the calling conventions for 64-bit code on Unix systems. See Wikipedia's x86-64 calling conventions, and for much more detail read Agner Fog's calling conventions manual.
Integer and pointer parameters are passed in the following order: rdi, rsi, rdx, rcx, r8, r9. So you can pass up to six integer values by register (but only four on Windows). This means in your case that:
rdi = &x[0],
rsi = &y[0].
The register rax starts at zero and increments by 2*sizeof(double)=16 bytes each iteration. It is then compared with sizeof(double)*10000=80000 each iteration to test if the loop is finished.
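To make the register roles concrete, here is a minimal C sketch of what the compiled loop does, written with SSE2 intrinsics. The function name add_pairs_sse2 and the length parameter n are hypothetical (the original uses a fixed SIZE), and the sketch assumes an even n and 16-byte-aligned arrays. It is an illustration of the mapping, not the code GCC generates internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2: __m128d, _mm_load_pd, _mm_add_pd, _mm_store_pd */
#endif

/* Hypothetical helper mirroring the compiled loop.
   Assumes n is even and both arrays are 16-byte aligned. */
void add_pairs_sse2(double *x, const double *y, size_t n)
{
    assert((uintptr_t)x % 16 == 0 && (uintptr_t)y % 16 == 0);
    assert(n % 2 == 0);

    char *xb = (char *)x;              /* plays the role of %rdi */
    const char *yb = (const char *)y;  /* plays the role of %rsi */

    /* off plays the role of %rax: a byte offset advancing by 16 */
    for (size_t off = 0; off != n * sizeof(double); off += 2 * sizeof(double)) {
#if defined(__SSE2__)
        __m128d t = _mm_load_pd((const double *)(xb + off));         /* movapd (%rdi,%rax), %xmm0 */
        t = _mm_add_pd(t, _mm_load_pd((const double *)(yb + off)));  /* addpd  (%rsi,%rax), %xmm0 */
        _mm_store_pd((double *)(xb + off), t);                       /* movapd %xmm0, (%rdi,%rax) */
#else
        /* Scalar fallback so the sketch still compiles without SSE2. */
        double *px = (double *)(xb + off);
        const double *py = (const double *)(yb + off);
        px[0] += py[0];
        px[1] += py[1];
#endif
    }
}
```

Each iteration loads two doubles from x, adds two doubles from y in a single packed operation, and writes both results back to x, so the result lives in memory at the address held in rdi, not in rdi itself.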
The use of cmp here is actually an inefficiency in the GCC compiler. Modern Intel processors can fuse cmp and jne into one micro-operation, and they can also fuse add and jne into one micro-operation, but they cannot fuse add, cmp, and jne together. It is possible, however, to remove the cmp instruction.
What GCC should have done is set:
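rdi = (char *)&x[0] + 80000;  /* byte address one past the end of x */
rsi = (char *)&y[0] + 80000;  /* byte address one past the end of y */
rax = -80000;                 /* negative byte offset of the first element */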
and then the loop could be done like this:
.L3:
movapd (%rdi,%rax), %xmm0 # temp = x[i]
addpd (%rsi,%rax), %xmm0 # temp += y[i]
movapd %xmm0, (%rdi,%rax) # x[i] = temp
addq $16, %rax # i += 2
jnz .L3 # loop until rax reaches 0
Now the loop counts from -80000 up to 0 and does not need the cmp instruction, and the add and jnz will be fused into one micro-operation.
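The same idea can be sketched at the C level: point at the end of each array and run a negative byte offset up to zero, so the loop condition is simply "offset is not zero". The function name add_counting_up_to_zero and the length parameter n are hypothetical, and the sketch assumes an even n. Whether a given compiler actually emits the add/jnz form for this source is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: adds y into x, two doubles per iteration,
   indexing backwards from the end of each array. Assumes n is even. */
void add_counting_up_to_zero(double *x, const double *y, size_t n)
{
    assert(n % 2 == 0);

    char *x_end = (char *)(x + n);              /* like rdi = end of x */
    const char *y_end = (const char *)(y + n);  /* like rsi = end of y */
    const ptrdiff_t step = (ptrdiff_t)(2 * sizeof(double));

    /* off is like rax: starts at -n*8 and climbs to 0 in steps of 16 bytes */
    for (ptrdiff_t off = -(ptrdiff_t)(n * sizeof(double)); off != 0; off += step) {
        double *px = (double *)(x_end + off);
        const double *py = (const double *)(y_end + off);
        px[0] += py[0];
        px[1] += py[1];
    }
}
```

Because the final add leaves the offset at exactly zero, the zero flag it sets can drive the branch directly, which is why no separate compare is needed in the assembly version.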
movapd (%rdi,%rax), %xmm0 # temp = x[i]
addpd (%rsi,%rax), %xmm0 # temp += y[i]
movapd %xmm0, (%rdi,%rax) # x[i] = temp
addq $16, %rax # i += 2
cmpq $80000, %rax # if (i < SIZE)
jne .L3 # then loop
The rax register represents the i variable, but stored as a byte index; rdi is &x[0] and rsi is &y[0]. Each pass through the loop adds two doubles, hence the increment of rax by 2 * sizeof(double), or 16 bytes.