Performance difference between Windows and Linux with the Intel compiler: looking at the assembly

Question

I am running a program on both Windows and Linux (x86-64). It has been compiled with the same compiler (Intel Parallel Studio XE 2017) and the same options, yet the Windows version is 3 times faster than the Linux one. The culprit is a call to std::erf, which resolves to the Intel math library in both cases (by default it is linked dynamically on Windows and statically on Linux, but dynamic linking on Linux gives the same performance).

Here is a simple program to reproduce the problem.

#include <cmath>
#include <cstdio>

int main() {
  int n = 100000000;
  float sum = 1.0f;

  for (int k = 0; k < n; k++) {
    sum += std::erf(sum);
  }

  std::printf("%7.2f\n", sum);
}
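To put a number on the 3x gap without a profiler, the same loop can be wrapped in a wall-clock timer. The following is only a minimal sketch: the helper names erf_loop and time_erf_loop_seconds are hypothetical and not part of the original program, and std::chrono::steady_clock is just one reasonable choice of clock.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>

// Same loop as the reproducer, with the iteration count as a parameter.
// (Hypothetical helper, introduced only for this sketch.)
float erf_loop(int n) {
    assert(n >= 0);
    float sum = 1.0f;
    for (int k = 0; k < n; k++) {
        sum += std::erf(sum);  // float overload, resolved to erff in the profiles
    }
    return sum;
}

// Runs the loop once and returns the elapsed wall-clock time in seconds.
// The loop result is written to *result so the compiler cannot discard the work.
double time_erf_loop_seconds(int n, float* result) {
    const auto start = std::chrono::steady_clock::now();
    *result = erf_loop(n);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling time_erf_loop_seconds(100000000, &r) on each platform and printing both the returned time and r should give directly comparable figures, provided the same compiler options are used on both sides.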

When I profile this program using VTune, I find that the assembly differs a bit between the Windows and the Linux versions. Here is the call site (the loop) on Windows:

Block 3:
vmovaps xmm0, xmm6
call 0x1400023e0 <erff>
Block 4:
inc ebx
vaddss xmm6, xmm6, xmm0
cmp ebx, 0x5f5e100
jl 0x14000103f <Block 3>

And here is the beginning of the erf function called on Windows:

Block 1:
push rbp
sub rsp, 0x40
lea rbp, ptr [rsp+0x20]
lea rcx, ptr [rip-0xa6c81]
movd edx, xmm0
movups xmmword ptr [rbp+0x10], xmm6
movss dword ptr [rbp+0x30], xmm0
mov eax, edx
and edx, 0x7fffffff
and eax, 0x80000000
add eax, 0x3f800000
mov dword ptr [rbp], eax
movss xmm6, dword ptr [rbp]
cmp edx, 0x7f800000
...

On Linux, the code is a bit different. The call site is:

Block 3:
vmovaps %xmm1, %xmm0
vmovssl  %xmm1, (%rsp)
callq  0x400bc0 <erff>
Block 4:
inc %r12d
vmovssl  (%rsp), %xmm1
vaddss %xmm0, %xmm1, %xmm1   <-------- hotspot here
cmp $0x5f5e100, %r12d
jl 0x400b6b <Block 3>

And the beginning of the called function (erf) is:

movd %xmm0, %edx
movssl  %xmm0, -0x10(%rsp)   <-------- hotspot here
mov %edx, %eax
and $0x7fffffff, %edx
and $0x80000000, %eax
add $0x3f800000, %eax
movl  %eax, -0x18(%rsp)
movssl  -0x18(%rsp), %xmm0
cmp $0x7f800000, %edx
jnl 0x400dac <Block 8>
...

I have marked the 2 points where the time is lost on Linux.

Does anyone understand assembly well enough to explain the difference between the two versions, and why the Linux one is 3 times slower?

Answer 1

In both cases the arguments and results are passed only in registers, as per the respective calling conventions on Windows and GNU/Linux.

In the GNU/Linux variant, xmm1 is used for accumulating the sum. Since it is a call-clobbered (aka caller-saved) register, it is stored and restored in the stack frame of the caller on each call.

In the Windows variant, xmm6 is used for accumulating the sum. This register is callee-saved in the Windows calling convention (but not in the GNU/Linux one).

So, in summary, the GNU/Linux version saves/restores both xmm0 (in the callee[1]) and xmm1 (in the caller), whereas the Windows version saves/restores only xmm6 (in the callee).

[1] One would need to look at the erff implementation to figure out why.
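The effect described above can be looked at in isolation with a small stand-in for the call site. This is only a sketch under stated assumptions: opaque_half, accumulate_across_calls and the NOINLINE macro are hypothetical names introduced here, and the callee is a trivial placeholder rather than the Intel erff. The idea is simply to keep a float accumulator live across an out-of-line call and then inspect the generated assembly on each platform.

```cpp
#include <cassert>

// Keep the callee out of line so the loop contains a real call.
// __declspec(noinline) is the MSVC-style spelling; __attribute__((noinline))
// is the GCC/Clang-style spelling.
#if defined(_MSC_VER)
#define NOINLINE __declspec(noinline)
#else
#define NOINLINE __attribute__((noinline))
#endif

// Hypothetical stand-in for erff: any out-of-line float function will do.
NOINLINE float opaque_half(float x) {
    return 0.5f * x;
}

// Mirrors the loop in the question: 'sum' must survive every call.
// On x86-64 System V all xmm registers are call-clobbered, so the caller may
// have to spill 'sum' around the call. On Windows x64, xmm6..xmm15 are
// callee-saved, so the caller can keep 'sum' in such a register.
float accumulate_across_calls(int n) {
    assert(n >= 0);
    float sum = 1.0f;
    for (int k = 0; k < n; ++k) {
        sum += opaque_half(sum);
    }
    return sum;
}
```

Compiling this to assembly and comparing the loop bodies should show whether the accumulator is spilled around the call. An optimizer that can see the callee's body may avoid the spill even when it does not inline, so moving opaque_half to a separate translation unit is the more reliable setup.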

Answer 2

Using Visual Studio 2015 on Win 7 in 64-bit mode, I find the following code for some of the paths used in erf() (not all paths are shown). Each path involves up to 8 constants read from memory (maybe more for other paths), so a single store/load to save a register seems unlikely to result in a 3x speed difference between Linux and Windows. As for saves/restores, this example saves and restores xmm6 and xmm7. As for the time, the program in the original post takes about 0.86 seconds on an Intel 3770K (3.5 GHz CPU) (VS2015 / Win 7 64-bit).

Update: I later determined that the overhead for a save and restore of an xmm register is about 0.03 seconds over the program's 10^8 loop iterations (about 0.3 nanoseconds per iteration).

000007FEEE25CF90  mov         rax,rsp  
000007FEEE25CF93  movss       dword ptr [rax+8],xmm0  
000007FEEE25CF98  sub         rsp,48h  
000007FEEE25CF9C  movaps      xmmword ptr [rax-18h],xmm6  
000007FEEE25CFA0  lea         rcx,[rax+8]  
000007FEEE25CFA4  movaps      xmmword ptr [rax-28h],xmm7  
000007FEEE25CFA8  movaps      xmm6,xmm0  
000007FEEE25CFAB  call        000007FEEE266370  
000007FEEE25CFB0  movsx       ecx,ax  
000007FEEE25CFB3  test        ecx,ecx  
000007FEEE25CFB5  je          000007FEEE25D0AF  
000007FEEE25CFBB  sub         ecx,1  
000007FEEE25CFBE  je          000007FEEE25D08F  
000007FEEE25CFC4  cmp         ecx,1  
000007FEEE25CFC7  je          000007FEEE25D0AF  
000007FEEE25CFCD  xorps       xmm7,xmm7  
000007FEEE25CFD0  movaps      xmm2,xmm6  
000007FEEE25CFD3  comiss      xmm7,xmm6  
000007FEEE25CFD6  jbe         000007FEEE25CFDF  
000007FEEE25CFD8  xorps       xmm2,xmmword ptr [7FEEE2991E0h]  
000007FEEE25CFDF  movss       xmm0,dword ptr [7FEEE298E50h]  
000007FEEE25CFE7  comiss      xmm0,xmm2  
000007FEEE25CFEA  jbe         000007FEEE25D053  
000007FEEE25CFEC  movaps      xmm2,xmm6  
000007FEEE25CFEF  mulss       xmm2,xmm6  
000007FEEE25CFF3  movaps      xmm0,xmm2  
000007FEEE25CFF6  movaps      xmm1,xmm2  
000007FEEE25CFF9  mulss       xmm0,dword ptr [7FEEE298B34h]  
000007FEEE25D001  mulss       xmm1,dword ptr [7FEEE298B5Ch]  
000007FEEE25D009  addss       xmm0,dword ptr [7FEEE298B8Ch]  
000007FEEE25D011  addss       xmm1,dword ptr [7FEEE298B9Ch]  
000007FEEE25D019  mulss       xmm0,xmm2  
000007FEEE25D01D  mulss       xmm1,xmm2  
000007FEEE25D021  addss       xmm0,dword ptr [7FEEE298BB8h]  
000007FEEE25D029  addss       xmm1,dword ptr [7FEEE298C88h]  
000007FEEE25D031  mulss       xmm0,xmm2  
000007FEEE25D035  mulss       xmm1,xmm2  
000007FEEE25D039  addss       xmm0,dword ptr [7FEEE298DC8h]  
000007FEEE25D041  addss       xmm1,dword ptr [7FEEE298D8Ch]  
000007FEEE25D049  divss       xmm0,xmm1  
000007FEEE25D04D  mulss       xmm0,xmm6  
000007FEEE25D051  jmp         000007FEEE25D0B2  
000007FEEE25D053  movss       xmm1,dword ptr [7FEEE299028h]  
000007FEEE25D05B  comiss      xmm1,xmm2  
000007FEEE25D05E  jbe         000007FEEE25D076  
000007FEEE25D060  movaps      xmm0,xmm2  
000007FEEE25D063  call        000007FEEE25CF04  
000007FEEE25D068  movss       xmm1,dword ptr [7FEEE298D8Ch]  
000007FEEE25D070  subss       xmm1,xmm0  
000007FEEE25D074  jmp         000007FEEE25D07E  
000007FEEE25D076  movss       xmm1,dword ptr [7FEEE298D8Ch]  
000007FEEE25D07E  comiss      xmm7,xmm6  
000007FEEE25D081  jbe         000007FEEE25D08A  
000007FEEE25D083  xorps       xmm1,xmmword ptr [7FEEE2991E0h]  
000007FEEE25D08A  movaps      xmm0,xmm1  
000007FEEE25D08D  jmp         000007FEEE25D0B2  
000007FEEE25D08F  mov         eax,8000h  
000007FEEE25D094  test        word ptr [rsp+52h],ax  
000007FEEE25D099  je          000007FEEE25D0A5  
000007FEEE25D09B  movss       xmm0,dword ptr [7FEEE2990DCh]  
000007FEEE25D0A3  jmp         000007FEEE25D0B2  
000007FEEE25D0A5  movss       xmm0,dword ptr [7FEEE298D8Ch]  
000007FEEE25D0AD  jmp         000007FEEE25D0B2  
000007FEEE25D0AF  movaps      xmm0,xmm6  
000007FEEE25D0B2  movaps      xmm6,xmmword ptr [rsp+30h]  
000007FEEE25D0B7  movaps      xmm7,xmmword ptr [rsp+20h]  
000007FEEE25D0BC  add         rsp,48h  
000007FEEE25D0C0  ret
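The total timings quoted in this answer can be converted to per-iteration costs by dividing by the 10^8 loop count. The helpers below are a sketch of that arithmetic only; the function names are hypothetical, and the 3.5 GHz figure is the nominal clock quoted above, so the cycle counts are rough estimates.

```cpp
#include <cassert>

// Converts a total run time in seconds into nanoseconds per loop iteration.
double per_iteration_ns(double total_seconds, long long iterations) {
    assert(iterations > 0);
    return total_seconds * 1e9 / static_cast<double>(iterations);
}

// Converts nanoseconds into clock cycles at a given frequency in GHz
// (1 GHz is one cycle per nanosecond).
double ns_to_cycles(double nanoseconds, double clock_ghz) {
    assert(clock_ghz > 0.0);
    return nanoseconds * clock_ghz;
}
```

With the figures from this answer, per_iteration_ns(0.86, 100000000) gives about 8.6 ns per iteration (roughly 30 cycles at 3.5 GHz), while per_iteration_ns(0.03, 100000000) gives about 0.3 ns (roughly 1 cycle) for the register save and restore.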

Answer 1: score 42, accepted, 2016-11-10 10:02:25

Answer 2: score 3, 2016-11-10 10:15:19