Difference in performance between MSVC and GCC for highly optimized matrix multiplication code

I'm seeing a big difference in performance between code compiled with MSVC (on Windows) and GCC (on Linux) for an Ivy Bridge system. The code does dense matrix multiplication. I'm getting 70% of the peak flops with GCC and only 50% with MSVC. I think I may have isolated the difference to how they each convert the following three intrinsics.

__m256 breg0 = _mm256_loadu_ps(&b[8*i]);
tmp0 = _mm256_add_ps(_mm256_mul_ps(areg0, breg0), tmp0);

GCC does this

vmovups ymm9, YMMWORD PTR [rax-256]
vmulps  ymm9, ymm0, ymm9
vaddps  ymm8, ymm8, ymm9

MSVC does this

vmulps   ymm1, ymm2, YMMWORD PTR [rax-256]
vaddps   ymm3, ymm1, ymm3

Could somebody please explain to me if and why these two solutions could give such a big difference in performance?

Despite MSVC using one fewer instruction, it ties the load to the multiply, and maybe that makes it more dependent (maybe the load can't be done out of order)? I mean, Ivy Bridge can do one AVX load, one AVX multiply, and one AVX add in one clock cycle, but this requires each operation to be independent.
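For concreteness, that peak figure works out as follows for the Xeon E5 1620 used in the benchmark below (a back-of-the-envelope sketch; the 4-core, 3.7 GHz turbo numbers come from the EDIT further down):

    // Assumed per the question: one 8-wide AVX multiply and one 8-wide AVX add
    // can issue per cycle per core, and all four cores turbo at 3.7 GHz.
    constexpr double flops_per_cycle_per_core = 8.0 /*mul*/ + 8.0 /*add*/;
    constexpr double peak_gflops = flops_per_cycle_per_core * 4 /*cores*/ * 3.7 /*GHz*/;
    static_assert(peak_gflops > 236.7 && peak_gflops < 236.9, "peak = 236.8 GFLOPS");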

Maybe the problem lies elsewhere? You can see the full assembly code for the innermost loop from both GCC and MSVC below. You can see the C++ code for the loop here: Loop unrolling to achieve maximum throughput with Ivy Bridge and Haswell

g++ -S -masm=intel matrix.cpp -O3 -mavx -fopenmp

.L4:
    vbroadcastss    ymm0, DWORD PTR [rcx+rdx*4]
    add rdx, 1
    add rax, 256
    vmovups ymm9, YMMWORD PTR [rax-256]
    vmulps  ymm9, ymm0, ymm9
    vaddps  ymm8, ymm8, ymm9
    vmovups ymm9, YMMWORD PTR [rax-224]
    vmulps  ymm9, ymm0, ymm9
    vaddps  ymm7, ymm7, ymm9
    vmovups ymm9, YMMWORD PTR [rax-192]
    vmulps  ymm9, ymm0, ymm9
    vaddps  ymm6, ymm6, ymm9
    vmovups ymm9, YMMWORD PTR [rax-160]
    vmulps  ymm9, ymm0, ymm9
    vaddps  ymm5, ymm5, ymm9
    vmovups ymm9, YMMWORD PTR [rax-128]
    vmulps  ymm9, ymm0, ymm9
    vaddps  ymm4, ymm4, ymm9
    vmovups ymm9, YMMWORD PTR [rax-96]
    vmulps  ymm9, ymm0, ymm9
    vaddps  ymm3, ymm3, ymm9
    vmovups ymm9, YMMWORD PTR [rax-64]
    vmulps  ymm9, ymm0, ymm9
    vaddps  ymm2, ymm2, ymm9
    vmovups ymm9, YMMWORD PTR [rax-32]
    cmp esi, edx
    vmulps  ymm0, ymm0, ymm9
    vaddps  ymm1, ymm1, ymm0
    jg  .L4

MSVC /FAc /O2 /openmp /arch:AVX ...

vbroadcastss ymm2, DWORD PTR [r10]    
lea  rax, QWORD PTR [rax+256]
lea  r10, QWORD PTR [r10+4] 
vmulps   ymm1, ymm2, YMMWORD PTR [rax-320]
vaddps   ymm3, ymm1, ymm3    
vmulps   ymm1, ymm2, YMMWORD PTR [rax-288]
vaddps   ymm4, ymm1, ymm4    
vmulps   ymm1, ymm2, YMMWORD PTR [rax-256]
vaddps   ymm5, ymm1, ymm5    
vmulps   ymm1, ymm2, YMMWORD PTR [rax-224]
vaddps   ymm6, ymm1, ymm6    
vmulps   ymm1, ymm2, YMMWORD PTR [rax-192]
vaddps   ymm7, ymm1, ymm7    
vmulps   ymm1, ymm2, YMMWORD PTR [rax-160]
vaddps   ymm8, ymm1, ymm8    
vmulps   ymm1, ymm2, YMMWORD PTR [rax-128]
vaddps   ymm9, ymm1, ymm9    
vmulps   ymm1, ymm2, YMMWORD PTR [rax-96]
vaddps   ymm10, ymm1, ymm10    
dec  rdx
jne  SHORT $LL3@AddDot4x4_

EDIT:

I benchmark the code by calculating the total floating point operations as 2.0*n^3, where n is the width of the square matrix, and dividing by the time measured with omp_get_wtime(). I repeat the loop several times. In the output below I repeated it 100 times.
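A sketch of that measurement (the kernel argument here is a placeholder for the multiplication routine from the linked question, not its actual signature):

    #include <omp.h>
    #include <cstdio>

    // Time `repeat` runs of an n x n matrix multiplication and report GFLOPs/s,
    // counting 2*n^3 floating point operations per multiplication.
    void benchmark(int n, int repeat, void (*kernel)(int)) {
        double t0 = omp_get_wtime();
        for (int r = 0; r < repeat; r++)
            kernel(n);
        double t1 = omp_get_wtime();
        double gflops = 2.0 * double(n) * n * n * repeat * 1e-9;
        printf("n %5d, %8.2f ms, GFLOPs/s %7.2f\n",
               n, (t1 - t0) * 1e3, gflops / (t1 - t0));
    }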

Output from MSVC2012 on an Intel Xeon E5 1620 (Ivy Bridge); the turbo frequency for all cores is 3.7 GHz

maximum GFLOPS = 236.8 = (8-wide SIMD) * (1 AVX mult + 1 AVX add) * (4 cores) * 3.7 GHz

n   64,     0.02 ms, GFLOPs   0.001, GFLOPs/s   23.88, error 0.000e+000, efficiency/core   40.34%, efficiency  10.08%, mem 0.05 MB
n  128,     0.05 ms, GFLOPs   0.004, GFLOPs/s   84.54, error 0.000e+000, efficiency/core  142.81%, efficiency  35.70%, mem 0.19 MB
n  192,     0.17 ms, GFLOPs   0.014, GFLOPs/s   85.45, error 0.000e+000, efficiency/core  144.34%, efficiency  36.09%, mem 0.42 MB
n  256,     0.29 ms, GFLOPs   0.034, GFLOPs/s  114.48, error 0.000e+000, efficiency/core  193.37%, efficiency  48.34%, mem 0.75 MB
n  320,     0.59 ms, GFLOPs   0.066, GFLOPs/s  110.50, error 0.000e+000, efficiency/core  186.66%, efficiency  46.67%, mem 1.17 MB
n  384,     1.39 ms, GFLOPs   0.113, GFLOPs/s   81.39, error 0.000e+000, efficiency/core  137.48%, efficiency  34.37%, mem 1.69 MB
n  448,     3.27 ms, GFLOPs   0.180, GFLOPs/s   55.01, error 0.000e+000, efficiency/core   92.92%, efficiency  23.23%, mem 2.30 MB
n  512,     3.60 ms, GFLOPs   0.268, GFLOPs/s   74.63, error 0.000e+000, efficiency/core  126.07%, efficiency  31.52%, mem 3.00 MB
n  576,     3.93 ms, GFLOPs   0.382, GFLOPs/s   97.24, error 0.000e+000, efficiency/core  164.26%, efficiency  41.07%, mem 3.80 MB
n  640,     5.21 ms, GFLOPs   0.524, GFLOPs/s  100.60, error 0.000e+000, efficiency/core  169.93%, efficiency  42.48%, mem 4.69 MB
n  704,     6.73 ms, GFLOPs   0.698, GFLOPs/s  103.63, error 0.000e+000, efficiency/core  175.04%, efficiency  43.76%, mem 5.67 MB
n  768,     8.55 ms, GFLOPs   0.906, GFLOPs/s  105.95, error 0.000e+000, efficiency/core  178.98%, efficiency  44.74%, mem 6.75 MB
n  832,    10.89 ms, GFLOPs   1.152, GFLOPs/s  105.76, error 0.000e+000, efficiency/core  178.65%, efficiency  44.66%, mem 7.92 MB
n  896,    13.26 ms, GFLOPs   1.439, GFLOPs/s  108.48, error 0.000e+000, efficiency/core  183.25%, efficiency  45.81%, mem 9.19 MB
n  960,    16.36 ms, GFLOPs   1.769, GFLOPs/s  108.16, error 0.000e+000, efficiency/core  182.70%, efficiency  45.67%, mem 10.55 MB
n 1024,    17.74 ms, GFLOPs   2.147, GFLOPs/s  121.05, error 0.000e+000, efficiency/core  204.47%, efficiency  51.12%, mem 12.00 MB

Since we've covered the alignment issue, I would guess it's this: http://en.wikipedia.org/wiki/Out-of-order_execution

Since g++ issues a standalone load instruction, your processor can reorder the instructions to prefetch the next data that will be needed while also adding and multiplying. MSVC throwing a pointer at mul makes the load and mul tied to the same instruction, so changing the execution order of the instructions doesn't help anything.

EDIT: Intel's server(s) with all the docs are less angry today, so here's more research on why out-of-order execution is (part of) the answer.

First of all, it looks like your comment is completely right about it being possible for the MSVC version of the multiplication instruction to decode to separate µ-ops that can be optimized by a CPU's out-of-order engine. The fun part here is that modern microcode sequencers are programmable, so the actual behavior is both hardware and firmware dependent. The differences in the generated assembly seem to come from GCC and MSVC each trying to fight different potential bottlenecks. The GCC version tries to give leeway to the out-of-order engine (as we've already covered). The MSVC version, however, ends up taking advantage of a feature called "micro-op fusion". This is because of the µ-op retirement limitations: the end of the pipeline can only retire 3 µ-ops per tick. Micro-op fusion, in specific cases, takes two µ-ops that must be done on two different execution units (i.e. a memory read and arithmetic) and ties them to a single µ-op for most of the pipeline. The fused µ-op is only split into the two real µ-ops right before execution unit assignment. After execution, the ops are fused again, allowing them to be retired as one.

The out-of-order engine only sees the fused µ-op, so it can't pull the load op away from the multiplication. This causes the pipeline to hang while waiting for the next operand to finish its bus ride.

ALL THE LINKS!!!: http://download-software.intel.com/sites/default/files/managed/71/2e/319433-017.pdf

http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

http://www.agner.org/optimize/microarchitecture.pdf

http://www.agner.org/optimize/optimizing_assembly.pdf

http://www.agner.org/optimize/instruction_tables.ods (NOTE: Excel complains that this spreadsheet is partially corrupted or otherwise sketchy, so open at your own risk. It doesn't seem to be malicious, though, and according to the rest of my research, Agner Fog is awesome. After I opted in to the Excel recovery step, I found it full of tons of great data.)

http://cs.nyu.edu/courses/fall13/CSCI-GA.3033-008/Microprocessor-Report-Sandy-Bridge-Spans-Generations-243901.pdf

http://www.syncfusion.com/Content/downloads/ebook/Assembly_Language_Succinctly.pdf


MUCH LATER EDIT: Wow, there have been some interesting updates to the discussion here. I guess I was mistaken about how much of the pipeline is actually affected by micro-op fusion. Maybe there is more perf gain than I expected from the differences in the loop condition check, where the unfused instructions allow GCC to interleave the compare and jump with the last vector load and arithmetic steps?

vmovups ymm9, YMMWORD PTR [rax-32]
cmp esi, edx
vmulps  ymm0, ymm0, ymm9
vaddps  ymm1, ymm1, ymm0
jg  .L4

I can confirm that using the GCC code in Visual Studio does indeed improve the performance. I did this by converting the GCC object file in Linux to work in Visual Studio. The efficiency went from 50% to 60% using all four cores (and from 60% to 70% for a single core).

Microsoft has removed inline assembly from 64-bit code and also broken their 64-bit disassembler, so that code can't be reassembled without modification (but the 32-bit version still works). They evidently thought intrinsics would be sufficient, but as this case shows, they are wrong.

Maybe fused instructions should be separate intrinsics?

But Microsoft is not the only one that produces less than optimal intrinsic code. If you put the code below into http://gcc.godbolt.org/ you can see what Clang, ICC, and GCC do. ICC gave even worse performance than MSVC. It is using vinsertf128 but I don't know why. I'm not sure what Clang is doing, but it looks to be closer to GCC, just in a different order (and with more code).

This explains why Agner Fog wrote, in his manual "Optimizing subroutines in assembly language", in regards to the "disadvantages of using intrinsic functions":

The compiler can modify the code or implement it in a less efficient way than the programmer intended. It may be necessary to look at the code generated by the compiler to see if it is optimized in the way the programmer intended.

This is disappointing for the case for using intrinsics. This means one either still has to write 64-bit assembly code sometimes or find a compiler which implements the intrinsics the way the programmer intended. In this case only GCC appears to do that (and perhaps Clang).

#include <immintrin.h>
extern "C" void AddDot4x4_vec_block_8wide(const int n, const float *a, const float *b, float *c, const int stridea, const int strideb, const int stridec) {     
    const int vec_size = 8;
    __m256 tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
    // Load the 64-float output block of c into accumulator registers.
    tmp0 = _mm256_loadu_ps(&c[0*vec_size]);
    tmp1 = _mm256_loadu_ps(&c[1*vec_size]);
    tmp2 = _mm256_loadu_ps(&c[2*vec_size]);
    tmp3 = _mm256_loadu_ps(&c[3*vec_size]);
    tmp4 = _mm256_loadu_ps(&c[4*vec_size]);
    tmp5 = _mm256_loadu_ps(&c[5*vec_size]);
    tmp6 = _mm256_loadu_ps(&c[6*vec_size]);
    tmp7 = _mm256_loadu_ps(&c[7*vec_size]);

    for(int i=0; i<n; i++) {
        // Broadcast a[i], then do eight independent 8-wide mul+add updates.
        __m256 areg0 = _mm256_set1_ps(a[i]);

        __m256 breg0 = _mm256_loadu_ps(&b[vec_size*(8*i + 0)]);
        tmp0 = _mm256_add_ps(_mm256_mul_ps(areg0,breg0), tmp0);    
        __m256 breg1 = _mm256_loadu_ps(&b[vec_size*(8*i + 1)]);
        tmp1 = _mm256_add_ps(_mm256_mul_ps(areg0,breg1), tmp1);
        __m256 breg2 = _mm256_loadu_ps(&b[vec_size*(8*i + 2)]);
        tmp2 = _mm256_add_ps(_mm256_mul_ps(areg0,breg2), tmp2);    
        __m256 breg3 = _mm256_loadu_ps(&b[vec_size*(8*i + 3)]);
        tmp3 = _mm256_add_ps(_mm256_mul_ps(areg0,breg3), tmp3);   
        __m256 breg4 = _mm256_loadu_ps(&b[vec_size*(8*i + 4)]);
        tmp4 = _mm256_add_ps(_mm256_mul_ps(areg0,breg4), tmp4);    
        __m256 breg5 = _mm256_loadu_ps(&b[vec_size*(8*i + 5)]);
        tmp5 = _mm256_add_ps(_mm256_mul_ps(areg0,breg5), tmp5);    
        __m256 breg6 = _mm256_loadu_ps(&b[vec_size*(8*i + 6)]);
        tmp6 = _mm256_add_ps(_mm256_mul_ps(areg0,breg6), tmp6);    
        __m256 breg7 = _mm256_loadu_ps(&b[vec_size*(8*i + 7)]);
        tmp7 = _mm256_add_ps(_mm256_mul_ps(areg0,breg7), tmp7);    
    }
    // Store the accumulators back to c.
    _mm256_storeu_ps(&c[0*vec_size], tmp0);
    _mm256_storeu_ps(&c[1*vec_size], tmp1);
    _mm256_storeu_ps(&c[2*vec_size], tmp2);
    _mm256_storeu_ps(&c[3*vec_size], tmp3);
    _mm256_storeu_ps(&c[4*vec_size], tmp4);
    _mm256_storeu_ps(&c[5*vec_size], tmp5);
    _mm256_storeu_ps(&c[6*vec_size], tmp6);
    _mm256_storeu_ps(&c[7*vec_size], tmp7);
}
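For reference, a minimal driver for the kernel above (assuming the layout its loads imply: iteration i accumulates a[i] times 64 consecutive floats of b into 64 floats of c; the stride arguments are unused in this snippet, so they are passed as zero):

    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 128;
        // With a and b all ones, each c[j] should end up equal to n.
        std::vector<float> a(n, 1.0f), b(64 * n, 1.0f), c(64, 0.0f);
        AddDot4x4_vec_block_8wide(n, a.data(), b.data(), c.data(), 0, 0, 0);
        printf("c[0] = %.1f (expected %d)\n", c[0], n);
        return 0;
    }

Compile it together with the kernel, e.g. g++ -O3 -mavx.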

MSVC did exactly what you asked it to. If you want a vmovups instruction emitted, use the _mm256_loadu_ps intrinsic.
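For completeness: with GCC or Clang one can pin the loaded value in a register so the compiler cannot fold the load into the multiply. This is a GNU-extension sketch (an empty inline-asm constraint, compile with -mavx); MSVC has no equivalent:

    #include <immintrin.h>

    // The empty asm forces `b` to materialize in a register (vmovups), so the
    // compiler can no longer fold the load into a vmulps memory operand.
    static inline __m256 madd_unfolded(__m256 a, const float *bp, __m256 acc) {
        __m256 b = _mm256_loadu_ps(bp);
        asm volatile("" : "+x"(b));
        return _mm256_add_ps(_mm256_mul_ps(a, b), acc);
    }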
