Wrong gcc generated assembly ordering, results in performance hit

I have the following code, which copies data from memory to a DMA buffer:

for (; likely(l > 0); l-=128)
{
    __m256i m0 = _mm256_load_si256( (__m256i*) (src) );
    __m256i m1 = _mm256_load_si256( (__m256i*) (src+32) );
    __m256i m2 = _mm256_load_si256( (__m256i*) (src+64) );
    __m256i m3 = _mm256_load_si256( (__m256i*) (src+96) );

    _mm256_stream_si256( (__m256i *) (dst), m0 );
    _mm256_stream_si256( (__m256i *) (dst+32), m1 );
    _mm256_stream_si256( (__m256i *) (dst+64), m2 );
    _mm256_stream_si256( (__m256i *) (dst+96), m3 );

    src += 128;
    dst += 128;
}

This is how the gcc assembly output looks:

405280:       c5 fd 6f 50 20          vmovdqa 0x20(%rax),%ymm2
405285:       c5 fd 6f 48 40          vmovdqa 0x40(%rax),%ymm1
40528a:       c5 fd 6f 40 60          vmovdqa 0x60(%rax),%ymm0
40528f:       c5 fd 6f 18             vmovdqa (%rax),%ymm3
405293:       48 83 e8 80             sub    $0xffffffffffffff80,%rax
405297:       c5 fd e7 52 20          vmovntdq %ymm2,0x20(%rdx)
40529c:       c5 fd e7 4a 40          vmovntdq %ymm1,0x40(%rdx)
4052a1:       c5 fd e7 42 60          vmovntdq %ymm0,0x60(%rdx)
4052a6:       c5 fd e7 1a             vmovntdq %ymm3,(%rdx)
4052aa:       48 83 ea 80             sub    $0xffffffffffffff80,%rdx
4052ae:       48 39 c8                cmp    %rcx,%rax
4052b1:       75 cd                   jne    405280 <sender_body+0x6e0>

Note the reordering of the last vmovdqa and vmovntdq instructions. With the gcc-generated code above I am able to reach a throughput of ~10 227 571 packets per second in my application.

Next, I reordered those instructions manually in a hex editor. That means the loop now looks the following way:

405280:       c5 fd 6f 18             vmovdqa (%rax),%ymm3
405284:       c5 fd 6f 50 20          vmovdqa 0x20(%rax),%ymm2
405289:       c5 fd 6f 48 40          vmovdqa 0x40(%rax),%ymm1
40528e:       c5 fd 6f 40 60          vmovdqa 0x60(%rax),%ymm0
405293:       48 83 e8 80             sub    $0xffffffffffffff80,%rax
405297:       c5 fd e7 1a             vmovntdq %ymm3,(%rdx)
40529b:       c5 fd e7 52 20          vmovntdq %ymm2,0x20(%rdx)
4052a0:       c5 fd e7 4a 40          vmovntdq %ymm1,0x40(%rdx)
4052a5:       c5 fd e7 42 60          vmovntdq %ymm0,0x60(%rdx)
4052aa:       48 83 ea 80             sub    $0xffffffffffffff80,%rdx
4052ae:       48 39 c8                cmp    %rcx,%rax
4052b1:       75 cd                   jne    405280 <sender_body+0x6e0>

With the properly ordered instructions I get ~13 668 313 packets per second. So it is obvious that the reordering introduced by gcc reduces performance.

Have you come across this? Is this a known bug, or should I file a bug report?

Compilation flags:

-O3 -pipe -g -msse4.1 -mavx

My gcc version:

gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)

I find this problem interesting. GCC is known for producing less than optimal code, but I find it fascinating to find ways to "encourage" it to produce better code (for hottest/bottleneck code only, of course), without micro-managing too heavily. In this particular case, I looked at three "tools" I use for such situations:

  • volatile: If it is important that the memory accesses occur in a specific order, then volatile is a suitable tool. Note that it can be overkill, and will lead to a separate load every time a volatile pointer is dereferenced.

    SSE/AVX load/store intrinsics can't be used with volatile pointers, because they are functions. Using something like _mm256_load_si256((volatile __m256i *)src); implicitly converts the pointer to const __m256i *, losing the volatile qualifier.

    We can directly dereference volatile pointers, though. (load/store intrinsics are only needed when we need to tell the compiler that the data might be unaligned, or that we want a streaming store.)

     m0 = ((volatile __m256i *)src)[0];
     m1 = ((volatile __m256i *)src)[1];
     m2 = ((volatile __m256i *)src)[2];
     m3 = ((volatile __m256i *)src)[3];

    Unfortunately this doesn't help with the stores, because we want to emit streaming stores. A plain *(volatile ...)dst = tmp; won't give us what we want.

  • __asm__ __volatile__ (""); as a compiler reordering barrier.

    This is the GNU C way of writing a compiler memory barrier. It stops compile-time reordering without emitting an actual barrier instruction like mfence; the compiler cannot reorder memory accesses across this statement.

  • Using an index limit for loop structures.

    GCC is known for pretty poor register usage. Earlier versions made a lot of unnecessary moves between registers, although that is pretty minimal nowadays. However, testing on x86-64 across many versions of GCC indicates that in loops, it is better to use an index limit rather than an independent loop variable, for best results.

Combining all the above, I constructed the following function (after a few iterations):

#include <stdlib.h>
#include <immintrin.h>

#define likely(x) __builtin_expect((x), 1)
#define unlikely(x) __builtin_expect((x), 0)

void copy(void *const destination, const void *const source, const size_t bytes)
{
    __m256i       *dst = (__m256i *)destination;
    const __m256i *src = (const __m256i *)source;
    const __m256i *end = (const __m256i *)source + bytes / sizeof (__m256i);

    while (likely(src < end)) {
        const __m256i m0 = ((volatile const __m256i *)src)[0];
        const __m256i m1 = ((volatile const __m256i *)src)[1];
        const __m256i m2 = ((volatile const __m256i *)src)[2];
        const __m256i m3 = ((volatile const __m256i *)src)[3];

        _mm256_stream_si256( dst,     m0 );
        _mm256_stream_si256( dst + 1, m1 );
        _mm256_stream_si256( dst + 2, m2 );
        _mm256_stream_si256( dst + 3, m3 );

        __asm__ __volatile__ ("");

        src += 4;
        dst += 4;
    }
}

Compiling it (example.c) with GCC-4.8.4 using

gcc -std=c99 -mavx2 -march=x86-64 -mtune=generic -O2 -S example.c

yields (example.s):

        .file   "example.c"
        .text
        .p2align 4,,15
        .globl  copy
        .type   copy, @function
copy:
.LFB993:
        .cfi_startproc
        andq    $-32, %rdx
        leaq    (%rsi,%rdx), %rcx
        cmpq    %rcx, %rsi
        jnb     .L5
        movq    %rsi, %rax
        movq    %rdi, %rdx
        .p2align 4,,10
        .p2align 3
.L4:
        vmovdqa (%rax), %ymm3
        vmovdqa 32(%rax), %ymm2
        vmovdqa 64(%rax), %ymm1
        vmovdqa 96(%rax), %ymm0
        vmovntdq        %ymm3, (%rdx)
        vmovntdq        %ymm2, 32(%rdx)
        vmovntdq        %ymm1, 64(%rdx)
        vmovntdq        %ymm0, 96(%rdx)
        subq    $-128, %rax
        subq    $-128, %rdx
        cmpq    %rax, %rcx
        ja      .L4
        vzeroupper
.L5:
        ret
        .cfi_endproc
.LFE993:
        .size   copy, .-copy
        .ident  "GCC: (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4"
        .section        .note.GNU-stack,"",@progbits

The disassembly of the actual compiled (-c instead of -S) code is

0000000000000000 <copy>:
   0:   48 83 e2 e0             and    $0xffffffffffffffe0,%rdx
   4:   48 8d 0c 16             lea    (%rsi,%rdx,1),%rcx
   8:   48 39 ce                cmp    %rcx,%rsi
   b:   73 41                   jae    4e <copy+0x4e>
   d:   48 89 f0                mov    %rsi,%rax
  10:   48 89 fa                mov    %rdi,%rdx
  13:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  18:   c5 fd 6f 18             vmovdqa (%rax),%ymm3
  1c:   c5 fd 6f 50 20          vmovdqa 0x20(%rax),%ymm2
  21:   c5 fd 6f 48 40          vmovdqa 0x40(%rax),%ymm1
  26:   c5 fd 6f 40 60          vmovdqa 0x60(%rax),%ymm0
  2b:   c5 fd e7 1a             vmovntdq %ymm3,(%rdx)
  2f:   c5 fd e7 52 20          vmovntdq %ymm2,0x20(%rdx)
  34:   c5 fd e7 4a 40          vmovntdq %ymm1,0x40(%rdx)
  39:   c5 fd e7 42 60          vmovntdq %ymm0,0x60(%rdx)
  3e:   48 83 e8 80             sub    $0xffffffffffffff80,%rax
  42:   48 83 ea 80             sub    $0xffffffffffffff80,%rdx
  46:   48 39 c1                cmp    %rax,%rcx
  49:   77 cd                   ja     18 <copy+0x18>
  4b:   c5 f8 77                vzeroupper 
  4e:   c3                      retq

Without any optimizations at all, the code is completely disgusting, full of unnecessary moves, so some optimization is necessary. (The above uses -O2, which is generally the optimization level I use.)

If optimizing for size (-Os), the code looks excellent at first glance,

0000000000000000 <copy>:
   0:   48 83 e2 e0             and    $0xffffffffffffffe0,%rdx
   4:   48 01 f2                add    %rsi,%rdx
   7:   48 39 d6                cmp    %rdx,%rsi
   a:   73 30                   jae    3c <copy+0x3c>
   c:   c5 fd 6f 1e             vmovdqa (%rsi),%ymm3
  10:   c5 fd 6f 56 20          vmovdqa 0x20(%rsi),%ymm2
  15:   c5 fd 6f 4e 40          vmovdqa 0x40(%rsi),%ymm1
  1a:   c5 fd 6f 46 60          vmovdqa 0x60(%rsi),%ymm0
  1f:   c5 fd e7 1f             vmovntdq %ymm3,(%rdi)
  23:   c5 fd e7 57 20          vmovntdq %ymm2,0x20(%rdi)
  28:   c5 fd e7 4f 40          vmovntdq %ymm1,0x40(%rdi)
  2d:   c5 fd e7 47 60          vmovntdq %ymm0,0x60(%rdi)
  32:   48 83 ee 80             sub    $0xffffffffffffff80,%rsi
  36:   48 83 ef 80             sub    $0xffffffffffffff80,%rdi
  3a:   eb cb                   jmp    7 <copy+0x7>
  3c:   c3                      retq

until you notice that the last jmp is to the comparison, essentially doing a jmp, cmp, and jae at every iteration, which probably yields pretty poor results.
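
If this were the hot path, the extra jump could be avoided by writing the loop bottom-tested, as a do/while with the comparison at the end (the same shape the do_copy helper further down uses). A minimal, untested sketch of that variant, reusing the likely macro and headers from example.c above, and assuming the caller guarantees at least one full 128-byte block:

static void copy_rotated(__m256i *dst, const __m256i *src, const __m256i *const end)
{
    /* Bottom-tested loop: each iteration ends in a single cmp + conditional
       jump, instead of the jmp + cmp + jae sequence seen in the -Os output. */
    do {
        const __m256i m0 = ((volatile const __m256i *)src)[0];
        const __m256i m1 = ((volatile const __m256i *)src)[1];
        const __m256i m2 = ((volatile const __m256i *)src)[2];
        const __m256i m3 = ((volatile const __m256i *)src)[3];

        _mm256_stream_si256( dst,     m0 );
        _mm256_stream_si256( dst + 1, m1 );
        _mm256_stream_si256( dst + 2, m2 );
        _mm256_stream_si256( dst + 3, m3 );

        __asm__ __volatile__ ("");

        src += 4;
        dst += 4;
    } while (likely(src < end));
}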

Note: If you do something similar for real-world code, please do add comments (especially for the __asm__ __volatile__ ("");), and remember to periodically check with all compilers available, to make sure the code is not compiled too badly by any of them.


Looking at Peter Cordes' excellent answer, I decided to iterate the function a bit further, just for fun.

As Ross Ridge mentions in the comments, when using _mm256_load_si256() the pointer is not dereferenced (prior to being re-cast to aligned __m256i * as a parameter to the function), thus volatile won't help when using _mm256_load_si256(). In another comment, Seb suggests a workaround: _mm256_load_si256((__m256i []){ *(volatile __m256i *)(src) }), which supplies the function with a pointer to src by accessing the element via a volatile pointer and casting it to an array. For a simple aligned load, I prefer the direct volatile pointer; it matches my intent in the code. (I do aim for KISS, although often I hit only the stupid part of it.)

On x86-64, the start of the inner loop is aligned to 16 bytes, so the number of operations in the function "header" part is not really important. Still, avoiding the superfluous binary AND (masking the five least significant bits of the amount to copy in bytes) is certainly useful in general.

GCC provides two options for this. One is the __builtin_assume_aligned() built-in, which allows a programmer to convey all sorts of alignment information to the compiler. The other is typedef'ing a type that has extra attributes, here __attribute__((aligned (32))), which can be used to convey the alignedness of function parameters, for example. Both of these should be available in clang (although support is recent, not in 3.5 yet), and may be available in others such as icc (although ICC, AFAIK, uses __assume_aligned()).

One way to mitigate the register shuffling GCC does is to use a helper function. After some further iterations, I arrived at this, another.c:

#include <stdlib.h>
#include <immintrin.h>

#define likely(x)   __builtin_expect((x), 1)
#define unlikely(x) __builtin_expect((x), 0)

#if (__clang_major__+0 >= 3)
#define IS_ALIGNED(x, n) ((void *)(x))
#elif (__GNUC__+0 >= 4)
#define IS_ALIGNED(x, n) __builtin_assume_aligned((x), (n))
#else
#define IS_ALIGNED(x, n) ((void *)(x))
#endif

typedef __m256i __m256i_aligned __attribute__((aligned (32)));


void do_copy(register          __m256i_aligned *dst,
             register volatile __m256i_aligned *src,
             register          __m256i_aligned *end)
{
    do {
        register const __m256i m0 = src[0];
        register const __m256i m1 = src[1];
        register const __m256i m2 = src[2];
        register const __m256i m3 = src[3];

        __asm__ __volatile__ ("");

        _mm256_stream_si256( dst,     m0 );
        _mm256_stream_si256( dst + 1, m1 );
        _mm256_stream_si256( dst + 2, m2 );
        _mm256_stream_si256( dst + 3, m3 );

        __asm__ __volatile__ ("");

        src += 4;
        dst += 4;

    } while (likely(src < end));
}

void copy(void *dst, const void *src, const size_t bytes)
{
    if (bytes < 128)
        return;

    do_copy(IS_ALIGNED(dst, 32),
            IS_ALIGNED(src, 32),
            IS_ALIGNED((void *)((char *)src + bytes), 32));
}

which compiles with gcc -march=x86-64 -mtune=generic -mavx2 -O2 -S another.c to essentially (comments and directives omitted for brevity):

do_copy:
.L3:
        vmovdqa  (%rsi), %ymm3
        vmovdqa  32(%rsi), %ymm2
        vmovdqa  64(%rsi), %ymm1
        vmovdqa  96(%rsi), %ymm0
        vmovntdq %ymm3, (%rdi)
        vmovntdq %ymm2, 32(%rdi)
        vmovntdq %ymm1, 64(%rdi)
        vmovntdq %ymm0, 96(%rdi)
        subq     $-128, %rsi
        subq     $-128, %rdi
        cmpq     %rdx, %rsi
        jb       .L3
        vzeroupper
        ret

copy:
        cmpq     $127, %rdx
        ja       .L8
        rep ret
.L8:
        addq     %rsi, %rdx
        jmp      do_copy

Further optimization at -O3 just inlines the helper function,

do_copy:
.L3:
        vmovdqa  (%rsi), %ymm3
        vmovdqa  32(%rsi), %ymm2
        vmovdqa  64(%rsi), %ymm1
        vmovdqa  96(%rsi), %ymm0
        vmovntdq %ymm3, (%rdi)
        vmovntdq %ymm2, 32(%rdi)
        vmovntdq %ymm1, 64(%rdi)
        vmovntdq %ymm0, 96(%rdi)
        subq     $-128, %rsi
        subq     $-128, %rdi
        cmpq     %rdx, %rsi
        jb       .L3
        vzeroupper
        ret

copy:
        cmpq     $127, %rdx
        ja       .L10
        rep ret
.L10:
        leaq     (%rsi,%rdx), %rax
.L8:
        vmovdqa  (%rsi), %ymm3
        vmovdqa  32(%rsi), %ymm2
        vmovdqa  64(%rsi), %ymm1
        vmovdqa  96(%rsi), %ymm0
        vmovntdq %ymm3, (%rdi)
        vmovntdq %ymm2, 32(%rdi)
        vmovntdq %ymm1, 64(%rdi)
        vmovntdq %ymm0, 96(%rdi)
        subq     $-128, %rsi
        subq     $-128, %rdi
        cmpq     %rsi, %rax
        ja       .L8
        vzeroupper
        ret

and even with -Os the generated code is very nice,

do_copy:
.L3:
        vmovdqa  (%rsi), %ymm3
        vmovdqa  32(%rsi), %ymm2
        vmovdqa  64(%rsi), %ymm1
        vmovdqa  96(%rsi), %ymm0
        vmovntdq %ymm3, (%rdi)
        vmovntdq %ymm2, 32(%rdi)
        vmovntdq %ymm1, 64(%rdi)
        vmovntdq %ymm0, 96(%rdi)
        subq     $-128, %rsi
        subq     $-128, %rdi
        cmpq     %rdx, %rsi
        jb       .L3
        ret

copy:
        cmpq     $127, %rdx
        jbe      .L5
        addq     %rsi, %rdx
        jmp      do_copy
.L5:
        ret

Of course, without optimizations GCC-4.8.4 still produces pretty bad code. With clang-3.5 -march=x86-64 -mtune=generic -mavx2 -O2 and -Os we get essentially

do_copy:
.LBB0_1:
        vmovaps  (%rsi), %ymm0
        vmovaps  32(%rsi), %ymm1
        vmovaps  64(%rsi), %ymm2
        vmovaps  96(%rsi), %ymm3
        vmovntps %ymm0, (%rdi)
        vmovntps %ymm1, 32(%rdi)
        vmovntps %ymm2, 64(%rdi)
        vmovntps %ymm3, 96(%rdi)
        subq     $-128, %rsi
        subq     $-128, %rdi
        cmpq     %rdx, %rsi
        jb       .LBB0_1
        vzeroupper
        retq

copy:
        cmpq     $128, %rdx
        jb       .LBB1_3
        addq     %rsi, %rdx
.LBB1_2:
        vmovaps  (%rsi), %ymm0
        vmovaps  32(%rsi), %ymm1
        vmovaps  64(%rsi), %ymm2
        vmovaps  96(%rsi), %ymm3
        vmovntps %ymm0, (%rdi)
        vmovntps %ymm1, 32(%rdi)
        vmovntps %ymm2, 64(%rdi)
        vmovntps %ymm3, 96(%rdi)
        subq     $-128, %rsi
        subq     $-128, %rdi
        cmpq     %rdx, %rsi
        jb       .LBB1_2
.LBB1_3:
        vzeroupper
        retq

I like the another.c code (it suits my coding style), and I'm happy with the code generated by GCC-4.8.4 and clang-3.5 at -O1, -O2, -O3, and -Os on both, so I think it is good enough for me. (Note, however, that I haven't actually benchmarked any of this, because I don't have the relevant code. We use both temporal and non-temporal (nt) memory accesses, and cache behaviour (and cache interaction with the surrounding code) is paramount for things like this, so it would make no sense to microbenchmark this, I think.)

First of all, normal people use gcc -O3 -march=native -S and then edit the .s to test small modifications to compiler output. I hope you had fun hex-editing that change. :P You could also use Agner Fog's excellent objconv to make disassembly that can be assembled back into a binary with your choice of NASM, YASM, MASM, or AT&T syntax.


Using some of the same ideas as Nominal Animal, I made a version that compiles to similarly good asm. I'm confident about why it compiles to good code though, and I have a guess about why the ordering matters so much:

CPUs only have a few (~10?) write-combining fill buffers for NT loads / stores.

See this article about copying from video memory with streaming loads, and writing to main memory with streaming stores. It's actually faster to bounce the data through a small buffer (much smaller than L1), to avoid having the streaming loads and streaming stores compete for fill buffers (esp. with out-of-order execution). Note that using "streaming" NT loads from normal memory is not useful. As I understand it, streaming loads are only useful for I/O (including stuff like video RAM, which is mapped into the CPU's address space in an Uncacheable Software-Write-Combining (USWC) region). Main-memory RAM is mapped WB (Writeback), so the CPU is allowed to speculatively pre-fetch it and cache it, unlike USWC. Anyway, so even though I'm linking an article about using streaming loads, I'm not suggesting using streaming loads. It's just to illustrate that contention for fill buffers is almost certainly the reason that gcc's weird code causes a big problem, where it wouldn't with normal non-NT stores.
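
As an illustration of the technique that article describes (not something any answer here proposes for this DMA-buffer case), here is a hedged sketch of the bounce-buffer idea: stream-load a chunk from USWC memory into a small cache-resident buffer, then stream-store that chunk to its destination, keeping the two phases separate so the loads and stores don't compete for fill buffers. The function name, the 4 KiB chunk size, and the use of the AVX2 _mm256_stream_load_si256 intrinsic are my own choices, and the code is untested:

#include <stddef.h>
#include <immintrin.h>

#define CHUNK 4096   /* bounce buffer size: much smaller than L1d */

/* Sketch only: copy from a USWC (e.g. video-memory) source to WB memory.
   Assumes bytes is a multiple of CHUNK and both pointers are 32-byte aligned.
   Streaming loads (vmovntdqa) only help when the source really is USWC. */
static void copy_uswc_to_wb(void *dst, const void *src, size_t bytes)
{
    __m256i bounce[CHUNK / sizeof(__m256i)];   /* small enough to stay hot in cache */

    for (size_t off = 0; off < bytes; off += CHUNK) {
        const __m256i *s = (const __m256i *)((const char *)src + off);
        __m256i       *d = (__m256i *)((char *)dst + off);

        /* Phase 1: streaming loads from USWC into the cached bounce buffer. */
        for (size_t i = 0; i < CHUNK / sizeof(__m256i); i++)
            bounce[i] = _mm256_stream_load_si256((__m256i *)(s + i));

        /* Phase 2: streaming stores from the bounce buffer to the destination. */
        for (size_t i = 0; i < CHUNK / sizeof(__m256i); i++)
            _mm256_stream_si256(d + i, bounce[i]);
    }
    _mm_sfence();   /* order the NT stores with respect to later stores */
}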

Also see John McAlpin's comment at the end of this thread, as another source confirming that WC stores to multiple cache lines at once can be a big slowdown.

gcc's output for your original code (for some braindead reason I can't imagine) stored the 2nd half of the first cacheline, then both halves of the second cacheline, then the 1st half of the first cacheline. Probably sometimes the write-combining buffer for the 1st cacheline was getting flushed before both halves were written, resulting in less efficient use of external buses.

clang doesn't do any weird re-ordering with any of our 3 versions (mine, OP's, and Nominal Animal's).


Anyway, using compiler barriers that stop compiler reordering, but don't emit an actual barrier instruction, is one way to stop it. In this case, it's a way of hitting the compiler over the head and saying "stupid compiler, don't do that". I don't think you should normally need to do this everywhere, but clearly you can't trust gcc with write-combining stores (where ordering really matters). So it's probably a good idea to look at the asm, at least with the compiler you're developing with, when using NT loads and/or stores. I've reported this for gcc. Richard Biener points out that -fno-schedule-insns2 is a sort-of workaround.

Linux (the kernel) already has a barrier() macro that acts as a compiler memory barrier. It's almost certainly just a GNU asm volatile(""). Outside of Linux, you can keep using that GNU extension, or you can use the C11 stdatomic.h facilities. They're basically the same as the C++11 std::atomic facilities, with AFAIK identical semantics (thank goodness).
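
For reference, a sketch of what such a compiler-only barrier looks like in both spellings (the macro names are mine; the kernel's barrier() is typically defined much like the first form, and neither form emits any instruction):

#include <stdatomic.h>

/* GNU C idiom: an empty asm statement with a "memory" clobber tells the
   compiler that memory may have changed, so it can't move accesses across it. */
#define compiler_barrier_gnu()  __asm__ __volatile__("" ::: "memory")

/* Portable C11 spelling: a fence that only constrains the compiler. */
#define compiler_barrier_c11()  atomic_signal_fence(memory_order_seq_cst)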

I put a barrier between every store, because they're free when there's no useful reordering possible anyway. It turns out just one barrier inside the loop keeps everything nicely in order, which is what Nominal Animal's answer is doing. It doesn't actually disallow the compiler from reordering stores that don't have a barrier separating them; the compiler just chose not to. This is why I barriered between every store.


I only asked the compiler for a write barrier, because I expect only the ordering of the NT stores matters, not the loads. Even alternating load and store instructions probably wouldn't matter, since OOO execution pipelines everything anyway. (Note that the Intel copy-from-video-mem article even used mfence to avoid overlap between doing streaming stores and streaming loads.)

atomic_signal_fence doesn't directly document what all the different memory ordering options do with it. The C++ page for atomic_thread_fence is the one place on cppreference where there are examples and more on this.

This is the reason I didn't use Nominal Animal's idea of declaring src as pointer-to-volatile. gcc decides to keep the loads in the same order as the stores.


Given that, unrolling only by 2 probably won't make any throughput difference in microbenchmarks, and will save uop cache space in production. Each iteration would still do a full cache line, which seems good.

SnB-family CPUs can't micro-fuse 2-reg addressing modes, so the obvious way to minimize loop overhead (get pointers to the end of src and dst, and then count a negative index up towards zero) doesn't work. The stores wouldn't micro-fuse. You'd very quickly fill up the fill-buffers to the point where the extra uops don't matter anyway, though. That loop probably runs nowhere near 4 uops per cycle.
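
For reference, this is the loop shape being described: point at the ends of both buffers and count a negative index up towards zero, so the loop-end test can collapse to a single add + jnz. A hypothetical, untested sketch of my own, shown only to illustrate the addressing (as noted above, the indexed stores would not micro-fuse on SnB-family CPUs), assuming bytes is a non-zero multiple of 128 and both buffers are 32-byte aligned:

#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>

static void copy_negidx(void *destination, const void *source, size_t bytes)
{
    char       *dst_end = (char *)destination + bytes;
    const char *src_end = (const char *)source + bytes;

    /* i runs from -bytes up to 0; every access uses [end + i] addressing,
       so the only per-iteration bookkeeping is the index update itself. */
    for (intptr_t i = -(intptr_t)bytes; i < 0; i += 128) {
        __m256i m0 = _mm256_load_si256((const __m256i *)(src_end + i));
        __m256i m1 = _mm256_load_si256((const __m256i *)(src_end + i + 32));
        __m256i m2 = _mm256_load_si256((const __m256i *)(src_end + i + 64));
        __m256i m3 = _mm256_load_si256((const __m256i *)(src_end + i + 96));

        _mm256_stream_si256((__m256i *)(dst_end + i),      m0);
        _mm256_stream_si256((__m256i *)(dst_end + i + 32), m1);
        _mm256_stream_si256((__m256i *)(dst_end + i + 64), m2);
        _mm256_stream_si256((__m256i *)(dst_end + i + 96), m3);
    }
}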

Still, there is a way to reduce loop overhead: with my ridiculously ugly-and-unreadable-in-C hack to get the compiler to only do one sub (and a cmp/jcc) as loop overhead, no unrolling at all would make a 4-uop loop that should issue at one iteration per clock even on SnB. (Note that vmovntdq is AVX2, while vmovntps is only AVX1. Clang already uses vmovaps / vmovntps for the si256 intrinsics in this code! They have the same alignment requirement, and don't care what bits they store. It doesn't save any insn bytes, only compatibility.)


See the first paragraph for a godbolt link to this.

I guessed you were doing this inside the Linux kernel, so I put in appropriate #ifdefs; this should be correct as kernel code or when compiled for user-space.

#include <stdint.h>
#include <immintrin.h>

#ifdef __KERNEL__  // Linux has its own macro
//#define compiler_writebarrier()   __asm__ __volatile__ ("")
#define compiler_writebarrier()   barrier()
#else
// Use C11 instead of a GNU extension, for portability to other compilers
#include <stdatomic.h>
// unlike a single store-release, a release barrier is a StoreStore barrier.
// It stops all earlier writes from being delayed past all following stores
// Note that this is still only a compiler barrier, so no SFENCE is emitted,
// even though we're using NT stores.  So from another core's perspective, our
// stores can become globally out of order.
#define compiler_writebarrier()   atomic_signal_fence(memory_order_release)
// this purposely *doesn't* stop load reordering.  
// In this case gcc loads in the same order it stores, regardless.  load ordering prob. makes much less difference
#endif

void copy_pjc(void *const destination, const void *const source, const size_t bytes)
{
          __m256i *dst  = destination;
    const __m256i *src  = source;
    const __m256i *dst_endp = (destination + bytes); // clang 3.7 goes berserk with intro code with this end condition
        // but with gcc it saves an AND compared to Nominal's bytes/32:

    // const __m256i *dst_endp = dst + bytes/sizeof(*dst); // force the compiler to mask to a round number


    #ifdef __KERNEL__
    kernel_fpu_begin();  // or preferably higher in the call tree, so lots of calls are inside one pair
    #endif

    // bludgeon the compiler into generating loads with two-register addressing modes like [rdi+reg], and stores to [rdi]
    // saves one sub instruction in the loop.
    //#define ADDRESSING_MODE_HACK
    //intptr_t src_offset_from_dst = (src - dst);
    // generates clunky intro code because gcc can't assume void pointers differ by a multiple of 32

    while (dst < dst_endp)  { 
#ifdef ADDRESSING_MODE_HACK
      __m256i m0 = _mm256_load_si256( (dst + src_offset_from_dst) + 0 );
      __m256i m1 = _mm256_load_si256( (dst + src_offset_from_dst) + 1 );
      __m256i m2 = _mm256_load_si256( (dst + src_offset_from_dst) + 2 );
      __m256i m3 = _mm256_load_si256( (dst + src_offset_from_dst) + 3 );
#else
      __m256i m0 = _mm256_load_si256( src + 0 );
      __m256i m1 = _mm256_load_si256( src + 1 );
      __m256i m2 = _mm256_load_si256( src + 2 );
      __m256i m3 = _mm256_load_si256( src + 3 );
#endif

      _mm256_stream_si256( dst+0, m0 );
      compiler_writebarrier();   // even one barrier is enough to stop gcc 5.3 reordering anything
      _mm256_stream_si256( dst+1, m1 );
      compiler_writebarrier();   // but they're completely free because we are sure this store ordering is already optimal
      _mm256_stream_si256( dst+2, m2 );
      compiler_writebarrier();
      _mm256_stream_si256( dst+3, m3 );
      compiler_writebarrier();

      src += 4;
      dst += 4;
    }

  #ifdef __KERNEL__
  kernel_fpu_end();
  #endif

}

It compiles to (gcc 5.3.0 -O3 -march=haswell):

copy_pjc:
        # one insn shorter than Nominal Animal's: doesn't mask the count to a multiple of 32.
        add     rdx, rdi  # dst_endp, destination
        cmp     rdi, rdx  # dst, dst_endp
        jnb     .L7       #,
.L5:
        vmovdqa ymm3, YMMWORD PTR [rsi]   # MEM[base: src_30, offset: 0B], MEM[base: src_30, offset: 0B]
        vmovdqa ymm2, YMMWORD PTR [rsi+32]        # D.26928, MEM[base: src_30, offset: 32B]
        vmovdqa ymm1, YMMWORD PTR [rsi+64]        # D.26928, MEM[base: src_30, offset: 64B]
        vmovdqa ymm0, YMMWORD PTR [rsi+96]        # D.26928, MEM[base: src_30, offset: 96B]
        vmovntdq        YMMWORD PTR [rdi], ymm3 #* dst, MEM[base: src_30, offset: 0B]
        vmovntdq        YMMWORD PTR [rdi+32], ymm2      #, D.26928
        vmovntdq        YMMWORD PTR [rdi+64], ymm1      #, D.26928
        vmovntdq        YMMWORD PTR [rdi+96], ymm0      #, D.26928
        sub     rdi, -128 # dst,
        sub     rsi, -128 # src,
        cmp     rdx, rdi  # dst_endp, dst
        ja      .L5 #,
        vzeroupper
.L7:

Clang makes a very similar loop, but the intro is much longer: clang doesn't assume that src and dest are actually both aligned. Maybe it doesn't take advantage of the knowledge that the loads and stores will fault if not 32B-aligned? (It knows it can use ...aps instructions instead of ...dqa, so it certainly does more compiler-style optimization of intrinsics than gcc, where they more often turn directly into the relevant instruction. clang can turn a pair of left/right vector shifts into a mask from a constant, for example.)
