Bubble sort slower with -O3 than -O2 with GCC

Question

I made a bubble sort implementation in C, and was testing its performance when I noticed that the -O3 flag made it run even slower than no flags at all! Meanwhile -O2 was making it run a lot faster as expected.

Without optimisations:

time ./sort 30000

./sort 30000  1.82s user 0.00s system 99% cpu 1.816 total

-O2:

time ./sort 30000

./sort 30000  1.00s user 0.00s system 99% cpu 1.005 total

-O3:

time ./sort 30000

./sort 30000  2.01s user 0.00s system 99% cpu 2.007 total

The code:

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <time.h>

int n;

void bubblesort(int *buf)
{
    bool changed = true;
    for (int i = n; changed == true; i--) { /* will always move at least one element to its rightful place at the end, so can shorten the search by 1 each iteration */
        changed = false;

        for (int x = 0; x < i-1; x++) {
            if (buf[x] > buf[x+1]) {
                /* swap */
                int tmp = buf[x+1];
                buf[x+1] = buf[x];
                buf[x] = tmp;

                changed = true;
            }
        }
    }
}

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "Usage: %s <arraysize>\n", argv[0]);
        return EXIT_FAILURE;
    }

    n = atoi(argv[1]);
    if (n < 1) {
        fprintf(stderr, "Invalid array size.\n");
        return EXIT_FAILURE;
    }

    int *buf = malloc(sizeof(int) * n);

    /* init buffer with random values */
    srand(time(NULL));
    for (int i = 0; i < n; i++)
        buf[i] = rand() % n + 1;

    bubblesort(buf);

    return EXIT_SUCCESS;
}

The assembly language generated for -O2 (from godbolt.org):

bubblesort:
        mov     r9d, DWORD PTR n[rip]
        xor     edx, edx
        xor     r10d, r10d
.L2:
        lea     r8d, [r9-1]
        cmp     r8d, edx
        jle     .L13
.L5:
        movsx   rax, edx
        lea     rax, [rdi+rax*4]
.L4:
        mov     esi, DWORD PTR [rax]
        mov     ecx, DWORD PTR [rax+4]
        add     edx, 1
        cmp     esi, ecx
        jle     .L2
        mov     DWORD PTR [rax+4], esi
        mov     r10d, 1
        add     rax, 4
        mov     DWORD PTR [rax-4], ecx
        cmp     r8d, edx
        jg      .L4
        mov     r9d, r8d
        xor     edx, edx
        xor     r10d, r10d
        lea     r8d, [r9-1]
        cmp     r8d, edx
        jg      .L5
.L13:
        test    r10b, r10b
        jne     .L14
.L1:
        ret
.L14:
        lea     eax, [r9-2]
        cmp     r9d, 2
        jle     .L1
        mov     r9d, r8d
        xor     edx, edx
        mov     r8d, eax
        xor     r10d, r10d
        jmp     .L5

And the same for -O3:

bubblesort:
        mov     r9d, DWORD PTR n[rip]
        xor     edx, edx
        xor     r10d, r10d
.L2:
        lea     r8d, [r9-1]
        cmp     r8d, edx
        jle     .L13
.L5:
        movsx   rax, edx
        lea     rcx, [rdi+rax*4]
.L4:
        movq    xmm0, QWORD PTR [rcx]
        add     edx, 1
        pshufd  xmm2, xmm0, 0xe5
        movd    esi, xmm0
        movd    eax, xmm2
        pshufd  xmm1, xmm0, 225
        cmp     esi, eax
        jle     .L2
        movq    QWORD PTR [rcx], xmm1
        mov     r10d, 1
        add     rcx, 4
        cmp     r8d, edx
        jg      .L4
        mov     r9d, r8d
        xor     edx, edx
        xor     r10d, r10d
        lea     r8d, [r9-1]
        cmp     r8d, edx
        jg      .L5
.L13:
        test    r10b, r10b
        jne     .L14
.L1:
        ret
.L14:
        lea     eax, [r9-2]
        cmp     r9d, 2
        jle     .L1
        mov     r9d, r8d
        xor     edx, edx
        mov     r8d, eax
        xor     r10d, r10d
        jmp     .L5

It seems like the only significant difference to me is the apparent attempt to use SIMD, which seems like it should be a large improvement, but I also can't tell what on earth it's attempting with those pshufd instructions... is this just a failed attempt at SIMD? Or maybe the couple of extra instructions is just about edging out my instruction cache?

Timings were done on an AMD Ryzen 5 3600.

Answer 1

This is a regression in GCC11/12. GCC10 and earlier were doing separate dword loads, even when they merged the two stores into one qword store.

It looks like GCC's naïveté about store-forwarding stalls is hurting its auto-vectorization strategy here. See also "Store forwarding by example" for some practical benchmarks on Intel with hardware performance counters, and "What are the costs of failed store-to-load forwarding on x86?" Also Agner Fog's x86 optimization guides.

(gcc -O3 enables -ftree-vectorize and a few other options not included by -O2, e.g. if-conversion to branchless cmov, which is another way -O3 can hurt with data patterns GCC didn't expect. By comparison, Clang enables auto-vectorization even at -O2, although some of its optimizations are still only on at -O3.)

It's doing 64-bit loads (and branching to store or not) on pairs of ints. This means that if we swapped on the previous iteration, this load comes half from that store and half from fresh memory, so we get a store-forwarding stall after every swap. And bubble sort often has long chains of swapping on every iteration as an element bubbles far, so this is really bad.
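To make the access pattern concrete, here is a rough C sketch of the same paired-load strategy, written with plain integer operations rather than XMM registers. It is not GCC's output. The function name bubblesort_pairs and the explicit n parameter are made up for illustration, and it assumes a little-endian target with 32-bit int. Whether a compiler really turns each memcpy into a single 8-byte access is also an assumption.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration: one 8-byte load covers buf[x] and buf[x+1].
   Assumes little-endian and 32-bit int, so the low half of the qword is buf[x]. */
void bubblesort_pairs(int *buf, int n)
{
    bool changed = true;
    for (int i = n; changed; i--) {
        changed = false;
        for (int x = 0; x < i - 1; x++) {
            uint64_t pair;
            memcpy(&pair, &buf[x], sizeof pair);            /* 8-byte load */
            int32_t lo = (int32_t)(uint32_t)pair;           /* buf[x]   */
            int32_t hi = (int32_t)(uint32_t)(pair >> 32);   /* buf[x+1] */
            if (lo > hi) {
                pair = (pair << 32) | (pair >> 32);         /* swap the two halves */
                memcpy(&buf[x], &pair, sizeof pair);        /* 8-byte store */
                changed = true;
            }
        }
    }
}
```

After a swap at index x, the next iteration loads the 8 bytes starting at buf[x+1]. The low 4 bytes of that load were just written as the upper half of the previous 8-byte store, and the high 4 bytes have to come from cache, which is the partial overlap described above.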

(Bubble sort is bad in general, especially if implemented naively without keeping the previous iteration's second element around in a register. It can be interesting to analyze the asm details of exactly why it sucks, so it is fair enough to want to try.)

Anyway, this is pretty clearly an anti-optimization you should report on GCC Bugzilla with the "missed-optimization" keyword. Scalar loads are cheap, and store-forwarding stalls are costly. ("Can modern x86 implementations store-forward from more than one prior store?" No, nor can microarchitectures other than in-order Atom load efficiently when the load partially overlaps one previous store and partially takes data that has to come from the L1d cache.)

Even better would be to keep buf[x+1] in a register and use it as buf[x] in the next iteration, avoiding a store and load. (Like good hand-written asm bubble sort examples, a few of which exist on Stack Overflow.)
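In C, that idea might look something like the sketch below. The name bubblesort_carry and the n parameter are hypothetical, and unlike the code in the question this version stores to buf[x] on every step rather than only when swapping. It is one way to express the idea, not a claim about what any compiler will generate from it.

```c
#include <stdbool.h>

/* Hypothetical sketch: carry the element that is bubbling up in a local
   variable, so each inner-loop step needs only one load (buf[x+1]). */
void bubblesort_carry(int *buf, int n)
{
    bool changed = true;
    for (int i = n; changed && i > 1; i--) {
        changed = false;
        int cur = buf[0];               /* element currently bubbling up */
        for (int x = 0; x < i - 1; x++) {
            int next = buf[x + 1];      /* the only load in this step */
            if (cur > next) {
                buf[x] = next;          /* smaller value settles here; cur keeps moving */
                changed = true;
            } else {
                buf[x] = cur;           /* cur stops here; next becomes the carried value */
                cur = next;
            }
        }
        buf[i - 1] = cur;               /* largest of this pass lands at the end */
    }
}
```

Because the carried value never goes through memory between steps, no load depends on the store from the previous step, so there is nothing to store-forward.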

If it wasn't for the store-forwarding stalls (which AFAIK GCC doesn't know about in its cost model), this strategy might be about break-even. SSE 4.1 for a branchless pminsd / pmaxsd comparator might be interesting, but that would mean always storing, and the C source doesn't do that.
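For comparison, a scalar C version of such a min/max comparator could be sketched as follows. The name bubblesort_minmax and the n parameter are made up for illustration, and this deliberately changes the source so that it stores both elements on every step. Whether a given compiler turns the ternaries into cmov, SSE4.1 min/max instructions, or plain branches is not guaranteed.

```c
#include <stdbool.h>

/* Hypothetical sketch: branch-free compare-exchange that always stores
   both elements, unlike the conditional swap in the question's code. */
void bubblesort_minmax(int *buf, int n)
{
    bool changed = true;
    for (int i = n; changed && i > 1; i--) {
        changed = false;
        for (int x = 0; x < i - 1; x++) {
            int a = buf[x];
            int b = buf[x + 1];
            int lo = a < b ? a : b;     /* min of the pair */
            int hi = a < b ? b : a;     /* max of the pair */
            buf[x] = lo;                /* unconditional stores */
            buf[x + 1] = hi;
            changed |= (a > b);         /* still track whether anything moved */
        }
    }
}
```

The unconditional stores are the trade-off mentioned above: the data-dependent branch goes away, but every step writes memory even when the pair was already in order.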

If this strategy of double-width load had any merit, it would be better implemented with pure integer on a 64-bit machine like x86-64, where you can operate on just the low 32 bits with garbage (or valuable data) in the upper half. E.g.,

## What GCC should have done,
## if it was going to use this 64-bit load strategy at all

        movsx   rax, edx           # apparently it wasn't able to optimize away your half-width signed loop counter into pointer math
        lea     rcx, [rdi+rax*4]   # Usually not worth an extra instruction just to avoid an indexed load and indexed store, but let's keep it for easy comparison.
.L4:
        mov     rax, [rcx]       # into RAX instead of XMM0
        add     edx, 1
            #  pshufd  xmm2, xmm0, 0xe5
            #  movd    esi, xmm0
            #  movd    eax, xmm2
            #  pshufd  xmm1, xmm0, 225
        mov     rsi, rax
        rol     rax, 32   # swap halves, just like the pshufd
        cmp     esi, eax  # or eax, esi?  I didn't check which is which
        jle     .L2
        mov     QWORD PTR [rcx], rax   # conditionally store the swapped qword

(Or with BMI2 available from -march=native, rorx rsi, rax, 32 can copy-and-swap in one uop. Without BMI2, mov and rotating the original instead of the copy saves latency if running on a CPU without mov-elimination, such as Ice Lake with updated microcode.)

So total latency from load to compare is just integer load + one ALU operation (rotate), vs. XMM load -> movd. And it's fewer ALU uops. This does nothing to help with the store-forwarding stall problem, though, which is still a showstopper. This is just an integer SWAR implementation of the same strategy, replacing 2x pshufd and 2x movd r32, xmm with just mov + rol.

Actually, there's no reason to use 2x pshufd here. Even if using XMM registers, GCC could have done one shuffle that swapped the low two elements, setting up for both the store and movd. So even with XMM regs, this was sub-optimal. But clearly two different parts of GCC emitted those two pshufd instructions; one even printed the shuffle constant in hex while the other used decimal! I assume one is doing the swap and the other is just trying to get vec[1], the high element of the qword.

"slower than no flags at all"

The default is -O0, a consistent-debugging mode that spills all variables to memory after every C statement, so it's pretty horrible and creates big store-forwarding latency bottlenecks. (Somewhat like if every variable was volatile.) But it's successful store forwarding, not stalls, so "only" ~5 cycles, which is still much worse than 0 for registers. (A few modern microarchitectures, including Zen 2, have some special cases that are lower latency.) The extra store and load instructions that have to go through the pipeline don't help.

It's generally not interesting to benchmark -O0. -O1 or -Og should be your go-to baseline for the compiler to do the basic amount of optimization a normal person would expect, without anything fancy, but also without intentionally gimping the asm by skipping register allocation.

Semi-related: optimizing bubble sort for size instead of speed can involve a memory-destination rotate (creating store-forwarding stalls for back-to-back swaps), or a memory-destination xchg (implicit lock prefix -> very slow). See this Code Golf answer.
