Bubble sort slower with -O3 than -O2 with GCC
I made a bubble sort implementation in C, and was testing its performance when I noticed that the -O3 flag made it run even slower than no flags at all! Meanwhile -O2 was making it run a lot faster, as expected.
Without optimisations:
time ./sort 30000
./sort 30000 1.82s user 0.00s system 99% cpu 1.816 total
-O2:
time ./sort 30000
./sort 30000 1.00s user 0.00s system 99% cpu 1.005 total
-O3:
time ./sort 30000
./sort 30000 2.01s user 0.00s system 99% cpu 2.007 total
The code:
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <time.h>

int n;

void bubblesort(int *buf)
{
    bool changed = true;
    for (int i = n; changed == true; i--) { /* will always move at least one element to its rightful place at the end, so can shorten the search by 1 each iteration */
        changed = false;
        for (int x = 0; x < i-1; x++) {
            if (buf[x] > buf[x+1]) {
                /* swap */
                int tmp = buf[x+1];
                buf[x+1] = buf[x];
                buf[x] = tmp;
                changed = true;
            }
        }
    }
}

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "Usage: %s <arraysize>\n", argv[0]);
        return EXIT_FAILURE;
    }
    n = atoi(argv[1]);
    if (n < 1) {
        fprintf(stderr, "Invalid array size.\n");
        return EXIT_FAILURE;
    }
    int *buf = malloc(sizeof(int) * n);
    /* init buffer with random values */
    srand(time(NULL));
    for (int i = 0; i < n; i++)
        buf[i] = rand() % n + 1;
    bubblesort(buf);
    return EXIT_SUCCESS;
}
The assembly language generated for -O2 (from godbolt.org):
bubblesort:
mov r9d, DWORD PTR n[rip]
xor edx, edx
xor r10d, r10d
.L2:
lea r8d, [r9-1]
cmp r8d, edx
jle .L13
.L5:
movsx rax, edx
lea rax, [rdi+rax*4]
.L4:
mov esi, DWORD PTR [rax]
mov ecx, DWORD PTR [rax+4]
add edx, 1
cmp esi, ecx
jle .L2
mov DWORD PTR [rax+4], esi
mov r10d, 1
add rax, 4
mov DWORD PTR [rax-4], ecx
cmp r8d, edx
jg .L4
mov r9d, r8d
xor edx, edx
xor r10d, r10d
lea r8d, [r9-1]
cmp r8d, edx
jg .L5
.L13:
test r10b, r10b
jne .L14
.L1:
ret
.L14:
lea eax, [r9-2]
cmp r9d, 2
jle .L1
mov r9d, r8d
xor edx, edx
mov r8d, eax
xor r10d, r10d
jmp .L5
And the same for -O3:
bubblesort:
mov r9d, DWORD PTR n[rip]
xor edx, edx
xor r10d, r10d
.L2:
lea r8d, [r9-1]
cmp r8d, edx
jle .L13
.L5:
movsx rax, edx
lea rcx, [rdi+rax*4]
.L4:
movq xmm0, QWORD PTR [rcx]
add edx, 1
pshufd xmm2, xmm0, 0xe5
movd esi, xmm0
movd eax, xmm2
pshufd xmm1, xmm0, 225
cmp esi, eax
jle .L2
movq QWORD PTR [rcx], xmm1
mov r10d, 1
add rcx, 4
cmp r8d, edx
jg .L4
mov r9d, r8d
xor edx, edx
xor r10d, r10d
lea r8d, [r9-1]
cmp r8d, edx
jg .L5
.L13:
test r10b, r10b
jne .L14
.L1:
ret
.L14:
lea eax, [r9-2]
cmp r9d, 2
jle .L1
mov r9d, r8d
xor edx, edx
mov r8d, eax
xor r10d, r10d
jmp .L5
It seems to me that the only significant difference is the apparent attempt to use SIMD, which seems like it should be a large improvement, but I also can't tell what on earth it's attempting with those pshufd instructions... is this just a failed attempt at SIMD? Or maybe the couple of extra instructions is just about edging out my instruction cache?
Timings were done on an AMD Ryzen 5 3600.
This is a regression in GCC11/12.
GCC10 and earlier did separate dword loads, even when they merged the pair into a single qword store.
It looks like GCC's naïveté about store-forwarding stalls is hurting its auto-vectorization strategy here. See also "Store forwarding by example" for some practical benchmarks on Intel with hardware performance counters, and "What are the costs of failed store-to-load forwarding on x86?" Also Agner Fog's x86 optimization guides.
(gcc -O3 enables -ftree-vectorize and a few other options not included by -O2, e.g. if-conversion to branchless cmov, which is another way -O3 can hurt with data patterns GCC didn't expect. By comparison, Clang enables auto-vectorization even at -O2, although some of its optimizations are still only on at -O3.)
It's doing 64-bit loads (and branching to store or not) on pairs of ints. This means that if we swapped in the previous iteration, this load comes half from that store and half from fresh memory, so we get a store-forwarding stall after every swap. And bubble sort often has long chains of swapping on every iteration as an element bubbles far, so this is really bad.
(Bubble sort is bad in general, especially if implemented naively without keeping the previous iteration's second element around in a register. It can be interesting to analyze the asm details of exactly why it sucks, so wanting to try is fair enough.)
Anyway, this is pretty clearly an anti-optimization you should report on GCC Bugzilla with the "missed-optimization" keyword. Scalar loads are cheap, and store-forwarding stalls are costly. (See "Can modern x86 implementations store-forward from more than one prior store?": no, and microarchitectures other than in-order Atom also can't efficiently do a load that partially overlaps one previous store and partially takes data that has to come from the L1d cache.)
Even better would be to keep buf[x+1] in a register and use it as buf[x] in the next iteration, avoiding a store and load. (Like good hand-written asm bubble sort examples, a few of which exist on Stack Overflow.)
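The register-carrying idea can be sketched in C as follows. This is a minimal sketch, not the code from the question: the function name bubblesort_carry and the explicit size parameter are assumptions made for illustration, and unlike the original it stores to buf[x] on every step rather than only on a swap. Each inner-loop step loads only buf[x+1], which has not been written in the current pass, so no load depends on a store made one step earlier.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical variant of the question's bubblesort(): the larger element of
 * each compared pair stays in a local variable (ideally a register) and is
 * reused as "buf[x]" in the next step, so it is never reloaded from memory. */
void bubblesort_carry(int *buf, size_t n)
{
    bool changed = true;
    for (size_t i = n; changed && i > 1; i--) {
        changed = false;
        int carried = buf[0];              /* plays the role of buf[x] */
        for (size_t x = 0; x + 1 < i; x++) {
            int next = buf[x + 1];         /* the only load per step */
            if (carried > next) {
                buf[x] = next;             /* smaller value settles at x */
                changed = true;            /* carried keeps bubbling up */
            } else {
                buf[x] = carried;          /* pair already in order */
                carried = next;
            }
        }
        buf[i - 1] = carried;              /* largest of buf[0..i-1] */
    }
}
```

Calling bubblesort_carry(buf, n) in place of bubblesort(buf) should give the same sorted result; whether the compiler actually keeps carried in a register is up to its register allocator, so checking the generated asm is still worthwhile.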
If it weren't for the store-forwarding stalls (which AFAIK GCC doesn't know about in its cost model), this strategy might be about break-even. SSE 4.1 for a branchless pmind / pmaxd comparator might be interesting, but that would mean always storing, and the C source doesn't do that.
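To make the "always storing" trade-off concrete, here is a scalar C stand-in for a min/max comparator. It is only a sketch under assumptions: the names compare_exchange and bubblesort_minmax are hypothetical, it uses plain conditional expressions rather than SSE 4.1 intrinsics, and nothing guarantees the compiler will turn them into cmov or pmind / pmaxd instead of a branch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: writes min to *first and max to *second
 * unconditionally, and reports whether the pair was out of order. */
static bool compare_exchange(int *first, int *second)
{
    int a = *first, b = *second;
    *first  = a < b ? a : b;   /* min, stored even if nothing changes */
    *second = a < b ? b : a;   /* max, stored even if nothing changes */
    return a > b;
}

/* Bubble sort built on the always-storing comparator above. */
void bubblesort_minmax(int *buf, size_t n)
{
    bool changed = true;
    for (size_t i = n; changed && i > 1; i--) {
        changed = false;
        for (size_t x = 0; x + 1 < i; x++)
            changed |= compare_exchange(&buf[x], &buf[x + 1]);
    }
}
```

The behaviour differs from the question's source only in that memory is written on every comparison, which is exactly the difference the paragraph above points out.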
If this strategy of double-width load had any merit, it would be better implemented with pure integer on a 64-bit machine like x86-64, where you can operate on just the low 32 bits with garbage (or valuable data) in the upper half. E.g.,
## What GCC should have done,
## if it was going to use this 64-bit load strategy at all
movsx rax, edx # apparently it wasn't able to optimize away your half-width signed loop counter into pointer math
lea rcx, [rdi+rax*4] # Usually not worth an extra instruction just to avoid an indexed load and indexed store, but let's keep it for easy comparison.
.L4:
mov rax, [rcx] # into RAX instead of XMM0
add edx, 1
# pshufd xmm2, xmm0, 0xe5
# movd esi, xmm0
# movd eax, xmm2
# pshufd xmm1, xmm0, 225
mov rsi, rax
rol rax, 32 # swap halves, just like the pshufd
cmp esi, eax # esi = buf[x], eax = buf[x+1], same operand order as GCC's cmp
jle .L2
mov QWORD PTR [rcx], rax # conditionally store the swapped qword
(Or with BMI2 available from -march=native, rorx rsi, rax, 32 can copy-and-swap in one uop. Without BMI2, mov and swapping the original instead of the copy saves latency if running on a CPU without mov-elimination, such as Ice Lake with updated microcode.)
So total latency from load to compare is just integer load + one ALU operation (rotate), vs. XMM load -> movd. And it's fewer ALU uops. This does nothing to help with the store-forwarding stall problem, though, which is still a showstopper. This is just an integer SWAR implementation of the same strategy, replacing 2x pshufd and 2x movd r32, xmm with just mov + rol.
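For readers who prefer C to asm, the same integer SWAR strategy might look roughly like this. It is a sketch under stated assumptions: the function name bubblesort_swar is hypothetical, it takes int32_t elements, and it assumes a little-endian target such as x86-64 so that the low half of the 64-bit value is buf[x]. It deliberately reproduces the double-width load and conditional double-width store, so it keeps the store-forwarding stall after each swap; it illustrates the strategy and is not a fix.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical SWAR bubble sort: loads a pair of 32-bit elements as one
 * 64-bit integer, compares the halves, and stores the rotated value on swap.
 * Assumes little-endian byte order (low 32 bits hold buf[x]). */
void bubblesort_swar(int32_t *buf, size_t n)
{
    bool changed = true;
    for (size_t i = n; changed && i > 1; i--) {
        changed = false;
        for (size_t x = 0; x + 1 < i; x++) {
            uint64_t pair;
            memcpy(&pair, &buf[x], sizeof pair);           /* one 64-bit load */
            int32_t lo = (int32_t)(uint32_t)pair;          /* buf[x] */
            int32_t hi = (int32_t)(uint32_t)(pair >> 32);  /* buf[x+1] */
            if (lo > hi) {
                pair = (pair << 32) | (pair >> 32);        /* rotate by 32 */
                memcpy(&buf[x], &pair, sizeof pair);       /* one 64-bit store */
                changed = true;
            }
        }
    }
}
```

The memcpy calls are the portable way to express an unaligned 64-bit access; an optimizing compiler will usually turn them into plain mov instructions, and may or may not pick rol for the rotate.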
Actually, there's no reason to use 2x pshufd here. Even if using XMM registers, GCC could have done one shuffle that swapped the low two elements, setting up for both the store and movd. So even with XMM regs, this was sub-optimal. But clearly two different parts of GCC emitted those two pshufd instructions; one even printed the shuffle constant in hex while the other used decimal! I assume one is swapping and the other is just trying to get vec[1], the high element of the qword.
"slower than no flags at all"
The default is -O0, a consistent-debugging mode that spills all variables to memory after every C statement, so it's pretty horrible and creates big store-forwarding latency bottlenecks. (Somewhat like if every variable were volatile.) But it's successful store forwarding, not stalls, so "only" ~5 cycles, which is still much worse than 0 for registers. (A few modern microarchitectures, including Zen 2, have some special cases that are lower latency.) The extra store and load instructions that have to go through the pipeline don't help.
It's generally not interesting to benchmark -O0. -O1 or -Og should be your go-to baseline for the compiler to do the basic amount of optimization a normal person would expect, without anything fancy, but also without intentionally gimping the asm by skipping register allocation.
Semi-related: optimizing bubble sort for size instead of speed can involve a memory-destination rotate (creating store-forwarding stalls for back-to-back swaps), or a memory-destination xchg (implicit lock prefix -> very slow). See this Code Golf answer.