Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?

Question

I am disassembling this code on llvm clang Apple LLVM version 8.0.0 (clang-800.0.42.1):

#include <stdio.h>

int main() {
    float a=0.151234;
    float b=0.2;
    float c=a+b;
    printf("%f", c);
}

I compiled with no -O option, but I also tried -O0 (gives the same result) and -O2 (which computes the value at compile time and stores it precomputed).

The resulting disassembly is the following (I removed the parts that are not relevant):

->  0x100000f30 <+0>:  pushq  %rbp
    0x100000f31 <+1>:  movq   %rsp, %rbp
    0x100000f34 <+4>:  subq   $0x10, %rsp
    0x100000f38 <+8>:  leaq   0x6d(%rip), %rdi       
    0x100000f3f <+15>: movss  0x5d(%rip), %xmm0           
    0x100000f47 <+23>: movss  0x59(%rip), %xmm1        
    0x100000f4f <+31>: movss  %xmm1, -0x4(%rbp)  
    0x100000f54 <+36>: movss  %xmm0, -0x8(%rbp)
    0x100000f59 <+41>: movss  -0x4(%rbp), %xmm0         
    0x100000f5e <+46>: addss  -0x8(%rbp), %xmm0
    0x100000f63 <+51>: movss  %xmm0, -0xc(%rbp)
    ...

Apparently it's doing the following:

loading the two floats into registers xmm0 and xmm1
storing them to the stack
loading one value (not the one xmm0 had earlier) from the stack into xmm0
performing the addition
storing the result back to the stack

I find it inefficient because:

Everything can be done in registers. I am not using a and b later, so it could just skip any operation involving the stack.
Even if it wanted to use the stack, it could avoid reloading xmm0 from the stack by doing the operations in a different order.

Given that the compiler is always right, why did it choose this strategy?

Answer 1

-O0 (unoptimized) is the default. It tells the compiler you want it to compile fast (short compile times), not to take extra time compiling to make efficient code.

(-O0 isn't literally no optimization; e.g. gcc will still eliminate code inside if(1 == 2){ } blocks. GCC especially, more than most other compilers, still does things like using multiplicative inverses for division at -O0, because it still transforms your C source through multiple internal representations of the logic before eventually emitting asm.)
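To check this on your own toolchain, a pair of tiny test functions is enough. This is only a sketch: the function names are made up for illustration, and what a given compiler emits at -O0 depends on its version and target, so inspect the asm (for example with gcc -O0 -S) rather than relying on the comments.

```c
#include <stdio.h>

/* Hypothetical test functions for inspecting -O0 output; not from the question. */

/* The condition is a constant expression that is always false, so GCC
   typically drops the call to puts() even at -O0. */
int dead_branch(int x)
{
    if (1 == 2) {
        puts("never printed");
    }
    return x;
}

/* Division by a compile-time constant: GCC usually emits a multiply by a
   fixed-point reciprocal plus shifts here instead of a div instruction,
   even at -O0. */
unsigned div_by_10(unsigned x)
{
    return x / 10;
}
```

There is deliberately no main: the point is to read the generated asm, not to run anything.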

Plus, "the compiler is always right" is an exaggeration even at -O3. Compilers are very good at a large scale, but minor missed optimizations are still common within single loops. Often these have very low impact, but wasted instructions (or uops) in a loop can eat up space in the out-of-order execution reordering window, and be less hyper-threading friendly when sharing a core with another thread. See "C++ code for testing the Collatz conjecture faster than hand-written assembly - why?" for more about beating the compiler in a simple specific case.

More importantly, -O0 also implies treating all variables similar to volatile for consistent debugging, i.e. so you can set a breakpoint or single-step and modify the value of a C variable, and then continue execution and have the program work the way you'd expect from your C source running on the C abstract machine. So the compiler can't do any constant-propagation or value-range simplification. (E.g. an integer that's known to be non-negative can simplify things using it, or make some if conditions always true or always false.)

(It's not quite as bad as volatile: multiple references to the same variable within one statement don't always result in multiple loads; at -O0 compilers will still optimize somewhat within a single expression.)

Compilers have to specifically anti-optimize for -O0 by storing/reloading all variables to their memory address between statements. (In C and C++, every variable has an address unless it was declared with the (now obsolete) register keyword and has never had its address taken. Optimizing away the address is possible according to the as-if rule for other variables, but isn't done at -O0.)
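One way to get a feel for the "similar to volatile" behaviour is to write the volatile version explicitly and compile it with optimization enabled (say -O2). The sketch below uses hypothetical function names; the expectation is that sum_volatile keeps a store/reload pattern resembling what -O0 produces for sum_plain, but the exact instructions will vary by compiler.

```c
/* Hypothetical illustration: names are not from the question. */

/* Plain version: at -O0 each statement stores its result and the next
   statement reloads it, roughly like the disassembly in the question. */
float sum_plain(float a, float b)
{
    float c = a + b;
    return c;
}

/* volatile forces a real store and reload for every access, so this version
   keeps that store/reload pattern even when optimization is enabled. */
float sum_volatile(float a, float b)
{
    volatile float va = a;
    volatile float vb = b;
    volatile float c = va + vb;
    return c;
}
```

Both functions compute the same value; only the amount of memory traffic in the generated asm differs.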

Unfortunately, debug-info formats can't track the location of a variable through registers, so fully consistent debugging isn't possible without this slow-and-stupid code-gen.

If you don't need this, you can compile with -Og for light optimization, without the anti-optimizations required for consistent debugging. The GCC manual recommends it for the usual edit/compile/run cycle, but you will get "optimized out" for many local variables with automatic storage when debugging. Globals and function args still usually have their actual values, at least at function boundaries.

Even worse, -O0 makes code that still works even if you use GDB's jump command to continue execution at a different source line. So each C statement has to be compiled into a fully independent block of instructions. (See "Is it possible to "jump"/"skip" in GDB debugger?")

for() loops can't be transformed into idiomatic (for asm) do{}while() loops, and there are other restrictions.
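For readers who haven't seen the do{}while() shape, here is a source-level sketch of the transformation an optimizer typically applies. The function names are hypothetical, and real compilers do this on their internal representation rather than on C source.

```c
/* Hypothetical example of the loop shape an optimizer prefers. */

/* As written in the source: the condition is tested at the top. */
int sum_for(const int *arr, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++) {
        total += arr[i];
    }
    return total;
}

/* Roughly what an optimizing compiler turns it into: one check up front to
   skip an empty loop, then a do{}while() with the condition and a single
   conditional branch at the bottom. */
int sum_do_while(const int *arr, int n)
{
    int total = 0;
    int i = 0;
    if (n > 0) {
        do {
            total += arr[i];
            i++;
        } while (i < n);
    }
    return total;
}
```

The second form needs only one branch per iteration, which is why it is the idiomatic shape in asm.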

For all the above reasons, (micro-)benchmarking un-optimized code is a huge waste of time; the results depend on silly details of how you wrote the source that don't matter when you compile with normal optimization. -O0 vs. -O3 performance is not linearly related; some code will speed up much more than other code.
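If the worry is that an optimized build will discard the work being timed, one common approach (not something the answer itself prescribes) is to store the result into a volatile object. The names below are hypothetical and the timing harness is left out; this is a minimal sketch of the idea only.

```c
#include <stddef.h>

/* Hypothetical benchmark helper: the sink name is made up for this sketch. */

/* A store to a volatile object is an observable side effect, so the compiler
   must actually perform it and therefore must compute the stored value. */
volatile unsigned long benchmark_sink;

unsigned long sum_squares(size_t n)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += (unsigned long)i * i;
    }
    return total;
}

void run_benchmark_body(size_t n)
{
    /* Storing the result keeps the call from being discarded as unused. */
    benchmark_sink = sum_squares(n);
}
```

Note that this only guarantees the final value is produced. The compiler may still simplify the loop itself, for example by replacing it with a closed-form formula or by folding it completely when n is a compile-time constant, so pass a runtime value for n.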

The bottlenecks in -O0 code will often be different from -O3, often on a loop counter that's kept in memory, creating a ~6-cycle loop-carried dependency chain. This can create interesting effects in the compiler-generated asm, like "Adding a redundant assignment speeds up code when compiled without optimization" (which are interesting from an asm perspective, but not for C).

"My benchmark optimized away otherwise" is not a valid justification for looking at the performance of -O0 code. See "C loop optimization help for final assignment" for an example and more details about the rabbit hole that tuning for -O0 is.

Getting interesting compiler output

If you want to see how the compiler adds 2 variables, write a function that takes args and returns a value. Remember you only want to look at the asm, not run it, so you don't need a main or any numeric literal values for anything that should be a runtime variable.

See also "How to remove "noise" from GCC/clang assembly output?" for more about this.

float foo(float a, float b) {
    float c=a+b;
    return c;
}

compiles with clang -O3 (on the Godbolt compiler explorer) to the expected

    addss   xmm0, xmm1
    ret

But with -O0 it spills the args to stack memory. (Godbolt uses debug info emitted by the compiler to colour-code asm instructions according to which C statement they came from. I've added line breaks to show blocks for each statement, but you can see this with colour highlighting on the Godbolt link above. This is often very handy for finding the interesting part of an inner loop in optimized compiler output.)

gcc -fverbose-asm will put comments on every line showing the operand names as C vars. In optimized code that's often an internal tmp name, but in un-optimized code it's usually an actual variable from the C source. I've manually commented the clang output because it doesn't do that.

# clang7.0 -O0  also on Godbolt
foo:
    push    rbp
    mov     rbp, rsp                  # make a traditional stack frame
    movss   DWORD PTR [rbp-20], xmm0  # spill the register args
    movss   DWORD PTR [rbp-24], xmm1  # into the red zone (below RSP)

    movss   xmm0, DWORD PTR [rbp-20]  # a
    addss   xmm0, DWORD PTR [rbp-24]  # +b
    movss   DWORD PTR [rbp-4], xmm0   # store c

    movss   xmm0, DWORD PTR [rbp-4]   # return c
    pop     rbp                       # epilogue
    ret

Fun fact: using register float c = a+b;, the return value can stay in XMM0 between statements, instead of being spilled/reloaded. The variable has no address. (I included that version of the function in the Godbolt link.)

The register keyword has no effect in optimized code (except making it an error to take a variable's address, like how const on a local stops you from accidentally modifying something). I don't recommend using it, but it's interesting to see that it does actually affect un-optimized code.
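For reference, the register variant of foo looks roughly like this. The name foo_register is hypothetical, and this is a reconstruction from the description above rather than a copy of the code behind the Godbolt link.

```c
/* Sketch of the register variant of foo; the name foo_register is
   hypothetical. */
float foo_register(float a, float b)
{
    register float c = a + b;
    /* float *p = &c;   would be rejected: c is declared register,
                        so its address can't be taken */
    return c;
}
```

Comparing the -O0 asm for this against the plain version should show whether your compiler skips the store/reload of c.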

Related:

Complex compiler output for simple constructor - every copy of a variable when passing args typically results in extra copies in the asm.
Why is this C++ wrapper class not being inlined away? - __attribute__((always_inline)) can force inlining, but doesn't optimize away the copying to create the function args, let alone optimize the function into the caller.

Solution 1
25 2018-11-18 23:34:01
