Weird assembly code for the C alloca function with optimization disabled: GCC uses DIV and IMUL by a constant 16, and shifts?

Question

I have this simple code in C:

#include <stdio.h>
#include <alloca.h>

int main()
{
    char* buffer = (char*)alloca(600);
    snprintf(buffer, 600, "Hello %d %d %d\n", 1, 2, 3);

    return 0;
}

I would expect the generated assembly for alloca to just decrement the stack pointer (one sub instruction) and maybe do some alignment (one and instruction), but the resulting assembly is very complicated and even more inefficient than you'd expect.

This is the output of objdump -d main.o on the object file produced by gcc -c (with no optimization flags, so the default -O0):

    0000000000400596 <main>:
  400596:   55                      push   %rbp
  400597:   48 89 e5                mov    %rsp,%rbp
  40059a:   48 83 ec 10             sub    $0x10,%rsp
  40059e:   b8 10 00 00 00          mov    $0x10,%eax
  4005a3:   48 83 e8 01             sub    $0x1,%rax
  4005a7:   48 05 60 02 00 00       add    $0x260,%rax
  4005ad:   b9 10 00 00 00          mov    $0x10,%ecx
  4005b2:   ba 00 00 00 00          mov    $0x0,%edx
  4005b7:   48 f7 f1                div    %rcx
  4005ba:   48 6b c0 10             imul   $0x10,%rax,%rax
  4005be:   48 29 c4                sub    %rax,%rsp
  4005c1:   48 89 e0                mov    %rsp,%rax
  4005c4:   48 83 c0 0f             add    $0xf,%rax
  4005c8:   48 c1 e8 04             shr    $0x4,%rax
  4005cc:   48 c1 e0 04             shl    $0x4,%rax
  4005d0:   48 89 45 f8             mov    %rax,-0x8(%rbp)
  4005d4:   48 8b 45 f8             mov    -0x8(%rbp),%rax
  4005d8:   41 b9 03 00 00 00       mov    $0x3,%r9d
  4005de:   41 b8 02 00 00 00       mov    $0x2,%r8d
  4005e4:   b9 01 00 00 00          mov    $0x1,%ecx
  4005e9:   ba a8 06 40 00          mov    $0x4006a8,%edx
  4005ee:   be 58 02 00 00          mov    $0x258,%esi
  4005f3:   48 89 c7                mov    %rax,%rdi
  4005f6:   b8 00 00 00 00          mov    $0x0,%eax
  4005fb:   e8 a0 fe ff ff          callq  4004a0 <snprintf@plt>
  400600:   b8 00 00 00 00          mov    $0x0,%eax
  400605:   c9                      leaveq 
  400606:   c3                      retq   
  400607:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
  40060e:   00 00

Any idea what the aim of this generated assembly code is? I'm using gcc 8.3.1.

Answer 1

There is of course the usual debug-mode / anti-optimized behaviour of compiling each C statement to a separate block, with non-register variables actually kept in memory. (See: Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?)

But yes, this goes beyond "not optimized". No sane person would expect GCC's canned sequence of instructions (or GIMPLE or RTL logic, at whatever stage it's expanded) for alloca to involve a div by a compile-time-constant power of 2, instead of a shift or just an AND. x /= 16; doesn't compile to a div if you write that yourself in C source, even with gcc -O0.

Normally GCC does compile-time evaluation of constant expressions as much as possible; for example, x = 5 * 6 won't use imul at runtime. But the point at which it expands its alloca logic must come after that, probably pretty late (after most other passes), which would explain all those missed optimizations. So it doesn't benefit from the same passes that operate on your C source logic.

It's doing 2 things:

Round the allocation size up to a multiple of 16 by computing ((16ULL - 1) + x) / 16 * 16. Here x is a constant once it is in a register: the disassembly adds 0x260 (608), i.e. the requested 600 (0x258) plus 8, which is itself already a multiple of 16. A sane compiler would at least use a right/left shift, if not optimize that to (x+15) & -16. But unfortunately GCC uses div and imul by 16, even though it's a constant power of 2.
Round the final address of the allocated space up to a multiple of 16 (even though it already was one, because RSP started 16-byte aligned and the allocation size was rounded up). It does this with ((p+15) >> 4) << 4, which is much more efficient than div/imul (especially for 64-bit operand-size on Intel before Ice Lake), but still less efficient than and $-16, %rax. And of course it's silly to do work that was already pointless.

Then of course it has to store the pointer into char* buffer.

And in the block of asm for the next statement, it reloads the pointer as an arg for snprintf (inefficiently into RAX first instead of directly into RDI, typical for gcc -O0), along with setting up the other register args.

So this sucks a lot, but is very plausibly explained by late expansion of the canned logic for alloca, after most transformation ("optimization") passes have already run. Note that -O0 doesn't literally mean "no optimization"; it just means "compile fast, and give consistent debugging".

Related:

How does gcc choose to number temporary variables from -fverbose-asm? - another discussion of that -O0 alloca asm, with the same guess about expanding it late in GIMPLE passes, or even in RTL. It also has optimized asm for alloca / snprintf, which is vastly simpler. In fact that's almost a duplicate; that question also asked about the alloca code.
doing seemingly un-needed ops (crackme) - I very lightly commented basically the same asm (for 32-bit mode), but mostly it's discussing hand-obfuscated asm.
How does GCC implement variable-length arrays? - shows the 32-bit version of this bad code, but doesn't comment on how much it sucks.

Answer 1: 4, accepted, 2021-02-20 09:18:14