Why are Rust stack frames so large?

Question

I encountered an unexpectedly early stack overflow and created the following program to test the issue:

#![feature(asm)]
#[inline(never)]
fn get_rsp() -> usize {
    let rsp: usize;
    unsafe {
        asm! {
            "mov {}, rsp",
            out(reg) rsp
        }
    }
    rsp
}

fn useless_function(x: usize) {
    if x > 0 {
        println!("{:x}", get_rsp());
        useless_function(x - 1);
    }
}

fn main() {
    useless_function(10);
}

This is get_rsp disassembled (according to cargo-asm):

tests::get_rsp:
 push    rax
 #APP
 mov     rax, rsp
 #NO_APP
 pop     rcx
 ret

I'm not sure what #APP and #NO_APP do or why rax is pushed and then popped into rcx, but it seems the function does return the stack pointer.

I was surprised to find that in debug mode, the difference between two consecutively printed rsp values was 192 bytes (!), and even in release mode it was 128 bytes. As far as I understand, all that needs to be stored for each call to useless_function is one usize and a return address, so I'd expect every stack frame to be around 16 bytes large.

I'm running this with rustc 1.46.0 on a 64-bit Windows machine.

Are my results consistent across machines? How is this explained?

It seems that the use of println! has a pretty significant effect. In an attempt to avoid that, I changed the program (thanks to @Shepmaster for the idea) to store the values in a static array:

static mut RSPS: [usize; 10] = [0; 10];

#[inline(never)]
fn useless_function(x: usize) {
    unsafe { RSPS[x] = get_rsp() };
    if x == 0 {
        return;
    }
    useless_function(x - 1);
}

fn main() {
    useless_function(9);
    println!("{:?}", unsafe { RSPS });
}

The recursion gets optimised away in release mode, but in debug mode each frame still takes 80 bytes, which is way more than I anticipated. Is this just the way stack frames work on x86? Do other languages do better? This seems a little inefficient.

Answer 1

Using formatting machinery like println! creates a number of things on the stack. Expanding the macros used in your code:

fn useless_function(x: usize) {
    if x > 0 {
        {
            ::std::io::_print(::core::fmt::Arguments::new_v1(
                &["", "\n"],
                &match (&get_rsp(),) {
                    (arg0,) => [::core::fmt::ArgumentV1::new(
                        arg0,
                        ::core::fmt::LowerHex::fmt,
                    )],
                },
            ));
        };
        useless_function(x - 1);
    }
}

I believe that those structs consume the majority of the space. As an attempt to prove that, I printed the size of the value created by format_args!, which is used by println!:

let sz = std::mem::size_of_val(&format_args!("{:x}", get_rsp()));
println!("{}", sz);

This shows that it is 48 bytes.
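
As a self-contained variant of that measurement, here is a minimal sketch that does not depend on get_rsp. The helper name format_args_size and the sample value are assumptions made for illustration. The number it prints may differ from the 48 bytes reported above, since the layout of std::fmt::Arguments is an implementation detail that can vary with compiler version and target.

```rust
use std::fmt;
use std::mem::{size_of, size_of_val};

/// Hypothetical helper: size in bytes of the value that `format_args!`
/// builds for a single `{:x}` argument.
fn format_args_size() -> usize {
    let value: usize = 0x1234; // arbitrary sample value
    // `format_args!` produces a `fmt::Arguments`, which only borrows `value`.
    let sz = size_of_val(&format_args!("{:x}", value));
    sz
}

fn main() {
    println!("size of the format_args! value: {} bytes", format_args_size());
    // The same number, asked of the type directly.
    println!("size_of::<fmt::Arguments>():    {} bytes", size_of::<fmt::Arguments>());
}
```

This only accounts for the Arguments value itself. The argument array and any spilled temporaries that println! creates take additional stack space on top of it.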

See also:

How do I see the expanded macro code that's causing my compile error?

Something like this should remove the printing from the equation, but the compiler / optimizer ignores the inline(never) hint here and inlines it anyway, resulting in the sequential values all being the same.

/// SAFETY:
/// The length of `rsp` and the value of `x` must always match
#[inline(never)]
unsafe fn useless_function(x: usize, rsp: &mut [usize]) {
    if x > 0 {
        *rsp.get_unchecked_mut(0) = get_rsp();
        useless_function(x - 1, rsp.get_unchecked_mut(1..));
    }
}

fn main() {
    unsafe {
        let mut rsp = [0; 10];
        useless_function(rsp.len(), &mut rsp);

        for w in rsp.windows(2) {
            println!("{}", w[0] - w[1]);
        }
    }
}

That said, you can make the function public and look at its assembly anyway (lightly cleaned):

playground::useless_function:
    pushq   %r15
    pushq   %r14
    pushq   %rbx
    testq   %rdi, %rdi
    je  .LBB6_3
    movq    %rsi, %r14
    movq    %rdi, %r15
    xorl    %ebx, %ebx

.LBB6_2:
    callq   playground::get_rsp
    movq    %rax, (%r14,%rbx,8)
    addq    $1, %rbx
    cmpq    %rbx, %r15
    jne .LBB6_2

.LBB6_3:
    popq    %rbx
    popq    %r14
    popq    %r15
    retq

"but in debug mode each frame still takes 80 bytes"

Compare the unoptimized assembly:

playground::useless_function:
    subq    $104, %rsp
    movq    %rdi, 80(%rsp)
    movq    %rsi, 88(%rsp)
    movq    %rdx, 96(%rsp)
    cmpq    $0, %rdi
    movq    %rdi, 56(%rsp)                  # 8-byte Spill
    movq    %rsi, 48(%rsp)                  # 8-byte Spill
    movq    %rdx, 40(%rsp)                  # 8-byte Spill
    ja  .LBB44_2
    jmp .LBB44_8

.LBB44_2:
    callq   playground::get_rsp
    movq    %rax, 32(%rsp)                  # 8-byte Spill
    xorl    %eax, %eax
    movl    %eax, %edx
    movq    48(%rsp), %rdi                  # 8-byte Reload
    movq    40(%rsp), %rsi                  # 8-byte Reload
    callq   core::slice::<impl [T]>::get_unchecked_mut
    movq    %rax, 24(%rsp)                  # 8-byte Spill
    movq    24(%rsp), %rax                  # 8-byte Reload
    movq    32(%rsp), %rcx                  # 8-byte Reload
    movq    %rcx, (%rax)
    movq    56(%rsp), %rdx                  # 8-byte Reload
    subq    $1, %rdx
    setb    %sil
    testb   $1, %sil
    movq    %rdx, 16(%rsp)                  # 8-byte Spill
    jne .LBB44_9
    movq    $1, 72(%rsp)
    movq    72(%rsp), %rdx
    movq    48(%rsp), %rdi                  # 8-byte Reload
    movq    40(%rsp), %rsi                  # 8-byte Reload
    callq   core::slice::<impl [T]>::get_unchecked_mut
    movq    %rax, 8(%rsp)                   # 8-byte Spill
    movq    %rdx, (%rsp)                    # 8-byte Spill
    movq    16(%rsp), %rdi                  # 8-byte Reload
    movq    8(%rsp), %rsi                   # 8-byte Reload
    movq    (%rsp), %rdx                    # 8-byte Reload
    callq   playground::useless_function
    jmp .LBB44_8

.LBB44_8:
    addq    $104, %rsp
    retq

.LBB44_9:
    leaq    str.0(%rip), %rdi
    leaq    .L__unnamed_7(%rip), %rdx
    movq    core::panicking::panic@GOTPCREL(%rip), %rax
    movl    $33, %esi
    callq   *%rax
    ud2
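
For readers who want to reproduce per-frame numbers on stable Rust without inline assembly, one rough approach is to record the address of a local variable in each frame instead of reading rsp. The sketch below is an illustration only, not the code used above: record_frames and frame_deltas are hypothetical names, and std::hint::black_box is used as a best-effort way to keep the local and the recursion from being optimized away. The address of a local is not the stack pointer itself, but for identical frames the distance between consecutive addresses should match the frame size.

```rust
use std::hint::black_box;

/// Hypothetical helper: pushes the address of one local per frame.
#[inline(never)]
fn record_frames(depth: usize, addrs: &mut Vec<usize>) {
    let marker = 0u8;
    // black_box forces `marker` to have a real address in this frame.
    addrs.push(black_box(&marker) as *const u8 as usize);
    if depth > 0 {
        record_frames(depth - 1, addrs);
    }
    // Using the local after the recursive call means the call is not in
    // tail position, so it cannot simply be turned into a loop.
    let _ = black_box(&marker);
}

/// Distances in bytes between the markers of consecutive frames.
fn frame_deltas(depth: usize) -> Vec<usize> {
    let mut addrs = Vec::new();
    record_frames(depth, &mut addrs);
    addrs.windows(2).map(|w| w[0].abs_diff(w[1])).collect()
}

fn main() {
    for delta in frame_deltas(9) {
        println!("{}", delta);
    }
}
```

Comparing the output of a debug build with a release build gives a quick impression of how much of the frame is spill space that only exists without optimization.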

Answer 2

This answer shows how this works in asm for an un-optimized C++ version.

This might not tell us as much about Rust as I thought; apparently Rust uses its own ABI / calling convention, so it won't have "shadow space" making its stack frames bulkier on Windows. The first version of my answer guessed that it would follow the Windows calling convention for calls to other Rust functions when targeting Windows. I've adjusted the wording, but I didn't delete it even though it's potentially not relevant to Rust.

After further research, at least in 2016 Rust's ABI happened to match the platform calling convention on Windows x64, at least if the disassembly of the debug-build binary in this random tutorial is representative of anything. heap::allocate::h80a36d45ddaa4ae3Lca in the disassembly clearly takes args in RCX and RDX (spills and reloads them to the stack), then calls another function with those args, leaving 0x20 bytes of space unused above RSP before the call, i.e. shadow space.

If nothing has changed since 2016 (easily possible), I think this answer does reflect some of what Rust does when compiling for Windows.

"The recursion gets optimised away in release mode, but in debug mode each frame still takes 80 bytes which is way more than I anticipated. Is this just the way stack frames work on x86? Do other languages do better?"

Yes, C and C++ do better: 48 or 64 bytes per stack frame on Windows, 32 on Linux.

The Windows x64 calling convention requires a caller to reserve 32 bytes of shadow space (basically unused stack-arg space above the return address) for use by the callee. But it looks like un-optimized clang builds may not take advantage of that shadow space, allocating extra space to spill local vars.

Also, the return address takes 8 bytes, and re-aligning the stack by 16 before another call takes another 8 bytes, so the minimum you can hope for is 48 bytes on Windows (unless you enable optimization; then, as you say, tail recursion easily optimizes into a loop). GCC compiling a C or C++ version of that code does achieve that.

Compiling for Linux, or any other x86-64 target that uses the x86-64 System V ABI, gcc and clang manage 32 bytes per frame for a C or C++ version: just the return address, the saved RBP, and another 16 bytes to keep alignment while making room to spill the 8-byte x. (Compiling as C or as C++ makes no difference to the asm.)

I tried GCC and clang on an un-optimized C++ version using the Windows calling convention on the Godbolt compiler explorer. To just look at the asm for useless_function, there was no need to write a main or get_rsp.

#include <stdlib.h>

#define MS_ABI __attribute__((ms_abi))   // for GNU C compilers.  Godbolt link has an ifdeffed version of this

void * RSPS[10] = {0};

MS_ABI void *get_rsp(void);
MS_ABI void useless_function(size_t x) {
    RSPS[x] = get_rsp();
    if (x == 0) {
        return;
    }
    useless_function(x - 1);
}

Un-optimized clang/LLVM does push rbp / sub rsp, 48, so a total of 64 bytes per frame (including the return address). GCC does push rbp / sub rsp, 32, for a total of only 48 bytes per frame, as predicted.

So apparently un-optimized LLVM does allocate "unneeded" space because it fails to use the shadow space allocated by the caller. If Rust used shadow space, this might explain some of why your debug-mode Rust version might use more stack space than we might expect, even with printing done outside the recursive function. (Printing uses a lot of space for locals.)

But part of that explanation must also include having some locals that take more space, e.g. perhaps for pointer locals or bounds checks? C and C++ map pretty directly to asm, with access to globals not needing any extra stack space. (Or even extra registers, when the global array can be assumed to be in the low 2GiB of virtual address space, so its address is usable as a 32-bit signed displacement in combination with other registers.)

# clang 10.0.1 -O0, for Windows x64
useless_function(unsigned long):
        push    rbp
        mov     rbp, rsp                  # set up a legacy frame pointer.
        sub     rsp, 48                   # reserve enough for shadow space (32) + 16, maintaining stack alignment.
        mov     qword ptr [rbp - 8], rcx   # spill incoming arg to newly reserved space above the shadow space
        call    get_rsp()
...

The only stack space used for locals is for x; there are no invented temporaries as part of the array access. It's just a reload of x, then mov qword ptr [8*rcx + RSPS], rax to store the function call's return value.

# GCC10.2 -O0, for Windows x64
useless_function(unsigned long):
        push    rbp
        mov     rbp, rsp
        sub     rsp, 32                   # just reserve enough for shadow space for callee
        mov     QWORD PTR [rbp+16], rcx   # spill incoming arg to our own shadow space
        call    get_rsp()
...

Without the ms_abi attribute, both GCC and clang use sub rsp, 16.
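
The per-frame totals quoted in this answer follow directly from the prologues shown. As a small worked check, the sketch below adds up the return address pushed by call, the pushed rbp, and the sub rsp amount for each case. The function name frame_bytes and the constant SLOT are assumptions made for this illustration.

```rust
/// Size in bytes of a return address or a pushed 64-bit register on x86-64.
const SLOT: usize = 8;

/// Bytes of stack consumed per recursive call: the return address pushed by
/// `call`, the registers pushed in the prologue, and the `sub rsp, N` amount.
fn frame_bytes(pushed_regs: usize, sub_rsp: usize) -> usize {
    SLOT + pushed_regs * SLOT + sub_rsp
}

fn main() {
    // Prologues taken from the listings above: push rbp, then sub rsp, N.
    let clang_win64 = frame_bytes(1, 48);
    let gcc_win64 = frame_bytes(1, 32);
    let sysv = frame_bytes(1, 16);

    println!("clang -O0, Windows x64: {} bytes", clang_win64); // 64
    println!("GCC -O0, Windows x64:   {} bytes", gcc_win64); // 48
    println!("GCC/clang -O0, SysV:    {} bytes", sysv); // 32

    // Windows minimum from the text: 32 bytes of shadow space, the return
    // address, and 8 more bytes to restore 16-byte alignment.
    let win64_minimum = 32 + SLOT + SLOT;
    println!("Windows x64 minimum:    {} bytes", win64_minimum); // 48
}
```

Every total is a multiple of 16, which is what keeps rsp 16-byte aligned at each call. In the GCC case the pushed rbp occupies the 8 bytes that would otherwise be alignment padding.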

2 answers

Answer 1
10 2020-09-22 19:17:34

Answer 2
3 accepted 2020-09-22 20:42:34
