Why are Rust stack frames so big?

Question

I encountered an unexpectedly early stack overflow and created the following program to test the issue:

#![feature(asm)]
#[inline(never)]
fn get_rsp() -> usize {
    let rsp: usize;
    unsafe {
        asm! {
            "mov {}, rsp",
            out(reg) rsp
        }
    }
    rsp
}

fn useless_function(x: usize) {
    if x > 0 {
        println!("{:x}", get_rsp());
        useless_function(x - 1);
    }
}

fn main() {
    useless_function(10);
}

This is get_rsp disassembled (according to cargo-asm ):

tests::get_rsp:
 push    rax
 #APP
 mov     rax, rsp
 #NO_APP
 pop     rcx
 ret

I'm not sure what #APP and #NO_APP do or why rax is pushed and then popped into rcx , but it seems the function does return the stack pointer.

I was surprised to find that in debug mode, the difference between two consecutively printed rsp was 192(!) and even in release mode it was 128. As far as I understand, all that needs to be stored for each call to useless_function is one usize and a return address, so I'd expect every stack frame to be around 16 bytes large.

I'm running this with rustc 1.46.0 on a 64-bit Windows machine.

Are my results consistent across machine? How is this explained?

It seems that the use of println! has a pretty significant effect. In an attempt to avoid that, I changed the program (Thanks to @Shepmaster for the idea) to store the values in a static array:

static mut RSPS: [usize; 10] = [0; 10];

#[inline(never)]
fn useless_function(x: usize) {
    unsafe { RSPS[x] = get_rsp() };
    if x == 0 {
        return;
    }
    useless_function(x - 1);
}

fn main() {
    useless_function(9);
    println!("{:?}", unsafe { RSPS });
}

The recursion gets optimised away in release mode, but in debug mode each frame still takes 80 bytes which is way more than I anticipated. Is this just the way stack frames work on x86? Do other languages do better? This seems a little inefficient.

Answer 1

Using formatting machinery like println! creates a number of things on the stack. Expanding the macros used in your code:

fn useless_function(x: usize) {
    if x > 0 {
        {
            ::std::io::_print(::core::fmt::Arguments::new_v1(
                &["", "\n"],
                &match (&get_rsp(),) {
                    (arg0,) => [::core::fmt::ArgumentV1::new(
                        arg0,
                        ::core::fmt::LowerHex::fmt,
                    )],
                },
            ));
        };
        useless_function(x - 1);
    }
}

I believe that those structs consume the majority of the space. As an attempt to prove that, I printed the size of the value created by format_args , which is used by println! :

let sz = std::mem::size_of_val(&format_args!("{:x}", get_rsp()));
println!("{}", sz);

This shows that it is 48 bytes.

See also:

How do I see the expanded macro code that's causing my compile error?

Something like this should remove the printing from the equation, but the compiler / optimizer ignores the inline(never) hint here and inlines it anyway, resulting in the sequential values all being the same.

/// SAFETY:
/// The length of `rsp` and the value of `x` must always match
#[inline(never)]
unsafe fn useless_function(x: usize, rsp: &mut [usize]) {
    if x > 0 {
        *rsp.get_unchecked_mut(0) = get_rsp();
        useless_function(x - 1, rsp.get_unchecked_mut(1..));
    }
}

fn main() {
    unsafe {
        let mut rsp = [0; 10];
        useless_function(rsp.len(), &mut rsp);

        for w in rsp.windows(2) {
            println!("{}", w[0] - w[1]);
        }
    }
}

That said, you can make the function public and look at its assembly anyway (lightly cleaned):

playground::useless_function:
    pushq   %r15
    pushq   %r14
    pushq   %rbx
    testq   %rdi, %rdi
    je  .LBB6_3
    movq    %rsi, %r14
    movq    %rdi, %r15
    xorl    %ebx, %ebx

.LBB6_2:
    callq   playground::get_rsp
    movq    %rax, (%r14,%rbx,8)
    addq    $1, %rbx
    cmpq    %rbx, %r15
    jne .LBB6_2

.LBB6_3:
    popq    %rbx
    popq    %r14
    popq    %r15
    retq

but in debug mode each frame still takes 80 bytes

Compare the unoptimized assembly:

playground::useless_function:
    subq    $104, %rsp
    movq    %rdi, 80(%rsp)
    movq    %rsi, 88(%rsp)
    movq    %rdx, 96(%rsp)
    cmpq    $0, %rdi
    movq    %rdi, 56(%rsp)                  # 8-byte Spill
    movq    %rsi, 48(%rsp)                  # 8-byte Spill
    movq    %rdx, 40(%rsp)                  # 8-byte Spill
    ja  .LBB44_2
    jmp .LBB44_8

.LBB44_2:
    callq   playground::get_rsp
    movq    %rax, 32(%rsp)                  # 8-byte Spill
    xorl    %eax, %eax
    movl    %eax, %edx
    movq    48(%rsp), %rdi                  # 8-byte Reload
    movq    40(%rsp), %rsi                  # 8-byte Reload
    callq   core::slice::<impl [T]>::get_unchecked_mut
    movq    %rax, 24(%rsp)                  # 8-byte Spill
    movq    24(%rsp), %rax                  # 8-byte Reload
    movq    32(%rsp), %rcx                  # 8-byte Reload
    movq    %rcx, (%rax)
    movq    56(%rsp), %rdx                  # 8-byte Reload
    subq    $1, %rdx
    setb    %sil
    testb   $1, %sil
    movq    %rdx, 16(%rsp)                  # 8-byte Spill
    jne .LBB44_9
    movq    $1, 72(%rsp)
    movq    72(%rsp), %rdx
    movq    48(%rsp), %rdi                  # 8-byte Reload
    movq    40(%rsp), %rsi                  # 8-byte Reload
    callq   core::slice::<impl [T]>::get_unchecked_mut
    movq    %rax, 8(%rsp)                   # 8-byte Spill
    movq    %rdx, (%rsp)                    # 8-byte Spill
    movq    16(%rsp), %rdi                  # 8-byte Reload
    movq    8(%rsp), %rsi                   # 8-byte Reload
    movq    (%rsp), %rdx                    # 8-byte Reload
    callq   playground::useless_function
    jmp .LBB44_8

.LBB44_8:
    addq    $104, %rsp
    retq

.LBB44_9:
    leaq    str.0(%rip), %rdi
    leaq    .L__unnamed_7(%rip), %rdx
    movq    core::panicking::panic@GOTPCREL(%rip), %rax
    movl    $33, %esi
    callq   *%rax
    ud2

Answer 2

This answer shows how this works in asm for an un-optimized C++ version.

This might not tell us as much as I thought about Rust; apparently Rust uses its own ABI / calling convention so it won't have "shadow space" making its stack frames bulkier on Windows. The first version of my answer guessed that it would follow the Windows calling convention for calls to other Rust functions, when targeting Windows. I've adjusted the wording, but I didn't delete it even though it's potentially not relevant to Rust.

After further research, at least in 2016 Rust's ABI happens to match the platform calling convention on Windows x64 , at least if disassembly of the debug-build binary in this random tutorial is representative of anything. heap::allocate::h80a36d45ddaa4ae3Lca in the disassembly clearly takes args in RCX and RDX, (spills and reloads them to the stack), then calls another function with those args. Leaving 0x20 bytes of space unused above RSP before the call, ie shadow space.

If nothing has changed since 2016 (easily possible), I think this answer does reflect some of what Rust does when compiling for Windows.

The recursion gets optimised away in release mode, but in debug mode each frame still takes 80 bytes which is way more than I anticipated. Is this just the way stack frames work on x86? Do other languages do better?

Yes, C and C++ do better: 48 or 64 bytes per stack frame on Windows, 32 on Linux.

The Windows x64 calling convention requires a caller to reserve 32 bytes of shadow space (basically unused stack-arg space above the return address) for use by the callee. But it looks like un-optimized clang builds may not take advantage of that shadow space, allocating extra space to spill local vars.

Also, the return address takes 8 bytes, and re-aligning the stack by 16 before another call takes another 8 bytes, so the minimum you can hope for is 48 bytes on Windows (unless you enable optimization, then as you say, tail-recursion easily optimizes into a loop). GCC compiling a C or C++ version of that code does achieve that.

Compiling for Linux, or any other x86-64 target that uses the x86-64 System V ABI, gcc and clang manage 32 bytes per frame for a C or C++ version. Just ret addr, saved RBP, and another 16 bytes to keep alignment while making room to spill 8-byte x . (Compiling as C or as C++ makes no difference to the asm).

I tried GCC and clang on an un-optimized C++ version using the Windows calling convention on the Godbolt compiler explorer . To just look at the asm for useless_function , there was no need to write a main or get_rsp .

#include <stdlib.h>

#define MS_ABI __attribute__((ms_abi))   // for GNU C compilers.  Godbolt link has an ifdeffed version of this

void * RSPS[10] = {0};

MS_ABI void *get_rsp(void);
MS_ABI void useless_function(size_t x) {
    RSPS[x] = get_rsp();
    if (x == 0) {
        return;
    }
    useless_function(x - 1);
}

clang/LLVM un-optimized does push rbp / sub rsp, 48 , so a total of 64 bytes per frame (including the return address). GCC does push / sub rsp,32 , for a total of only 48 bytes per frame, as predicted.

So apparently un-optimized LLVM does allocate "unneeded" space because it fails to use the shadow space allocated by the caller. If Rust used shadow space, this might explains some of why your debug-mode Rust version might use more stack space than we might expect, even with printing done outside the recursive function. (Printing uses a lot of space for locals).

But part of that explanation must also include having some locals that take more space, eg perhaps for pointer locals or bounds checks? C and C++ map pretty directly to asm, with access to globals not needing any extra stack space. (Or even extra registers, when the global array can be assumed to be in the low 2GiB of virtual address space, so it's address is usable as a 32-bit signed displacement in combination with other registers.)

# clang 10.0.1 -O0, for Windows x64
useless_function(unsigned long):
        push    rbp
        mov     rbp, rsp                  # set up a legacy frame pointer.
        sub     rsp, 48                   # reserve enough for shadow space (32) + 16, maintaining stack alignment.
        mov     qword ptr [rbp - 8], rcx   # spill incoming arg to newly reserved space above the shadow space
        call    get_rsp()
...

The only space for locals used on the stack is for x , no invented temporaries as part of array access. It's just a reload of x then mov qword ptr [8*rcx + RSPS], rax to store the function call return value.

# GCC10.2 -O0, for Windows x64
useless_function(unsigned long):
        push    rbp
        mov     rbp, rsp
        sub     rsp, 32                   # just reserve enough for shadow space for callee
        mov     QWORD PTR [rbp+16], rcx   # spill incoming arg to our own shadow space
        call    get_rsp()
...

Without the ms_abi attribute, both GCC and clang use sub rsp, 16 .

Why are Rust stack frames so big?

Question

2 answers

solution1
10 2020-09-22 19:17:34

solution2
3 ACCPTED 2020-09-22 20:42:34

Why are Rust stack frames so big?

Question

2 answers

solution1 10 2020-09-22 19:17:34

solution2 3 ACCPTED 2020-09-22 20:42:34

solution1
10 2020-09-22 19:17:34

solution2
3 ACCPTED 2020-09-22 20:42:34