Why are Rust stack frames so big?

Question

I encountered an unexpectedly early stack overflow and created the following program to test the issue:

#![feature(asm)]
#[inline(never)]
fn get_rsp() -> usize {
    let rsp: usize;
    unsafe {
        asm! {
            "mov {}, rsp",
            out(reg) rsp
        }
    }
    rsp
}

fn useless_function(x: usize) {
    if x > 0 {
        println!("{:x}", get_rsp());
        useless_function(x - 1);
    }
}

fn main() {
    useless_function(10);
}

This is the disassembly of get_rsp (according to cargo-asm):

tests::get_rsp:
 push    rax
 #APP
 mov     rax, rsp
 #NO_APP
 pop     rcx
 ret

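I'm not sure what #APP and #NO_APP do, or why rax is pushed and then popped into rcx, but it seems the function does return the stack pointer.

I was surprised to find that in debug mode the difference between two consecutively printed rsp values was 192(!), and even in release mode it was 128. As far as I understand, all that needs to be stored for each call to useless_function is one usize and a return address, so I'd expect each stack frame to be around 16 bytes large.

I'm running this with rustc 1.46.0 on a 64-bit Windows machine.

Are my results consistent across machines? How is this explained?

It seems that the use of println! has a pretty significant effect. In an attempt to avoid that, I changed the program (thanks to @Shepmaster for the idea) to store the values in a static array: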

static mut RSPS: [usize; 10] = [0; 10];

#[inline(never)]
fn useless_function(x: usize) {
    unsafe { RSPS[x] = get_rsp() };
    if x == 0 {
        return;
    }
    useless_function(x - 1);
}

fn main() {
    useless_function(9);
    println!("{:?}", unsafe { RSPS });
}

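The recursion gets optimized away in release mode, but in debug mode each frame still takes 80 bytes, which is way more than I anticipated. Is this just the way stack frames work on x86? Do other languages do better? This seems a little inefficient.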

Answer 1

Using the formatting machinery, as println! does, creates a number of things on the stack. Expanding the macros used in your code:

fn useless_function(x: usize) {
    if x > 0 {
        {
            ::std::io::_print(::core::fmt::Arguments::new_v1(
                &["", "\n"],
                &match (&get_rsp(),) {
                    (arg0,) => [::core::fmt::ArgumentV1::new(
                        arg0,
                        ::core::fmt::LowerHex::fmt,
                    )],
                },
            ));
        };
        useless_function(x - 1);
    }
}

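I believe that these structs take up the majority of the space. As an attempt to prove that, I printed the size of the value created by format_args, which is what println! uses: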

let sz = std::mem::size_of_val(&format_args!("{:x}", get_rsp()));
println!("{}", sz);

This shows that it is 48 bytes.
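
One plausible reading of that number, assuming fmt::Arguments held three slice references in the compiler version used (an assumption about a private layout, which can change between releases): a slice reference is a fat pointer of two machine words, so three of them would come to 48 bytes on a 64-bit target. The sketch below only reports the sizes for whatever toolchain compiles it; `report_sizes` is a name made up for illustration.

```rust
use std::fmt;
use std::mem::{size_of, size_of_val};

/// Hypothetical helper. Returns, in bytes: the machine word size, the size of
/// a slice reference, the size of the `fmt::Arguments` type, and the size of
/// one concrete `format_args!` value.
fn report_sizes() -> (usize, usize, usize, usize) {
    let word = size_of::<usize>();
    // A slice reference carries a data pointer plus a length.
    let fat_ptr = size_of::<&[&str]>();
    // Private layout: this number depends on the compiler version.
    let args_type = size_of::<fmt::Arguments<'static>>();
    let value = 0x1234usize;
    let args_value = size_of_val(&format_args!("{:x}", value));
    (word, fat_ptr, args_type, args_value)
}

fn main() {
    let (word, fat_ptr, args_type, args_value) = report_sizes();
    println!(
        "usize: {word}, &[&str]: {fat_ptr}, fmt::Arguments: {args_type}, format_args! value: {args_value}"
    );
}
```

If the printed fmt::Arguments size differs from 48 on a newer toolchain, that reflects a layout change in the standard library rather than an error in the measurement.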

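See also:

How do I see the expanded macro code that's causing my compile error?

Something like this should remove the printing from the equation, but the compiler/optimizer ignores the inline(never) hint here and inlines it anyway, resulting in the sequential values all being the same.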

/// SAFETY:
/// The length of `rsp` and the value of `x` must always match
#[inline(never)]
unsafe fn useless_function(x: usize, rsp: &mut [usize]) {
    if x > 0 {
        *rsp.get_unchecked_mut(0) = get_rsp();
        useless_function(x - 1, rsp.get_unchecked_mut(1..));
    }
}

fn main() {
    unsafe {
        let mut rsp = [0; 10];
        useless_function(rsp.len(), &mut rsp);

        for w in rsp.windows(2) {
            println!("{}", w[0] - w[1]);
        }
    }
}

That said, you can make the function public and look at its assembly (lightly cleaned):

playground::useless_function:
    pushq   %r15
    pushq   %r14
    pushq   %rbx
    testq   %rdi, %rdi
    je  .LBB6_3
    movq    %rsi, %r14
    movq    %rdi, %r15
    xorl    %ebx, %ebx

.LBB6_2:
    callq   playground::get_rsp
    movq    %rax, (%r14,%rbx,8)
    addq    $1, %rbx
    cmpq    %rbx, %r15
    jne .LBB6_2

.LBB6_3:
    popq    %rbx
    popq    %r14
    popq    %r15
    retq

but in debug mode each frame still takes 80 bytes

Compare the unoptimized assembly:

playground::useless_function:
    subq    $104, %rsp
    movq    %rdi, 80(%rsp)
    movq    %rsi, 88(%rsp)
    movq    %rdx, 96(%rsp)
    cmpq    $0, %rdi
    movq    %rdi, 56(%rsp)                  # 8-byte Spill
    movq    %rsi, 48(%rsp)                  # 8-byte Spill
    movq    %rdx, 40(%rsp)                  # 8-byte Spill
    ja  .LBB44_2
    jmp .LBB44_8

.LBB44_2:
    callq   playground::get_rsp
    movq    %rax, 32(%rsp)                  # 8-byte Spill
    xorl    %eax, %eax
    movl    %eax, %edx
    movq    48(%rsp), %rdi                  # 8-byte Reload
    movq    40(%rsp), %rsi                  # 8-byte Reload
    callq   core::slice::<impl [T]>::get_unchecked_mut
    movq    %rax, 24(%rsp)                  # 8-byte Spill
    movq    24(%rsp), %rax                  # 8-byte Reload
    movq    32(%rsp), %rcx                  # 8-byte Reload
    movq    %rcx, (%rax)
    movq    56(%rsp), %rdx                  # 8-byte Reload
    subq    $1, %rdx
    setb    %sil
    testb   $1, %sil
    movq    %rdx, 16(%rsp)                  # 8-byte Spill
    jne .LBB44_9
    movq    $1, 72(%rsp)
    movq    72(%rsp), %rdx
    movq    48(%rsp), %rdi                  # 8-byte Reload
    movq    40(%rsp), %rsi                  # 8-byte Reload
    callq   core::slice::<impl [T]>::get_unchecked_mut
    movq    %rax, 8(%rsp)                   # 8-byte Spill
    movq    %rdx, (%rsp)                    # 8-byte Spill
    movq    16(%rsp), %rdi                  # 8-byte Reload
    movq    8(%rsp), %rsi                   # 8-byte Reload
    movq    (%rsp), %rdx                    # 8-byte Reload
    callq   playground::useless_function
    jmp .LBB44_8

.LBB44_8:
    addq    $104, %rsp
    retq

.LBB44_9:
    leaq    str.0(%rip), %rdi
    leaq    .L__unnamed_7(%rip), %rdx
    movq    core::panicking::panic@GOTPCREL(%rip), %rax
    movl    $33, %esi
    callq   *%rax
    ud2
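
For readers who want to repeat the measurement without inline assembly, here is a minimal sketch that uses the address of a local variable as a stand-in for the stack pointer. This is a different technique from the get_rsp in the question and only approximates RSP; the names `approx_sp`, `recurse`, and `frame_deltas` are made up for illustration, and the optimizer remains free to reshape the recursion in release builds, so the printed deltas are indicative at best.

```rust
use std::hint::black_box;

/// Approximates the current stack position with the address of a local.
/// `black_box` keeps the local from being optimized out.
#[inline(never)]
fn approx_sp() -> usize {
    let marker = 0u8;
    black_box(&marker) as *const u8 as usize
}

/// Records one approximate stack position per recursion level.
#[inline(never)]
fn recurse(depth: usize, out: &mut Vec<usize>) {
    out.push(approx_sp());
    if depth > 0 {
        recurse(depth - 1, out);
        // Use `out` after the call so the call is not in tail position.
        let _ = black_box(&mut *out);
    }
}

/// Returns the distance in bytes between consecutive recorded positions.
fn frame_deltas(depth: usize) -> Vec<usize> {
    let mut positions = Vec::with_capacity(depth + 1);
    recurse(depth, &mut positions);
    positions.windows(2).map(|w| w[0].abs_diff(w[1])).collect()
}

fn main() {
    println!("{:?}", frame_deltas(9));
}
```

Comparing the output of a debug build with that of a release build gives a rough picture of how much of the frame consists of spill slots that optimization removes.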

Answer 2

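This answer shows how this works in the asm of an un-optimized C++ version.

This might not tell us as much about Rust as I thought; apparently Rust uses its own ABI / calling convention, so it wouldn't have the "shadow space" that makes stack frames bigger on Windows. The first version of my answer guessed that it would follow the Windows calling convention for calls to other Rust functions when targeting Windows. I've adjusted the wording, but I didn't delete it even though it might not be relevant to Rust.

On further research, at least in 2016, Rust's ABI happened to match the platform calling convention on Windows x64, at least if the disassembly of a debug-build binary in this random tutorial is representative. heap::allocate::h80a36d45ddaa4ae3Lca in the disassembly clearly takes its args in RCX and RDX (spilling them to the stack and reloading them), and then calls another function with those args. Before the call it leaves 0x20 bytes of unused space above RSP, i.e. the shadow space.

If nothing has changed since 2016 (easily possible), I think this answer does reflect some of what Rust does when compiling for Windows.

The recursion gets optimized away in release mode, but in debug mode each frame still takes 80 bytes, which is way more than I anticipated. Is this just the way stack frames work on x86? Do other languages do better?

Yes, C and C++ do better: 48 or 64 bytes per stack frame on Windows, 32 on Linux.

The Windows x64 calling convention requires the caller to reserve 32 bytes of shadow space (basically unused stack-arg space above the return address) for use by the callee. But it looks like un-optimized clang builds may not take advantage of the shadow space, and instead allocate extra space to spill local variables.

Also, the return address takes 8 bytes, and re-aligning the stack by 16 before another call takes another 8 bytes, so the minimum you can hope for on Windows is 48 bytes (unless you enable optimization, in which case, as you say, the tail recursion easily optimizes into a loop). GCC compiling a C or C++ version of this code does achieve that.

Compiling for Linux, or any other x86-64 target that uses the x86-64 System V ABI, gcc and clang manage 32 bytes per frame for a C or C++ version: just the return address, the saved RBP, and another 16 bytes to maintain alignment while making room to spill the 8-byte x. (Compiling as C or as C++ makes no difference to the asm.)

I tried GCC and clang on an un-optimized C++ version, using the Windows calling convention, on the Godbolt compiler explorer. To just look at the asm for useless_function, there's no need to write main or get_rsp.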

#include <stdlib.h>

#define MS_ABI __attribute__((ms_abi))   // for GNU C compilers.  Godbolt link has an ifdeffed version of this

void * RSPS[10] = {0};

MS_ABI void *get_rsp(void);
MS_ABI void useless_function(size_t x) {
    RSPS[x] = get_rsp();
    if (x == 0) {
        return;
    }
    useless_function(x - 1);
}

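Un-optimized clang/LLVM does push rbp / sub rsp, 48, so a total of 64 bytes per frame (including the return address). As predicted, GCC does push / sub rsp,32, for a total of only 48 bytes per frame.

So apparently un-optimized LLVM does allocate "unneeded" space, because it fails to use the shadow space allocated by the caller. If Rust uses shadow space, this probably explains part of why your debug-mode Rust version uses more stack space than we might expect, even with the printing done outside the recursive function. (Printing takes a lot of space for locals.)

But part of that explanation must also be some locals that take more space, perhaps pointer locals or bounds checking? C and C++ map very directly to asm, and accessing a global doesn't need any extra stack space. (Or even extra registers, when the global array can be assumed to be in the low 2GiB of virtual address space, so that its address is usable as a 32-bit signed displacement in combination with other registers.)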

# clang 10.0.1 -O0, for Windows x64
useless_function(unsigned long):
        push    rbp
        mov     rbp, rsp                  # set up a legacy frame pointer.
        sub     rsp, 48                   # reserve enough for shadow space (32) + 16, maintaining stack alignment.
        mov     qword ptr [rbp - 8], rcx   # spill incoming arg to newly reserved space above the shadow space
        call    get_rsp()
...

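The only space used on the stack for locals is for x; no temporaries are invented as part of the array access. It just reloads x and then uses mov qword ptr [8*rcx + RSPS], rax to store the function call's return value.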

# GCC10.2 -O0, for Windows x64
useless_function(unsigned long):
        push    rbp
        mov     rbp, rsp
        sub     rsp, 32                   # just reserve enough for shadow space for callee
        mov     QWORD PTR [rbp+16], rcx   # spill incoming arg to our own shadow space
        call    get_rsp()
...

Without the ms_abi attribute, both GCC and clang use sub rsp, 16.
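
The per-frame totals quoted in this answer can be checked with a little arithmetic. The sketch below assumes the prologue shape shown in the listings (the return address pushed by call, then push rbp, then sub rsp, N); `frame_bytes` is a hypothetical helper written for this illustration, not something taken from either compiler.

```rust
/// Bytes of stack consumed by one recursive call, assuming the prologue
/// `push rbp` followed by `sub rsp, sub_rsp`.
const fn frame_bytes(sub_rsp: usize) -> usize {
    const RET_ADDR: usize = 8; // pushed by the `call` instruction
    const SAVED_RBP: usize = 8; // pushed by `push rbp`
    RET_ADDR + SAVED_RBP + sub_rsp
}

fn main() {
    // GCC, Windows x64: `sub rsp, 32` reserves only the callee's shadow space.
    println!("GCC, ms_abi:      {} bytes", frame_bytes(32));
    // clang, Windows x64: `sub rsp, 48` is shadow space plus 16 bytes for the spill.
    println!("clang, ms_abi:    {} bytes", frame_bytes(48));
    // System V: `sub rsp, 16` makes room for x and keeps RSP 16-byte aligned.
    println!("GCC/clang, SysV:  {} bytes", frame_bytes(16));
}
```

Each total is a multiple of 16, which is what lets RSP stay 16-byte aligned at the next call.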

2 answers

Answer 1
10 2020-09-22 19:17:34

Answer 2
3 Accepted 2020-09-22 20:42:34
