
What C/C++ compiler can use push pop instructions for creating local variables, instead of just increasing esp once?

I believe push/pop instructions will result in more compact code, and may even run slightly faster. This requires disabling stack frames as well, though.

To check this, I will need to either rewrite a large enough program in assembly by hand (to compare them), or install and study a few other compilers (to see if they have an option for this, and to compare the results).

Here is the forum topic about this and similar problems.

In short, I want to understand which code is better. Code like this:

sub esp, c
mov [esp+8],eax
mov [esp+4],ecx
mov [esp],edx
...
add esp, c

or code like this:

push eax
push ecx
push edx
...
add esp, c

What compiler can produce the second kind of code? They usually produce some variation of the first one.

You're right, push is a minor missed optimization with all 4 major x86 compilers. There's some code-size, and thus indirectly some performance, to be gained. Or maybe more directly a small amount of performance in some cases, e.g. saving a sub rsp instruction.

But if you're not careful, you can make things slower with extra stack-sync uops by mixing push with [rsp+x] addressing modes. pop doesn't sound useful, just push. As the forum thread you linked suggests, you only use this for the initial store of locals; later reloads and stores should use normal addressing modes like [rsp+8]. We're not talking about trying to avoid mov loads/stores entirely, and we still want random access to the stack slots where we spilled local variables from registers!

Modern code generators avoid using PUSH. It is inefficient on today's processors because it modifies the stack pointer, which gums up a super-scalar core. (Hans Passant)

This was true 15 years ago, but compilers are once again using push when optimizing for speed, not just for code-size. Compilers already use push/pop for saving/restoring call-preserved registers they want to use, like rbx, and for pushing stack args (mostly in 32-bit mode; in 64-bit mode most args fit in registers). Both of these things could be done with mov, but compilers use push because it's more efficient than sub rsp,8 / mov [rsp], rbx. gcc has tuning options to avoid push/pop for these cases, enabled for -mtune=pentium3 and -mtune=pentium and similar old CPUs, but not for modern CPUs.

Intel since Pentium-M and AMD since Bulldozer(?) have a "stack engine" that tracks the changes to RSP with zero latency and no ALU uops, for PUSH/POP/CALL/RET. Lots of real code was still using push/pop, so CPU designers added hardware to make it efficient. Now we can use them (carefully!) when tuning for performance. See Agner Fog's microarchitecture guide and instruction tables, and his asm optimization manual. They're excellent. (And other links in the x86 tag wiki.)

It's not perfect; reading RSP directly (when the offset from the value in the out-of-order core is nonzero) does cause a stack-sync uop to be inserted on Intel CPUs. e.g. push rax / mov [rsp-8], rdi is 3 total fused-domain uops: 2 stores and one stack-sync.

On function entry, the "stack engine" is already in a non-zero-offset state (from the call in the parent), so using some push instructions before the first direct reference to RSP costs no extra uops at all. (Unless we were tailcalled from another function with jmp, and that function didn't pop anything right before the jmp.)

It's kind of funny that compilers have been using dummy push/pop instructions just to adjust the stack by 8 bytes for a while now, because it's so cheap and compact (if you're doing it once, not 10 times to allocate 80 bytes), but aren't taking advantage of it to store useful data. The stack is almost always hot in cache, and modern CPUs have excellent store/load bandwidth to L1d.


int extfunc(int *,int *);

void foo() {
    int a=1, b=2;
    extfunc(&a, &b);
}

This compiles with clang6.0 -O3 -march=haswell on the Godbolt compiler explorer. See that link for all the rest of the code, and many different missed optimizations and silly code-gen (see my comments in the C source pointing out some of them):

 # compiled for the x86-64 System V calling convention: 
 # integer args in rdi, rsi  (,rdx, rcx, r8, r9)
    push    rax               # clang / ICC ALREADY use push instead of sub rsp,8
    lea     rdi, [rsp + 4]
    mov     dword ptr [rdi], 1      # 6 bytes: opcode + modrm + imm32
    mov     rsi, rsp                # special case for lea rsi, [rsp + 0]
    mov     dword ptr [rsi], 2
    call    extfunc(int*, int*)
    pop     rax                     # and POP instead of add rsp,8
    ret

And very similar code with gcc, ICC, and MSVC, sometimes with the instructions in a different order, or gcc reserving an extra 16B of stack space for no reason. (MSVC reserves more space because it's targeting the Windows x64 calling convention, which reserves shadow space instead of having a red zone.)

clang saves code-size by using the LEA results for store addresses instead of repeating RSP-relative addresses (SIB+disp8). ICC and clang put the variables at the bottom of the space they reserved, so one of the addressing modes avoids a disp8. (With 3 variables, reserving 24 bytes instead of 8 was necessary, and clang didn't take advantage then.) gcc and MSVC miss this optimization.

But anyway, more optimal would be:

    push    2                       # only 2 bytes
    lea     rdi, [rsp + 4]
    mov     dword ptr [rdi], 1
    mov     rsi, rsp                # special case for lea rsi, [rsp + 0]
    call    extfunc(int*, int*)
      # ... later accesses would use [rsp] and [rsp+4] if needed, not pop
    pop     rax                     # alternative to add rsp,8
    ret

The push is an 8-byte store, and we overlap half of it. This is not a problem: CPUs can store-forward the unmodified low half efficiently even after storing the high half. Overlapping stores in general are not a problem, and in fact glibc's well-commented memcpy implementation uses two (potentially) overlapping loads + stores for small copies (up to the size of 2x xmm registers at least), to load everything then store everything, without caring about whether or not there's overlap.

Note that in 64-bit mode, 32-bit push is not available. So we still have to reference rsp directly for the upper half of the qword. But if our variables were uint64_t, or we didn't care about making them contiguous, we could just use push.

We have to reference RSP explicitly in this case to get pointers to the locals for passing to another function, so there's no getting around the extra stack-sync uop on Intel CPUs. In other cases maybe you just need to spill some function args for use after a call. (Although normally compilers will push rbx and mov rbx,rdi to save an arg in a call-preserved register, instead of spilling/reloading the arg itself, to shorten the critical path.)

I chose 2x 4-byte args so we could reach a 16-byte alignment boundary with 1 push, so we can optimize away the sub rsp, ## (or dummy push) entirely.

I could have used mov rax, 0x0000000200000001 / push rax, but the 10-byte mov r64, imm64 takes 2 entries in the uop cache, and a lot of code-size.
gcc7 does know how to merge two adjacent stores, but chooses not to do that for mov in this case. If both constants had needed 32-bit immediates, it would have made sense. But if the values weren't actually constant at all, and came from registers, this wouldn't work, while push / mov [rsp+4] would. (It wouldn't be worth merging values in a register with SHL + SHLD or whatever other instructions to turn 2 stores into 1.)

If you need to reserve space for more than one 8-byte chunk, and don't have anything useful to store there yet, definitely use sub instead of multiple dummy PUSHes after the last useful PUSH. But if you have useful stuff to store, push imm8 or push imm32, or push reg, are good.

We can see more evidence of compilers using "canned" sequences in the ICC output: it uses lea rdi, [rsp] in the arg setup for the call. It seems they didn't think to look for the special case of the address of a local being pointed to directly by a register, with no offset, allowing mov instead of lea. (mov is definitely not worse, and better on some CPUs.)


An interesting example of not making locals contiguous is a version of the above with 3 args, int a=1, b=2, c=3;. To maintain 16B alignment, we now need to offset 8 + 16*1 = 24 bytes, so we could do:

bar3:
    push   3
    push   2               # don't interleave mov in here; extra stack-sync uops
    push   1
    mov    rdi, rsp
    lea    rsi, [rsp+8]
    lea    rdx, [rdi+16]         # relative to RDI to save a byte with probably no extra latency even if MOV isn't zero latency, at least not on the critical path
    call   extfunc3(int*,int*,int*)
    add    rsp, 24
    ret

This is significantly smaller code-size than compiler-generated code, because mov [rsp+16], 2 has to use the mov r/m32, imm32 encoding, using a 4-byte immediate because there's no sign_extended_imm8 form of mov.

push imm8 is extremely compact, 2 bytes. mov dword ptr [rsp+8], 1 is 8 bytes: opcode + modrm + SIB + disp8 + imm32. (RSP as a base register always needs a SIB byte; the ModRM encoding with base=RSP is the escape code for a SIB byte being present. Using RBP as a frame pointer allows more compact addressing of locals (by 1 byte per insn), but takes 3 extra instructions to set up / tear down, and ties up a register. But it avoids further access to RSP, avoiding stack-sync uops. It could actually be a win sometimes.)

One downside to leaving gaps between your locals is that it may defeat load or store merging opportunities later. If you (the compiler) need to copy 2 locals somewhere, you may be able to do it with a single qword load/store if they're adjacent. Compilers don't consider all the future tradeoffs for the function when deciding how to arrange locals on the stack, as far as I know. We want compilers to run quickly, and that means not always back-tracking to consider every possibility for rearranging locals, or various other things. If looking for an optimization would take quadratic time, or multiply the time taken for other steps by a significant constant, it had better be an important optimization. (IDK how hard it might be to implement a search for opportunities to use push, especially if you keep it simple and don't spend time optimizing the stack layout for it.)

However, assuming there are other locals which will be used later, we can allocate them in the gaps between any we spill early. So the space doesn't have to be wasted: we can simply come along later and use mov [rsp+12], eax to store between two 32-bit values we pushed.


A tiny array of long, with non-constant contents

int ext_longarr(long *);
void longarr_arg(long a, long b, long c) {
    long arr[] = {a,b,c};
    ext_longarr(arr);
}

gcc/clang/ICC/MSVC follow their normal pattern, and use mov stores:

longarr_arg(long, long, long):                     # @longarr_arg(long, long, long)
    sub     rsp, 24
    mov     rax, rsp                 # this is clang being silly
    mov     qword ptr [rax], rdi     # it could have used [rsp] for the first store at least,
    mov     qword ptr [rax + 8], rsi   # so it didn't need 2 reg,reg MOVs to avoid clobbering RDI before storing it.
    mov     qword ptr [rax + 16], rdx
    mov     rdi, rax
    call    ext_longarr(long*)
    add     rsp, 24
    ret

But it could have stored an array of the args like this:

longarr_arg_handtuned:
    push    rdx
    push    rsi
    push    rdi                 # leave stack 16B-aligned
    mov     rdi, rsp
    call    ext_longarr(long*)
    add     rsp, 24
    ret

With more args, we start to get more noticeable benefits, especially in code-size, when more of the total function is spent storing to the stack. This is a very synthetic example that does nearly nothing else. I could have used volatile int a = 1;, but some compilers treat that extra-specially.


Reasons for not building stack frames gradually

(probably wrong) Stack unwinding for exceptions, and debug formats, I think don't support arbitrary playing around with the stack pointer. So at least before making any call instructions, a function is supposed to have offset RSP as much as it's going to for all future function calls in this function.

But that can't be right, because alloca and C99 variable-length arrays would violate that. There may be some kind of toolchain reason outside the compiler itself for not looking for this kind of optimization.

This gcc mailing list post about disabling -maccumulate-outgoing-args for tune=default (in 2014) was interesting. It pointed out that more push/pop led to larger unwind info (.eh_frame section), but that's metadata that's normally never read (if there are no exceptions), so the total binary is larger but the code is smaller / faster. Related: this shows what -maccumulate-outgoing-args does for gcc code-gen.

Obviously the examples I chose were trivial, where we're pushing the input parameters unmodified. More interesting would be when we calculate some things in registers from the args (and data they point to, and globals, etc.) before having a value we want to spill.

If you have to spill/reload anything between function entry and later pushes, you're creating extra stack-sync uops on Intel. On AMD, it could still be a win to do push rbx / blah blah / mov [rsp-32], eax (spill to the red zone) / blah blah / push rcx / imul ecx, [rsp-24], 12345 (reload the earlier spill from what's still the red zone, with a different offset).

Mixing push and [rsp] addressing modes is less efficient (on Intel CPUs, because of stack-sync uops), so compilers would have to carefully weigh the tradeoffs to make sure they're not making things slower. sub / mov is well-known to work well on all CPUs, even though it can be costly in code-size, especially for small constants.

"It's hard to keep track of the offsets" is a totally bogus argument. It's a computer; re-calculating offsets from a changing reference is something it has to do anyway when using push to put function args on the stack. I think compilers could run into problems (i.e. need more special-case checks and code, making them compile slower) if they had more than 128B of locals, because then you couldn't always mov store below RSP (into what's still the red zone) before moving RSP down with future push instructions.

Compilers already consider multiple tradeoffs, but currently growing the stack frame gradually isn't one of the things they consider. push wasn't as efficient before Pentium-M introduced the stack engine, so efficient push even being available is a somewhat recent change as far as redesigning how compilers think about stack-layout choices.

Having a mostly-fixed recipe for prologues and for accessing locals is certainly simpler.

This requires disabling stack frames as well though.

It doesn't, actually. Simple stack frame initialisation can use either enter or push ebp \ mov ebp, esp \ sub esp, x (or instead of the sub, a lea esp, [ebp - x] can be used). Instead of or in addition to these, values can be pushed onto the stack to initialise the variables, or you can just push any random register to move the stack pointer without initialising it to any certain value.

Here's an example (for 16-bit 8086 real/V86 mode) from one of my projects: https://bitbucket.org/ecm/symsnip/src/ce8591f72993fa6040296f168c15f3ad42193c14/binsrch.asm#lines-1465

save_slice_farpointer:
[...]
.main:
[...]
    lframe near
    lpar word,  segment
    lpar word,  offset
    lpar word,  index
    lenter
    lvar word,  orig_cx
     push cx
    mov cx, SYMMAIN_index_size
    lvar word,  index_size
     push cx
    lvar dword, start_pointer
     push word [sym_storage.main.start + 2]
     push word [sym_storage.main.start]

The lenter macro sets up (in this case) only push bp \ mov bp, sp, and then lvar sets up numeric defs for offsets (from bp) to variables in the stack frame. Instead of subtracting from sp, I initialise the variables by pushing into their respective stack slots (which also reserves the stack space needed).
