Using LEA on values that aren't addresses / pointers?

I was trying to understand how the address-computation instruction works, especially the leaq instruction. Then I got confused when I saw examples using leaq to do arithmetic. For example, the following C code,

long m12(long x) {
return x*12;
}

In assembly,

leaq (%rdi, %rdi, 2), %rax
salq $2, %rax

If my understanding is right, leaq should move whatever the address (%rdi, %rdi, 2) evaluates to, which should be 2*%rdi+%rdi, into %rax. What confuses me is this: since the value x is stored in %rdi, which is just a memory address, why does multiplying %rdi by 3 and then left-shifting this memory address by 2 equal x times 12? When we multiply %rdi by 3, don't we jump to another memory address which doesn't hold the value x?

lea (see Intel's instruction-set manual entry) is a shift-and-add instruction that uses memory-operand syntax and machine encoding. This explains the name, but it's not the only thing it's good for. It never actually accesses memory, so it's like using & in C.

See for example How to multiply a register by 37 using only 2 consecutive leal instructions in x86?
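As a rough illustration of that idea, here is a minimal C sketch of one way two dependent LEAs could multiply by 37, using 37 = 1 + 4 * (1 + 8). The helper name lea and the function mul37 are hypothetical, and the linked question's answers may use a different decomposition. The helper only models the base + index*scale part of an addressing mode and ignores the displacement.

```c
#include <assert.h>
#include <stdint.h>

/* Models the base + index*scale part of one LEA. The scale factor is
   limited to 1, 2, 4, or 8, as in 32/64-bit addressing modes.
   Hypothetical helper; it only models the arithmetic. */
static uint32_t lea(uint32_t base, uint32_t index, uint32_t scale)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return base + index * scale;   /* wraps modulo 2^32, like a 32-bit register */
}

/* x * 37 as two dependent LEAs: 37 = 1 + 4 * (1 + 8). */
uint32_t mul37(uint32_t x)
{
    uint32_t t = lea(x, x, 8);     /* t = x + x*8 = 9x  */
    return lea(x, t, 4);           /* x + 9x*4    = 37x */
}
```

For example, mul37(3) evaluates 3 + 27*4, which is 111.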

In C, it's like uintptr_t foo = &arr[idx]. Note the & to give you the result of arr + idx, including scaling for the object size of arr. In C, this would be abuse of the language syntax and types, but in x86 assembly pointers and integers are the same thing. Everything is just bytes, and it's up to the program to put instructions in the right order to get useful results.
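To make the comparison concrete, the sketch below spells out the same base + index*scale computation twice: once on a real pointer and once on plain integers. The function names are made up for illustration, and the pointer-to-integer comparison assumes a mainstream flat-address platform (the C standard leaves that conversion implementation-defined).

```c
#include <assert.h>
#include <stdint.h>

/* Address of arr[idx] computed the way an addressing mode would:
   base + idx * element_size. Hypothetical helper for illustration. */
uintptr_t element_address(const int32_t *arr, uintptr_t idx)
{
    uintptr_t addr = (uintptr_t)arr + idx * sizeof(int32_t);  /* base + idx*4 */
    assert(addr == (uintptr_t)&arr[idx]);  /* same value as C's &arr[idx] on flat-address targets */
    return addr;
}

/* The identical shift-and-add on values that are not pointers at all. */
uint64_t base_plus_4x(uint64_t a, uint64_t b)
{
    return a + b * 4;   /* the math of lea (%rdi,%rsi,4), %rax */
}
```

Neither function loads from memory; both only produce a number, which is all lea does.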


The original designer / architect of 8086's instruction set (Stephen Morse) might or might not have had pointer math in mind as the main use-case, but modern compilers think of it as just another option for doing arithmetic on pointers / integers, and that's how you should think of it, too.

(Note that 16-bit addressing modes don't include shifts, just [BP|BX] + [SI|DI] + disp8/disp16, so LEA wasn't as useful for non-pointer math before 386. See this answer for more about 32/64-bit addressing modes, although that answer uses Intel syntax like [rax + rdi*4] instead of the AT&T syntax used in this question. x86 machine code is the same regardless of what syntax you use to create it.)

Maybe the 8086 architects did simply want to expose the address-calculation hardware for arbitrary uses because they could do it without using a lot of extra transistors. The decoder already has to be able to decode addressing modes, and other parts of the CPU have to be able to do address calculations. Putting the result in a register instead of using it with a segment-register value for memory access doesn't take many extra transistors. Ross Ridge confirms that LEA on original 8086 reuses the CPU's effective-address decoding and calculation hardware.


Note that most modern CPUs run LEA on the same ALUs as normal add and shift instructions. They have dedicated AGUs (address-generation units), but only use them for actual memory operands. In-order Atom is one exception; LEA runs earlier in the pipeline than the ALUs: inputs have to be ready sooner, but outputs are also ready sooner. Out-of-order execution CPUs (the vast majority for modern x86) don't want LEA to interfere with actual loads/stores, so they run it on an ALU.

lea has good latency and throughput, but not as good throughput as add or mov r32, imm32 on most CPUs, so only use lea when you can save an instruction with it instead of add. (See Agner Fog's x86 microarch guide and asm optimization manual.)


The internal implementation is irrelevant, but it's a safe bet that decoding the operands to LEA shares transistors with decoding addressing modes for any other instruction. (So there is hardware reuse / sharing even on modern CPUs that don't execute lea on an AGU.) Any other way of exposing a multi-input shift-and-add instruction would have taken a special encoding for the operands.

So 386 got a shift-and-add ALU instruction for "free" when it extended the addressing modes to include scaled-index, and being able to use any register in an addressing mode made LEA much easier to use for non-pointers, too.

x86-64 got cheap access to the program counter (instead of needing to read what call pushed) "for free" via LEA because it added the RIP-relative addressing mode, making access to static data significantly cheaper in x86-64 position-independent code than in 32-bit PIC. (RIP-relative does need special support in the ALUs that handle LEA, as well as the separate AGUs that handle actual load/store addresses. But no new instruction was needed.)


It's just as good for arbitrary arithmetic as for pointers, so it's a mistake to think of it as being intended for pointers these days. It's not an "abuse" or "trick" to use it for non-pointers, because everything's an integer in assembly language. It has lower throughput than add, but it's cheap enough to use almost all the time when it saves even one instruction. But it can save up to three instructions:

;; Intel syntax.
lea  eax, [rdi + rsi*4 - 8]   ; 3 cycle latency on Intel SnB-family
                              ; 2-component LEA is only 1c latency

 ;;; without LEA:
mov  eax, esi             ; maybe 0 cycle latency, otherwise 1
shl  eax, 2               ; 1 cycle latency
add  eax, edi             ; 1 cycle latency
sub  eax, 8               ; 1 cycle latency

On some AMD CPUs, even a complex LEA is only 2 cycle latency, but the 4-instruction sequence would be 4 cycle latency from esi being ready to the final eax being ready. Either way, this saves 3 uops that the front-end would otherwise have to decode and issue, and that would take up space in the reorder buffer all the way until retirement.
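The two instruction sequences above compute the same value, which a C sketch can make explicit. The function names below are invented for illustration; each statement in without_lea mirrors one instruction of the four-instruction version, and the casts model the truncation to a 32-bit destination.

```c
#include <assert.h>
#include <stdint.h>

/* The math of lea eax, [rdi + rsi*4 - 8]: one expression, truncated to 32 bits. */
uint32_t with_lea(uint64_t rdi, uint64_t rsi)
{
    return (uint32_t)(rdi + rsi * 4 - 8);
}

/* The mov / shl / add / sub sequence, one C statement per instruction. */
uint32_t without_lea(uint64_t rdi, uint64_t rsi)
{
    uint32_t eax = (uint32_t)rsi;  /* mov eax, esi */
    eax <<= 2;                     /* shl eax, 2   */
    eax += (uint32_t)rdi;          /* add eax, edi */
    eax -= 8;                      /* sub eax, 8   */
    return eax;
}

/* Both forms agree for any inputs, because the low 32 bits of add, sub,
   and left shift do not depend on the high bits of the inputs. */
void check_equivalent(uint64_t rdi, uint64_t rsi)
{
    assert(with_lea(rdi, rsi) == without_lea(rdi, rsi));
}
```

With rdi = 100 and rsi = 3, both functions return 100 + 12 - 8 = 104.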

lea has several major benefits, especially in 32/64-bit code where addressing modes can use any register and can shift:

  • non-destructive: output in a register that isn't one of the inputs. It's sometimes useful as just a copy-and-add like lea 1(%rdi), %eax or lea (%rdx, %rbp), %ecx.
  • can do 3 or 4 operations in one instruction (see above).
  • Math without modifying EFLAGS, can be handy after a test before a cmovcc. Or maybe in an add-with-carry loop on CPUs with partial-flag stalls.
  • x86-64: position independent code can use a RIP-relative LEA to get a pointer to static data.

    7-byte lea foo(%rip), %rdi is slightly larger and slower than mov $foo, %edi (5 bytes), so prefer mov r32, imm32 in position-dependent code on OSes where symbols are in the low 32 bits of virtual address space, like Linux. You may need to disable the default PIE setting in gcc to use this.

    In 32-bit code, mov edi, OFFSET symbol is similarly shorter and faster than lea edi, [symbol]. (Leave out the OFFSET in NASM syntax.) RIP-relative isn't available and addresses fit in a 32-bit immediate, so there's no reason to consider lea instead of mov r32, imm32 if you need to get static symbol addresses into registers.

Other than RIP-relative LEA in x86-64 mode, all of these apply equally to calculating pointers vs. calculating non-pointer integer add / shifts.

See also the tag wiki for assembly guides / manuals, and performance info.


Operand-size vs. address-size for x86-64 lea

See also Which 2's complement integer operations can be used without zeroing high bits in the inputs, if only the low part of the result is wanted? 64-bit address size and 32-bit operand size is the most compact encoding (no extra prefixes), so prefer lea (%rdx, %rbp), %ecx when possible instead of 64-bit lea (%rdx, %rbp), %rcx or 32-bit lea (%edx, %ebp), %ecx.

x86-64 lea (%edx, %ebp), %ecx is always a waste of an address-size prefix vs. lea (%rdx, %rbp), %ecx, but 64-bit address / operand size is obviously required for doing 64-bit math. (Agner Fog's objconv disassembler even warns about useless address-size prefixes on LEA with a 32-bit operand-size.)

Except maybe on Ryzen, where Agner Fog reports that 32-bit operand size lea in 64-bit mode has an extra cycle of latency. I don't know if overriding the address-size to 32-bit can speed up LEA in 64-bit mode if you need it to truncate to 32-bit.
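The reason the address-size prefix buys nothing for a 32-bit result can be checked with a small C model. The function names are invented for this sketch, and it only models the arithmetic result, not encoding size or latency: truncating the 64-bit sum gives the same low 32 bits as adding the truncated inputs.

```c
#include <assert.h>
#include <stdint.h>

/* lea (%rdx, %rbp), %ecx : 64-bit address math, result truncated to 32 bits. */
uint32_t lea_addr64_op32(uint64_t rdx, uint64_t rbp)
{
    return (uint32_t)(rdx + rbp);
}

/* lea (%edx, %ebp), %ecx : the address math itself is done at 32 bits
   (this form needs an address-size prefix in 64-bit mode). */
uint32_t lea_addr32_op32(uint64_t rdx, uint64_t rbp)
{
    return (uint32_t)rdx + (uint32_t)rbp;
}

/* Garbage in the high halves of the inputs never reaches the low 32 bits of a sum. */
void check_same_low_bits(uint64_t rdx, uint64_t rbp)
{
    assert(lea_addr64_op32(rdx, rbp) == lea_addr32_op32(rdx, rbp));
}
```

The same holds when a scale factor or displacement is involved, since left shifts and additions only carry information upward from low bits to high bits.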


This question is a near-duplicate of the very-highly-voted What's the purpose of the LEA instruction?, but most of the answers explain it in terms of address calculation on actual pointer data. That's only one use.

leaq doesn't have to operate on memory addresses. It computes an address but doesn't actually read from the result, so until a mov or the like tries to use it, it's just an esoteric way to add one number to 1, 2, 4 or 8 times another number (or the same number in this case). It's frequently "abused" for mathematical purposes, as you see. 2*%rdi+%rdi is just 3 * %rdi, so it's computing x * 3 without involving the multiplier unit on the CPU.

Similarly, left shifting an integer doubles the value for every bit shifted (every zero added on the right), thanks to the way binary numbers work (the same way that, in decimal, adding a zero on the right multiplies by 10).

So this is abusing the leaq instruction to accomplish multiplication by 3, then shifting the result to achieve a further multiplication by 4, for a final result of multiplying by 12 without ever actually using a multiply instruction (which the compiler presumably believes would run more slowly, and for all I know it could be right; second-guessing the compiler is usually a losing game).

To be clear, it's not abuse in the sense of misuse, just using it in a way that doesn't clearly align with the implied purpose you'd expect from its name. It's 100% okay to use it this way.

LEA is for calculating the address. It doesn't dereference the memory address.

It should be much more readable in Intel syntax:

m12(long):
  lea rax, [rdi+rdi*2]
  sal rax, 2
  ret

So the first line is equivalent to rax = rdi*3. Then the left shift multiplies rax by 4, which results in rdi*3*4 = rdi*12.
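Written back out in C, the two instructions correspond to the sketch below. The function names are hypothetical, and the shift is written as a multiplication by 4 because left-shifting a negative signed value is not well defined in older C standards, whereas sal on a register has no such restriction.

```c
#include <assert.h>

/* x * 12 the way the compiler emitted it: 12 = (1 + 2) * 4. */
long m12_lea_shift(long x)
{
    long rax = x + x * 2;   /* lea rax, [rdi+rdi*2] : rax = 3x                  */
    rax *= 4;               /* sal rax, 2 : shifting left by 2 multiplies by 4  */
    return rax;
}

/* The decomposition gives the same answer as a plain multiply
   (for inputs small enough that x * 12 does not overflow). */
void check_m12(long x)
{
    assert(m12_lea_shift(x) == x * 12);
}
```

For x = 5 the intermediate value is 15 and the result is 60. At no point is x treated as an address.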
