How does RIP-relative addressing perform compared to mov reg, imm64?

Question

It is a known fact that x86-64 instructions do not support 64-bit immediate values (except for mov). Hence, when migrating code from 32 to 64 bits, an instruction like this:

    cmp rax, addr32

cannot be replaced with the following:

    cmp rax, addr64

Under these circumstances, I'm considering two alternatives: (a) using a scratch register to load the constant, or (b) using RIP-relative addressing. The two approaches look like this:

    mov r11, addr64 ; scratch register
    cmp rax, r11

    ptr64: dq addr64
    ...
    cmp rax, [rel ptr64]    ; encoded as cmp rax, [rip+offset]

I wrote a very simple loop to compare the performance of both approaches (pasted below). While (b) uses an indirect pointer, (a) has the immediate encoded in the instruction (which could lead to worse i-cache usage). Surprisingly, I found that (b) ran ~10% faster than (a). Is this result to be expected in more common real-world code?

    true:  dq 0xFFFF0000FFFF0000
    false: dq 0xAAAABBBBAAAABBBB

    main:
        or rax, 1  ; rax is odd and constant "true" is even
        mov rcx, 0x1
        shl rcx, 30
    branch:
        mov r11, 0xFFFF0000FFFF0000 ; not present in (b)
        cmp rax, r11                ; vs cmp rax, [rel true]
        je next
        add rax, 2
        loop branch

    next:
        mov rax, 0
        ret

Answer 1

Surprisingly, I found that (b) ran ~10% faster than (a)

You probably tested on a CPU other than AMD Bulldozer-family or Ryzen, which have a fast loop instruction. On other CPUs, loop is very slow, mostly on purpose for historical reasons, so you bottleneck on it: e.g. 7 uops and one per 5 cycles throughput on Haswell.

mov r64, imm64 is bad for uop cache throughput because the large immediate takes 2 slots in Intel's uop cache. (See the Sandybridge uop cache section in Agner Fog's microarch pdf, and "Which is faster, imm64 or m64 for x86-64?", where I listed the details.)

Even apart from that, it's not too surprising that 1 extra uop in the loop makes it run slower. You're probably not on an AMD CPU (where loop is a single uop with one per 2 clocks throughput), because the extra mov in such a tiny loop would make more than 10% difference. Or no difference at all, since it's just 3 vs. 4 uops per 2 clocks, if it's correct that even tiny loop loops are limited to one jump per 2 clocks.

On Intel, loop is 7 uops with one per 5 clocks throughput on most CPUs, so the 4-per-clock issue/rename bottleneck won't be what you're hitting. loop is micro-coded, so the front-end can't run from the loop buffer. (And Skylake CPUs have their LSD disabled by a microcode update to fix the partial-register erratum anyway.) So the mov r64,imm64 uop has to be re-read from the uop cache every time through the loop.

A load that hits in cache has very good throughput (2 loads per clock, and in this case micro-fusion means no extra uops to use a memory operand instead of a register for cmp). So the main penalty in using a constant from memory is the extra cache footprint and cache misses, but your microbenchmark won't reveal that at all. It also has no other pressure on the load ports.

In the general case:

If possible, use a RIP-relative lea to generate 64-bit address constants, e.g. lea rax, [rel addr64]. Yes, this takes an extra instruction to get the constant into a register. (BTW, just use default rel. You can use [abs fs:0] if you need it.)
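As a minimal sketch of the lea approach (NASM syntax; the names `target` and `is_target` are hypothetical and chosen only for illustration, and the System V AMD64 calling convention is assumed):

```nasm
default rel                 ; make [symbol] operands RIP-relative by default

section .data
target: dq 0                ; hypothetical static object whose address we compare against

section .text
global is_target

; Returns 1 in eax if the pointer in rdi equals the address of target, else 0.
is_target:
    lea   rdx, [target]     ; RIP-relative address, 7 bytes (vs. 10 for mov r64, imm64)
    xor   eax, eax          ; clear the result before the compare (xor writes flags)
    cmp   rdi, rdx          ; compare against the 64-bit address held in rdx
    sete  al                ; al = 1 if equal
    ret
```

Because the address is computed relative to RIP, this needs no absolute relocation and stays valid in position-independent code.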

You can avoid the extra instruction if you build position-dependent code with the default (small) code model, so static addresses fit in the low 32 bits of virtual address space and can be used as immediates. (Actually the low 2GiB, so sign extension and zero extension both work.) See "32-bit absolute addresses no longer allowed in x86-64 Linux?" if gcc complains about absolute addressing; -pie is enabled by default on most distros. This of course doesn't work in Linux shared libraries, which only support text relocations for 64-bit addresses. But you should avoid relocations whenever possible by using lea to make position-independent code.

Most integer build-time constants fit in 32 bits, so you can use cmp r64, imm32 or cmp r32, imm32 even in PIC code.
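A short illustration of which constants fit (NASM syntax; the label `compare_examples` and the constants are arbitrary, picked only to show the encodings):

```nasm
section .text
compare_examples:
    cmp   rax, 1000          ; fits: the immediate is sign-extended to 64 bits
    cmp   rax, -1            ; fits: -1 sign-extends to 0xFFFFFFFFFFFFFFFF
    cmp   eax, 0xFFFF0000    ; 32-bit operand size: the imm32 is compared as-is

    ; 0x00000000FFFF0000 is not a sign-extended imm32 (bit 31 is set but the
    ; upper half is zero), so a 64-bit compare needs the value in a register.
    ; Writing a 32-bit register zero-extends, so the short mov form is enough.
    mov   edx, 0xFFFF0000    ; rdx = 0x00000000FFFF0000 (5-byte encoding)
    cmp   rax, rdx
    ret
```

Only constants that need non-trivial upper 32 bits, like the 0xFFFF0000FFFF0000 in the question, require mov r64, imm64 or a load from memory.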

If you do need a 64-bit non-address constant, try to hoist the mov r64, imm64 out of a loop. Your cmp loop would have been fine if the mov wasn't inside the loop. x86-64 has enough registers that you (or the compiler) can usually avoid reloads inside inner-most loops in integer code.
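Applied to the loop from the question, hoisting might look like the sketch below (NASM syntax). Replacing loop with dec/jnz is my substitution, not part of the original code; it avoids the slow loop instruction but, unlike loop, it modifies flags, which is harmless here.

```nasm
section .text
global main

main:
    or    rax, 1                     ; rax is odd, the constant is even (as in the question)
    mov   r11, 0xFFFF0000FFFF0000    ; 64-bit constant loaded once, outside the loop
    mov   ecx, 1 << 30               ; iteration count; writing ecx zero-extends into rcx
.branch:
    cmp   rax, r11                   ; register compare, no immediate or load in the loop
    je    .next
    add   rax, 2
    dec   rcx
    jnz   .branch                    ; same trip count as "loop branch"
.next:
    xor   eax, eax                   ; return 0
    ret
```

With the constant in a register, the loop body no longer carries the 10-byte mov, so the difference between variants (a) and (b) should largely disappear. That expectation follows from the reasoning above and is not a measured result.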

Answer 1
3 2018-01-17 09:12:38