Timing of executing 8-bit and 64-bit instructions on a 64-bit x64/Amd64 processor

Question

Is there any execution timing difference between 8-bit and 64-bit instructions on a 64-bit x64/Amd64 processor, when those instructions are the same except for bit width? Is there a way to find the real processor timing of executing these 2 tiny assembly functions?

-Thanks.

# 64-bit instructions
add64:
     mov  $0x1, %rax
     add  $0x2, %rax
     ret

# 8-bit instructions
add8:
     mov  $0x1, %al
     add  $0x2, %al
     ret

Answer 1

Yes, there's a difference. mov $0x1, %al has a false dependency on the old value of RAX on most CPUs, including everything newer than Sandybridge. It's a 2-input 1-output instruction; from the CPU's point of view it's like add $1, %al as far as scheduling it independently or not relative to other uses of RAX. Only writing a 32 or 64-bit register starts a new dependency chain.

This means the AL return value of your add8 function might not be ready until after a cache miss for some independent work the caller happened to be doing in EAX before the call, but the RAX result of add64 could be ready right away for out-of-order execution to get started on later instructions in the caller that use the return value. (Assuming their other inputs are also ready.)
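A minimal sketch of that situation. The `caller` label and the use of RDI/RSI as pointers are hypothetical, added only for illustration; add8 and add64 are the functions from the question.

```asm
    .text
    .globl caller
caller:                         # hypothetical caller, not from the question
    mov     (%rdi), %eax        # independent work in EAX: a load that might miss in cache
    mov     %eax, (%rsi)        # the caller is done with that value once it is stored
    call    add8
    movzbl  %al, %edx           # where AL is not renamed separately, this waits for the load
    call    add64
    mov     %rax, %rcx          # only depends on add64's own two instructions
    ret

add8:
    mov     $0x1, %al           # merges into the old RAX, so it inherits the load dependency
    add     $0x2, %al
    ret

add64:
    mov     $0x1, %rax          # write-only for RAX: starts a new dependency chain
    add     $0x2, %rax
    ret
```

The architectural result is the same either way; the difference is only in when out-of-order execution can have the return value ready.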

See Why doesn't GCC use partial registers? and
How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent
and What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? - Important background for understanding performance on modern OoO exec CPUs.

Their code-size also differs: both the 8-bit instructions are 2 bytes long. (Thanks to the AL, imm8 short-form encoding; add $1, %dl would be 3 bytes.) The RAX instructions are 7 and 4 bytes long. This matters for L1i cache footprint (and on a large scale, for how many bytes have to get paged in from disk). On a small scale, it affects how many instructions can fit into a 16 or 32-byte fetch block if the CPU is doing legacy decode because the code wasn't already hot in the uop cache. Code alignment of later instructions is also affected by the varying lengths of previous instructions, sometimes affecting which branches alias each other.
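For reference, here is a listing of the encodings behind those byte counts, as GAS assembles them by default. The hex bytes in the comments follow the standard x86-64 opcode tables; you can check them on your own toolchain with `as sizes.s -o sizes.o && objdump -d sizes.o` (the file name is just a placeholder).

```asm
    .text
    mov  $0x1, %al       # b0 01                  2 bytes: opcode+reg, imm8
    add  $0x2, %al       # 04 02                  2 bytes: AL, imm8 short form
    add  $0x1, %dl       # 80 c2 01               3 bytes: opcode, ModRM, imm8
    mov  $0x1, %rax      # 48 c7 c0 01 00 00 00   7 bytes: REX.W, opcode, ModRM, imm32
    add  $0x2, %rax      # 48 83 c0 02            4 bytes: REX.W, opcode, ModRM, imm8
```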

https://agner.org/optimize/ explains the details of the pipelines of various x86 microarchitectures, including front-end decoding effects that can make instruction length matter beyond just code density in the I-cache / uop-cache.

Generally 32-bit operand-size is the most efficient (for performance, and pretty good for code-size). 32 and 8 are the operand-sizes that x86-64 can use without extra prefixes, and in practice with 8-bit, to avoid stalls and badness you need more instructions or longer instructions because 8-bit writes don't zero-extend. See The advantages of using 32bit registers/instructions in x86-64.
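As a small illustration of the "longer instructions" point, consider loading a byte from memory. The pointer in RDI is an assumption for the example, and the byte counts in the comments are for these specific addressing modes.

```asm
    .text
    mov     (%rdi), %al      # 8a 07      2 bytes, but merges into the old value of RAX
    movzbl  (%rdi), %eax     # 0f b6 07   3 bytes, zero-extends and starts a new dependency chain
```

The second form costs one more byte, but it writes the full register and so avoids the partial-register merge.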

A few instructions are actually slower in the ALUs for 64-bit operand-size, not just because of front-end effects. That includes div on most CPUs, and imul on some older CPUs. Also popcnt and bswap. See e.g. Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux.

Note that mov $0x1, %rax will assemble to 7 bytes with GAS, unless you use as -O2 (not the same as gcc -O2, see this for examples) to get it to optimize to mov $1, %eax, which has exactly the same architectural effects but is shorter (no REX or ModRM byte). Some assemblers do that optimization by default, but GAS doesn't. Why NASM on Linux changes registers in x86_64 assembly has more about why this optimization is safe and good, and why you should do it yourself in the source, especially if your assembler doesn't do it for you.
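Applied by hand to the question's add64, that could look like the sketch below. Using the 32-bit add as well is my own extension of the same idea; it is safe here only because the result (3) fits in 32 bits.

```asm
    .text
add64:
    mov  $0x1, %eax      # b8 01 00 00 00   5 bytes; writing EAX zero-extends into RAX
    add  $0x2, %eax      # 83 c0 02         3 bytes; RAX still ends up holding 3
    ret
```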

But other than the false dependency and code-size, they're the same for the back-end of the CPU: all those instructions are single-uop and can run on any scalar-integer ALU execution port ¹. (https://uops.info/ has automated test results for every form of every unprivileged instruction.)

Footnote 1: Excavator (last-gen Bulldozer-family) can also run mov $imm, %reg on 2 more ports (AGU) for 32 and 64-bit operand-size. But merging a new low-8 or low-16 into a full register needs an ALU port. So mov $1, %rax has 4/clock throughput on Excavator, but mov $1, %al only has 2/clock throughput. (And of course only if you use a few different destination registers, not actually AL repeatedly; that would be a latency bottleneck of 1/clock because of the false dependency from writing a partial register on that microarchitecture.)

Previous Bulldozer-family CPUs starting with Piledriver can run mov reg, reg (for r32 or r64) on EX0, EX1, AGU0, AGU1, while most ALU instructions including mov $imm, %reg can only run on EX0/1. Further extending the AGU ports' capabilities to also handle mov-immediate was a new feature in Excavator.

Fortunately Bulldozer was obsoleted by AMD's much better Zen architecture, which has 4 full scalar integer ALU ports / execution units. (And a wider front end and a uop cache, good caches, and generally doesn't suck in a lot of the ways that Bulldozer sucked.)

Is there a way to measure it?

Yes, but generally not in a function you call with call. Instead put it in an unrolled loop so you can run it lots of times with minimal other instructions. It's especially useful to look at CPU performance counter results to find front-end / back-end uop counts, as well as just the overall time for your loop.
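A minimal sketch of such a loop for Linux, using GAS's `.rept` directive to unroll. The file name `bench.s`, the iteration count, and the unroll factor are arbitrary choices for illustration. It could be built with `as bench.s -o bench.o && ld bench.o -o bench` and run under `perf stat -e cycles,instructions ./bench`.

```asm
# bench.s (hypothetical file name): latency test for a chain of dependent adds
    .globl _start
    .text
_start:
    mov   $100000000, %ecx      # loop iteration count (arbitrary)
    .p2align 6                  # align the loop top to 64 bytes
1:
    .rept 16                    # unroll so dec/jnz overhead is small
    add   $0x2, %al             # each add depends on the previous one
    .endr
    dec   %ecx
    jnz   1b

    xor   %edi, %edi            # exit status 0
    mov   $60, %eax             # Linux x86-64 system call number for exit
    syscall
```

Swapping the loop body for `add $0x2, %rax` gives the 64-bit version. If both forms have 1-cycle latency, the cycle count reported by perf should come out near 16 per iteration for either one, which is the "same for the back-end" point above; treat that as an expectation to check, not a measured result.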

You can construct your loop to measure latency or throughput; see RDTSCP in NASM always returns the same value (timing a single instruction). Also:

Assembly - How to score a CPU instruction by latency and throughput
Idiomatic way of performance evaluation?
Can x86's MOV really be "free"? Why can't I reproduce this at all? is a good specific example of constructing a microbenchmark to measure / prove something specific.
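For the throughput side, a sketch along the same lines writes several different destination registers so that no single dependency chain is the bottleneck. The file name, register choice, and counts are again assumptions for illustration; build and run it the same way as any other static GAS program (`as`, `ld`, then `perf stat`).

```asm
# tput.s (hypothetical file name): throughput test for mov-immediate to 8-bit registers
    .globl _start
    .text
_start:
    mov   $100000000, %ecx      # loop iteration count (arbitrary)
    .p2align 6
1:
    .rept 4                     # 16 movs per iteration in total
    mov   $1, %al               # four different destinations: four separate chains
    mov   $1, %dl
    mov   $1, %sil
    mov   $1, %dil
    .endr
    dec   %ecx
    jnz   1b

    xor   %edi, %edi            # exit status 0
    mov   $60, %eax             # Linux x86-64 system call number for exit
    syscall
```

Replacing the four destinations with %eax, %edx, %esi and %edi gives the 32-bit comparison. Comparing cycles per iteration between the two versions shows whether the partial-register writes cost throughput on the CPU being tested.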

Generally you don't need to measure yourself (although it's good to understand how; that helps you know what the measurements really mean). People have already done that for most CPU microarchitectures. You can predict performance for a specific CPU for some loops (if you can assume no stalls or cache misses) based on analyzing the instructions. Often that can predict performance fairly accurately, but medium-length dependency chains that OoO exec can only partially hide make it too hard to accurately predict or account for every cycle.

What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? has links to lots of good details, and stuff about CPU internals.
How many CPU cycles are needed for each assembly instruction? (You can't add up a cycle count for each instruction; front-end and back-end throughput, and latency, could each be the bottleneck for a loop.)

Answer 1: 4 2020-12-18 22:43:34