x86-64 relative jmp performance

Question

I'm currently doing an assignment that measures the performance of various x86-64 instructions (AT&T syntax).

The instruction I'm somewhat confused about is the unconditional jmp. This is how I've implemented it:

    .global uncond
uncond:

    .rept 10000
    jmp . + 2
    .endr

    mov $10000, %rax
    ret

It's fairly simple. The code creates a function called "uncond" that uses the .rept directive to emit the jmp instruction 10000 times, then sets the return value to the number of jmp instructions executed.

"." in AT&T syntax means the current address, which I increase by 2 bytes to account for the jmp instruction itself (so jmp . + 2 should simply move to the next instruction).

Code that I haven't shown calculates the number of cycles it takes to execute the 10000 instructions.
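The jmp . + 2 form only lands on the next instruction if the assembler picks the 2-byte short encoding of jmp. A minimal sketch of an alternative, assuming GNU as: jump to a numeric local label placed right after each jmp, so the target no longer depends on a hand-counted size. The function name uncond_labels is made up for illustration.

```asm
    .global uncond_labels      # hypothetical name, not from the original post
uncond_labels:

    .rept 10000
    jmp 1f                     # "1f" = the nearest following "1:" label
1:                             # numeric local labels may be redefined in each repetition
    .endr

    mov $10000, %rax           # return the number of jmp instructions emitted
    ret
```

This should measure the same thing as the original function; only the way the jump target is written changes.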

My results say jmp is pretty slow (about 10 cycles per jmp instruction), but from what I understand about pipelining, unconditional jumps should be very fast (no branch mispredictions).

Am I missing something? Is my code wrong?

Answer 1

The CPU isn't optimized for no-op jmp instructions, so it doesn't handle the special case of continuing to decode and pipeline jmp instructions that just jump to the next insn.

CPUs are optimized for loops, though. jmp . will run at one insn per clock on many CPUs, or one per 2 clocks on some CPUs.

A jump creates a bubble in instruction fetching. A single well-predicted jump is ok, but running nothing but jumps is problematic. I reproduced your results on a core2 E6600 (Merom/Conroe microarch):
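To check the loop claim on your own machine, one option is a program whose only branch is a taken loop branch. This is a sketch under stated assumptions: the file name loop-test.S and the label loop_test are invented here, and it uses dec/jnz instead of jmp . so that the program terminates. No results are claimed for it; compare the cycles and branches counts from perf stat yourself.

```asm
# loop-test.S  (hypothetical file name)
.globl _start
_start:

    mov $1000000000, %ecx      # iteration count, fits in 32 bits
loop_test:
    dec %ecx
    jnz loop_test              # taken on every iteration except the last

    mov $231, %eax             # __NR_exit_group on x86-64 Linux
    xor %edi,%edi              # exit status 0 (first syscall argument)
    syscall
```

It can be built with gcc -static -nostartfiles in the same way as any other stand-alone _start program.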

# jmp-test.S
.globl _start
_start:

    mov $100000, %ecx
jmp_test:
    .rept 10000
    jmp . + 2
    .endr

    dec %ecx
    jg jmp_test


    mov $231, %eax
    xor %edi,%edi    #  exit status 0 (first syscall argument is in %rdi)
    syscall          #  exit_group(0)

Build and run with:

gcc -static -nostartfiles jmp-test.S
perf stat -e task-clock,cycles,instructions,branches,branch-misses ./a.out

 Performance counter stats for './a.out':

       3318.616490      task-clock (msec)         #    0.997 CPUs utilized          
     7,940,389,811      cycles                    #    2.393 GHz                      (49.94%)
     1,012,387,163      instructions              #    0.13  insns per cycle          (74.95%)
     1,001,156,075      branches                  #  301.679 M/sec                    (75.06%)
           151,609      branch-misses             #    0.02% of all branches          (75.08%)

       3.329916991 seconds time elapsed

From another run:

 7,886,461,952      L1-icache-loads           # 2377.687 M/sec                    (74.95%)
     7,715,854      L1-icache-load-misses     #    2.326 M/sec                    (50.08%)
 1,012,038,376      iTLB-loads                #  305.119 M/sec                    (75.06%)
           240      iTLB-load-misses          #    0.00% of all iTLB cache hits   (75.02%)

(The numbers in (%) at the end of each line show what fraction of the total run time that counter was active for: perf has to multiplex when you ask it to count more events than the HW can count at once.)

So it's not actually I-cache misses, it's just instruction fetch/decode frontend bottlenecks caused by constant jumps.

My SnB machine is broken, so I can't test numbers on it, but 8 cycles per jmp sustained throughput (about 7.94G cycles for about 1.00G branches in the run above) is pretty close to your results (which were probably from a different microarchitecture).

For more details, see http://agner.org/optimize/ and other links from the x86 tag wiki.

Answer 1 (accepted, 2016-04-24 05:38:07)