Why does the cmp instruction cost so much time?
I am trying to profile with libunwind (using Linux perf). Monitoring the target process with perf top, I get this assembly time-cost screen:
 0.19 │ mov %rcx,0x18(%rsp)
      │ trace_lookup():
 1.54 │ mov 0x8(%r9),%rcx
      │ _ULx86_64_tdep_trace():
 0.52 │ and $0x1,%edx
 0.57 │ mov %r14d,0xc(%rsp)
 0.40 │ mov 0x78(%rsp),%r10
 1.24 │ sub %rdx,%r15
      │ trace_lookup():
 0.35 │ shl %cl,%r12d
      │ _ULx86_64_tdep_trace():
 2.18 │ mov 0x90(%rsp),%r8
      │ trace_lookup():
 0.46 │ imul %r15,%r13
      │ _ULx86_64_tdep_trace():
 0.59 │ mov %r15,0x88(%rsp)
      │ trace_lookup():
 0.50 │ lea -0x1(%r12),%rdx
 1.22 │ shr $0x2b,%r13
 0.37 │ and %r13,%rdx
 0.57 │177: mov %rdx,%rbp
 0.43 │ shl $0x4,%rbp
 1.33 │ add %rdi,%rbp
 0.49 │ mov 0x0(%rbp),%rsi
24.40 │ cmp %rsi,%r15
      │ ↓ jne 420
      │ _ULx86_64_tdep_trace():
 2.10 │18e: movzbl 0x8(%rbp),%edx
 3.68 │ test $0x8,%dl
      │ ↓ jne 370
 1.27 │ mov %edx,%eax
 0.06 │ shl $0x5,%eax
 0.73 │ sar $0x5,%al
 1.70 │ cmp $0xfe,%al
      │ ↓ je 380
 0.01 │ ↓ jle 2f0
 0.01 │ cmp $0xff,%al
      │ ↓ je 3a0
 0.02 │ cmp $0x1,%al
      │ ↓ jne 298
 0.01 │ and $0x10,%edx
      │ movl $0x1,0x10(%rsp)
      │ movl $0x1,0x1c8(%rbx)
 0.00 │ ↓ je 393
The corresponding source code is here: trace_lookup source code. If I read it correctly, the source line corresponding to this hot-path cmp instruction is line 296, but I don't know why this line is so slow and costs most of the time?
The instruction cmp %rsi,%r15 is marked as having huge overhead because it waits for data to be loaded from cache or memory by the mov 0x0(%rbp),%rsi instruction. There is probably an L1 or even L2 cache miss on that load.
For the code fragment
│ trace_lookup():
0.50 │ lea -0x1(%r12),%rdx
1.22 │ shr $0x2b,%r13
0.37 │ and %r13,%rdx
0.57 │177: mov %rdx,%rbp
0.43 │ shl $0x4,%rbp
1.33 │ add %rdi,%rbp
0.49 │ mov 0x0(%rbp),%rsi
24.40 │ cmp %rsi,%r15
│ ↓ jne 420
you have 24% of the profiling events of the current function accounted to the cmp instruction. The default event for sampling profiling is "cycles" (a hardware event counting CPU clock cycles) or "cpu-clock" (a software event counting linear time). So around 24% of the sampling interrupts that hit this function were reported at the instruction address of this cmp. There is a systematic skew possible with profiling on modern out-of-order CPUs: the cost is reported not against the instruction that ran slowly, but against the instruction that could not finish its execution (retire) quickly.
This cmp+jne pair (a macro-fused uop) changes the program's instruction flow if the value in %rsi is not equal to the value in %r15. On older, in-order designs such an instruction would simply read two registers and compare their values, which is fast and should not take a quarter of the function's execution time. But in a modern CPU, registers are not just 32- or 64-bit places to store a value; they carry hidden state (renaming and readiness tracking) used by the out-of-order engine.
In your example there was mov 0x0(%rbp),%rsi, which changed the %rsi register. That instruction is a load from memory at the address in %rbp. The CPU starts the load in the cache/memory subsystem, marks %rsi as "load pending from memory", and keeps executing the following instructions. With luck, the next instructions do not need the result of that load, which takes some time (for example, on Haswell: 4 CPU cycles for an L1 hit, 12 for an L2 hit, 36-66 for an L3 hit, and an additional 50-100 ns for a miss that goes to RAM). But in your case the next instruction was cmp+jne, which reads %rsi, and it cannot finish until the data from memory is written into %rsi (the CPU may block in the middle of the cmp+jne or replay it many times). So the cmp shows 24% overhead because that mov missed the nearest caches.
With more advanced counters you can estimate which cache it missed, and which cache or memory layer served the request most often.
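As a sketch of that measurement (event names vary by CPU generation, so check `perf list` for what your machine supports; `<pid>` stands for the target process):

```shell
# Which per-level load events exist here? (e.g. Intel mem_load_*retired events)
perf list 2>/dev/null | grep -i 'mem_load'

# Generic cache counters, attached to the running process:
perf stat -e cycles,instructions,cache-references,cache-misses -p <pid> -- sleep 10

# Sample misses and attribute them to instructions:
perf record -e cache-misses -p <pid> -- sleep 10
perf report

# On CPUs with precise memory sampling, perf mem reports which level served each load:
perf mem record -p <pid> -- sleep 10
perf mem report
```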
The corresponding source code is here: trace_lookup source code. If I read it correctly, the source line corresponding to this hot-path cmp instruction is line 296, but I don't know why this line is so slow and costs most of the time?
With such a short asm fragment it can be hard to find the corresponding line in the source code of trace_lookup, or to work out which value was missing from the L1/L2 cache and why. You should try to write a short reproducible example.