
Linux perf reporting cache misses for unexpected instruction

I'm trying to apply some performance engineering techniques to an implementation of Dijkstra's algorithm. In an attempt to find bottlenecks in the (naive and unoptimised) program, I'm using the perf command to record the number of cache misses. The relevant snippet of code is the following, which finds the unvisited node with the smallest distance:

for (int i = 0; i < count; i++) {
    if (!visited[i]) {
        if (tmp == -1 || dist[i] < dist[tmp]) {
            tmp = i;
        }
    }
}

For the LLC-load-misses metric, perf report shows the following annotation of the assembly:

       │             for (int i = 0; i < count; i++) {
  1.19 │ ff:   add    $0x1,%eax
  0.03 │102:   cmp    0x20(%rsp),%eax
       │     ↓ jge    135
       │                 if (!visited[i]) {
  0.07 │       movslq %eax,%rdx
       │       mov    0x18(%rsp),%rdi
  0.70 │       cmpb   $0x0,(%rdi,%rdx,1)
  0.53 │     ↑ jne    ff
       │                     if (tmp == -1 || dist[i] < dist[tmp]) {
  0.07 │       cmp    $0xffffffff,%r13d
       │     ↑ je     fc
  0.96 │       mov    0x40(%rsp),%rcx
  0.08 │       movslq %r13d,%rsi
       │       movsd  (%rcx,%rsi,8),%xmm0
  0.13 │       ucomis (%rcx,%rdx,8),%xmm0
 57.99 │     ↑ jbe    ff
       │                         tmp = i;
       │       mov    %eax,%r13d
       │     ↑ jmp    ff
       │                     }
       │                 }
       │             }

My question then is the following: why does the jbe instruction produce so many cache misses? This instruction should not have to retrieve anything from memory at all, if I am not mistaken. I figured it might have something to do with instruction cache misses, but even measuring only L1 data cache misses using L1-dcache-load-misses shows a lot of cache misses attributed to that instruction.

This stumps me somewhat. Could anyone explain this (in my eyes) odd result? Thank you in advance.

About your example:

There are several instructions just before and at the instruction with the high counter:

        │       movsd  (%rcx,%rsi,8),%xmm0
   0.13 │       ucomis (%rcx,%rdx,8),%xmm0
  57.99 │     ↑ jbe    ff

"movsd" loads word from (%rcx,%rsi,8) (some array access) into xmm0 register, and "ucomis" loads another word from (%rcx,%rdx,8) and compares it with just loaded value in xmm0 register. “movsd”将来自(%rcx,%rsi,8) (某些数组访问)的字加载到xmm0寄存器中,“ucomis”从(%rcx,%rdx,8)加载另一个字,并将其与xmm0中刚刚加载的值进行比较寄存器。 "jbe" is conditional jump which depends on compare outcome. “jbe”是条件跳跃,取决于比较结果。

Many modern Intel CPUs (and probably AMD too) can and will fuse (combine) certain combinations of operations "into a single uop, CMP+JCC" (realworldtech.com/nehalem/5), and cmp + conditional jump is a very common instruction combination to be fused (you can check it with the Intel IACA simulation tool; use version 2.1 for your CPU). A fused pair may be reported incorrectly in perf/PMUs/PEBS, with most events skewed towards one of the two instructions.

This code probably means that the expression "dist[i] < dist[tmp]" generates two memory accesses, and both values are used by the ucomis instruction, which is (partially?) fused with the jbe conditional jump. Either dist[i] or dist[tmp], or both expressions, generate a high number of misses. Any such miss blocks ucomis from producing its result and blocks jbe from letting the next instructions execute (or from retiring predicted instructions). So jbe may get all the fame of the high counters instead of the real memory-access instructions (and for a "far" event like a cache response there is some skew towards the last blocked instruction).

You may try to merge the visited[N] and dist[N] arrays into a single array[N] of struct { int visited; float dist; } to force prefetching of array[i].dist when you access array[i].visited; or you may try to change the order of vertex access, renumber the graph vertices, or do some software prefetch for the next one or more elements (?). A sketch of the merged layout follows.
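As a rough illustration of that first suggestion (the names are mine, and the distance type is an assumption: the 8-byte movsd stride in the disassembly hints at double, so match it to whatever your real dist array uses):

#include <stdbool.h>

/* Hypothetical merged layout: the visited flag and the distance sit next to
 * each other, so loading node[i].visited also tends to bring node[i].dist
 * into the same cache line. */
struct node {
    bool   visited;
    double dist;   /* assumption: movsd's 8-byte stride suggests double */
};

/* Same selection logic as the original loop, but over one contiguous
 * array of structs instead of two separate arrays. */
static int find_min_unvisited(const struct node *node, int count) {
    int tmp = -1;
    for (int i = 0; i < count; i++) {
        if (!node[i].visited) {
            if (tmp == -1 || node[i].dist < node[tmp].dist) {
                tmp = i;
            }
        }
    }
    return tmp;
}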


About problems with generic perf events (selected by name) and possible uncore skew:

The perf (perf_events) tool in Linux uses a predefined set of events when called as perf list, and some of the listed hardware events may not be implemented; others are mapped to the capabilities of the current CPU (and some mappings are not fully correct). Some basic information about the real PMU is in https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf (but it has more details for the related Nehalem-EP variant).

For your Nehalem (Intel Core i5 750 with an 8 MB L3 cache and without multi-CPU/multi-socket/NUMA support), perf will map the standard ("generic cache events") LLC-load-misses event to "OFFCORE_RESPONSE.ANY_DATA.ANY_LLC_MISS", as written in the best (and only) documentation of perf event mappings, the kernel source code:

http://elixir.free-electrons.com/linux/v4.8/source/arch/x86/events/intel/core.c#L1103

 u64 nehalem_hw_cache_event_ids ...
[ C(LL  ) ] = {
    [ C(OP_READ) ] = {
        /* OFFCORE_RESPONSE.ANY_DATA.LOCAL_CACHE */
        [ C(RESULT_ACCESS) ] = 0x01b7,
        /* OFFCORE_RESPONSE.ANY_DATA.ANY_LLC_MISS */
        [ C(RESULT_MISS)   ] = 0x01b7,
...
/*
 * Nehalem/Westmere MSR_OFFCORE_RESPONSE bits;
 * See IA32 SDM Vol 3B 30.6.1.3
 */
#define NHM_DMND_DATA_RD    (1 << 0)
#define NHM_DMND_READ       (NHM_DMND_DATA_RD)
#define NHM_L3_MISS (NHM_NON_DRAM|NHM_LOCAL_DRAM|NHM_REMOTE_DRAM|NHM_REMOTE_CACHE_FWD)
...
 u64 nehalem_hw_cache_extra_regs
  ..
 [ C(LL  ) ] = {
    [ C(OP_READ) ] = {
        [ C(RESULT_ACCESS) ] = NHM_DMND_READ|NHM_L3_ACCESS,
        [ C(RESULT_MISS)   ] = NHM_DMND_READ|NHM_L3_MISS,

I think this event is not precise: the CPU pipeline will post the (out-of-order) load request to the cache hierarchy and will execute other instructions. After some time (around 10 cycles to reach and get a response from L2, and 40 cycles to reach L3) there will be a response with the miss flag in the corresponding (offcore?) PMU, which increments the counter. On counter overflow, a profiling interrupt is generated from this PMU. In several CPU clock cycles it will reach the pipeline to interrupt it, and the perf_events subsystem's handler will handle this by registering the current (interrupted) EIP/RIP instruction pointer and resetting the PMU counter back to some negative value (for example, -100000 to get an interrupt for every 100000 L3 misses counted; use perf record -e LLC-load-misses -c 100000 to set the exact count, or perf will autotune the limit to get some default frequency). The registered EIP/RIP is not the IP of the load instruction, and it may also not be the EIP/RIP of the instruction that wants to use the loaded data.

But if your CPU is the only socket in the system and you access normal memory (not some mapped PCI-Express space), an L3 miss will in fact be served as a local memory access, and there are some counters for this... (https://software.intel.com/en-us/node/596851 - "Any memory requests missing here must be serviced by local or remote DRAM").

There are some listings of PMU events for your CPU.

There should be some information about the ANY_LLC_MISS offcore PMU event implementation and a list of PEBS events for Nehalem, but I can't find it now.

I can recommend using ocperf from https://github.com/andikleen/pmu-tools with any PMU events of your CPU, without the need to encode them manually. There are some PEBS events in your CPU, and there is latency profiling / perf mem for some kinds of memory access profiling (some random perf mem PDFs: the 2012 post "perf: add memory access sampling support", RH 2013 - pg26-30, still not documented in 2015 - sowa pg19, ls /sys/devices/cpu/events). For newer CPUs there are newer tools like ucevent.

I can also recommend trying the cachegrind profiler/cache simulator tool of the valgrind program, with the kcachegrind GUI to view profiles. Valgrind-based profilers may help you get a basic idea of how the code works: they collect exact instruction execution counts for every instruction, and cachegrind also simulates some abstract multi-level cache. But a real CPU will execute several instructions per cycle (so the callgrind/cachegrind cost model of 1 instruction = 1 CPU clock cycle has some error, and the cachegrind cache model does not follow the same logic as a real cache). And all valgrind tools are dynamic binary instrumentation tools, which will slow your program down 20-30 times compared to a native run.

When you read a memory location, the processor will try to prefetch the adjacent memory locations and cache them.

That works well if you are reading an array of objects which are all allocated in memory in contiguous blocks.

However, if for example you have an array of pointers to objects which live on the heap, it is less likely that you will be iterating over contiguous portions of memory, unless you are using some sort of custom allocator specifically designed for this.

Because of this, dereferencing should be seen as having some cost. An array of structs can be more efficient than an array of pointers to structs.
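A minimal sketch of the contrast (the types, names, and allocation pattern are illustrative, not taken from the question):

#include <stdlib.h>

struct item { double value; };

/* Array of structs: elements are contiguous, so a linear scan benefits
 * from hardware prefetching of adjacent cache lines. */
struct item *make_contiguous(size_t n) {
    return calloc(n, sizeof(struct item));
}

/* Array of pointers: every element is a separate heap allocation, so a
 * linear scan chases pointers that may be scattered across memory and
 * the prefetcher gets little help. */
struct item **make_scattered(size_t n) {
    struct item **p = calloc(n, sizeof(*p));
    for (size_t i = 0; p != NULL && i < n; i++) {
        p[i] = calloc(1, sizeof(struct item));
    }
    return p;
}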

Herb Sutter (a member of the C++ committee) talks about this in this presentation: https://youtu.be/TJHgp1ugKGM?t=21m31s
