
Why do Perf and PAPI give different values for L3 cache references and misses?

I am working on a project where we have to implement an algorithm that is proven in theory to be cache friendly. In simple terms, if N is the input size and B is the number of elements transferred between the cache and RAM on every cache miss, the algorithm will require O(N/B) accesses to RAM.

I would like to show that this is indeed the behavior in practice. To better understand how to measure various cache-related hardware counters, I decided to use different tools. One is Perf and the other is the PAPI library. Unfortunately, the more I work with these tools, the less I understand what they actually do.

I am using an Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz with 8 GB of RAM, 256 KB of L1 cache, 1 MB of L2 cache and 6 MB of L3 cache. The cache line size is 64 bytes, which I guess must be the size of the block B.
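On Linux with glibc the line size can also be queried programmatically through sysconf. A minimal sketch (the _SC_LEVEL* constants are a glibc extension and may report 0 if the value is unknown):

#include <iostream>
#include <unistd.h> // sysconf and the _SC_LEVEL*_CACHE_LINESIZE constants (glibc)

int main(){
    // Line sizes are reported in bytes; 0 means the value is not available.
    std::cout << "L1d line size: " << sysconf(_SC_LEVEL1_DCACHE_LINESIZE) << std::endl;
    std::cout << "L3 line size:  " << sysconf(_SC_LEVEL3_CACHE_LINESIZE) << std::endl;
    return 0;
}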

Let's look at the following example:

#include <iostream>

using namespace std;

struct node{
    int l, r;
};

int main(int argc, char* argv[]){

    int n = 1000000;

    node* A = new node[n];

    int i;
    for(i=0;i<n;i++){
        A[i].l = 1;
        A[i].r = 4;
    }

    return 0;
}

Each node requires 8 bytes, which means that a cache line can fit 8 nodes, so I should be expecting approximately 1000000/8 = 125000 L3 cache misses.

Without optimization (no -O3), this is the output from perf:

 perf stat -B -e cache-references,cache-misses ./cachetests 

 Performance counter stats for './cachetests':

       162,813      cache-references                                            
       142,247      cache-misses              #   87.368 % of all cache refs    

   0.007163021 seconds time elapsed

It is pretty close to what we are expecting. Now suppose that we use the PAPI library.

#include <iostream>
#include <papi.h>

using namespace std;

struct node{
    int l, r;
};

void handle_error(int err){
    std::cerr << "PAPI error: " << err << std::endl;
}

int main(int argc, char* argv[]){

    int numEvents = 2;
    long long values[2];
    int events[2] = {PAPI_L3_TCA,PAPI_L3_TCM};

    if (PAPI_start_counters(events, numEvents) != PAPI_OK)
        handle_error(1);

    int n = 1000000;
    node* A = new node[n];
    int i;
    for(i=0;i<n;i++){
        A[i].l = 1;
        A[i].r = 4;
    }

    if ( PAPI_stop_counters(values, numEvents) != PAPI_OK)
        handle_error(1);

    cout<<"L3 accesses: "<<values[0]<<endl;
    cout<<"L3 misses: "<<values[1]<<endl;
    cout<<"L3 miss/access ratio: "<<(double)values[1]/values[0]<<endl;

    return 0;
}

This is the output that I get:

L3 accesses: 3335
L3 misses: 848
L3 miss/access ratio: 0.254273

Why such a big difference between the two tools?

You can go through the source files of both perf and PAPI to find out which performance counter these events are actually mapped to, and it turns out they are the same (assuming an Intel Core i CPU here): event 2E with umask 4F for references and umask 41 for misses. In the Intel 64 and IA-32 Architectures Developer's Manual these events are described as:

2EH 4FH LONGEST_LAT_CACHE.REFERENCE This event counts requests originating from the core that reference a cache line in the last level cache.

2EH 41H LONGEST_LAT_CACHE.MISS This event counts each cache miss condition for references to the last level cache.
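One way to cross-check this on an Intel Core CPU is to hand perf the raw event codes directly, using a raw event of the form rUUEE (umask byte followed by event byte); the counts should line up with cache-references and cache-misses:

 perf stat -e r4f2e,r412e ./cachetests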

That seems to be fine, so the problem must lie somewhere else.

Here are my reproduced numbers, except that I increased the array length by a factor of 100 (I noticed large fluctuations in the timing results otherwise, and with a length of 1,000,000 the array still almost fits into your L3 cache). main1 here is your first code example without PAPI and main2 your second one with PAPI.

$ perf stat -e cache-references,cache-misses ./main1 

 Performance counter stats for './main1':

        27.148.932      cache-references                                            
        22.233.713      cache-misses              #   81,895 % of all cache refs 

       0,885166681 seconds time elapsed

$ ./main2 
L3 accesses: 7084911
L3 misses: 2750883
L3 miss/access ratio: 0.388273

These obviously don't match. Let's see where we actually count the LLC references. Here are the first few lines of perf report after running perf record -e cache-references ./main1:

  31,22%  main1    [kernel]          [k] 0xffffffff813fdd87
  16,79%  main1    main1             [.] main
   6,22%  main1    [kernel]          [k] 0xffffffff8182dd24
   5,72%  main1    [kernel]          [k] 0xffffffff811b541d
   3,11%  main1    [kernel]          [k] 0xffffffff811947e9
   1,53%  main1    [kernel]          [k] 0xffffffff811b5454
   1,28%  main1    [kernel]          [k] 0xffffffff811b638a
   1,24%  main1    [kernel]          [k] 0xffffffff811b6381
   1,20%  main1    [kernel]          [k] 0xffffffff811b5417
   1,20%  main1    [kernel]          [k] 0xffffffff811947c9
   1,07%  main1    [kernel]          [k] 0xffffffff811947ab
   0,96%  main1    [kernel]          [k] 0xffffffff81194799
   0,87%  main1    [kernel]          [k] 0xffffffff811947dc

What you can see here is that only 16.79% of the cache references actually happen in user space; the rest are due to the kernel.

And here lies the problem. Comparing this to the PAPI result is unfair, because PAPI by default only counts user-space events, whereas perf by default collects both user- and kernel-space events.

For perf we can easily restrict the collection to user space only:

$ perf stat -e cache-references:u,cache-misses:u ./main1 

 Performance counter stats for './main1':

         7.170.190      cache-references:u                                          
         2.764.248      cache-misses:u            #   38,552 % of all cache refs    

       0,658690600 seconds time elapsed

These seem to match pretty well.

Edit:

Let's look a bit closer at what the kernel does, this time with debug symbols and counting cache misses instead of references:

  59,64%  main1    [kernel]       [k] clear_page_c_e
  23,25%  main1    main1          [.] main
   2,71%  main1    [kernel]       [k] compaction_alloc
   2,70%  main1    [kernel]       [k] pageblock_pfn_to_page
   2,38%  main1    [kernel]       [k] get_pfnblock_flags_mask
   1,57%  main1    [kernel]       [k] _raw_spin_lock
   1,23%  main1    [kernel]       [k] clear_huge_page
   1,00%  main1    [kernel]       [k] get_page_from_freelist
   0,89%  main1    [kernel]       [k] free_pages_prepare

As we can see, most cache misses actually happen in clear_page_c_e. This is called when a new page is first accessed by our program. As explained in the comments, new pages are zeroed by the kernel before access is allowed, so the cache misses already happen there.

This interferes with your analysis, because a good part of the cache misses you expect happen in kernel space. However, you cannot guarantee under which exact circumstances the kernel actually accesses memory, so it may deviate from the behavior expected by your code.

To avoid this, build an additional loop around your array-filling one. Then only the first iteration of the inner loop incurs the kernel overhead; as soon as every page in the array has been accessed, there should be no kernel contribution left. A sketch of this change and my result for 100 repetitions of the outer loop are shown below:
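A minimal sketch of the repeated fill (my own illustration rather than the exact benchmark code; the array length of 100,000,000 and the 100 repetitions match the numbers used below):

struct node{
    int l, r;
};

int main(){
    int n = 100000000;   // 100x the original length
    node* A = new node[n];

    // Only the first pass over the array triggers the kernel's page zeroing;
    // every later pass finds the pages already mapped and touched.
    for(int rep = 0; rep < 100; rep++){
        for(int i = 0; i < n; i++){
            A[i].l = 1;
            A[i].r = 4;
        }
    }

    delete[] A;
    return 0;
}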

$ perf stat -e cache-references:u,cache-references:k,cache-misses:u,cache-misses:k ./main1

 Performance counter stats for './main1':

     1.327.599.357      cache-references:u                                          
        23.678.135      cache-references:k                                          
     1.242.836.730      cache-misses:u            #   93,615 % of all cache refs    
        22.572.764      cache-misses:k            #   95,332 % of all cache refs    

      38,286354681 seconds time elapsed

The array length was 100,000,000 with 100 iterations, so by your analysis you would have expected 1,250,000,000 cache misses. This is pretty close now. The deviation comes mostly from the first pass, which is loaded into the cache by the kernel during page clearing.

With PAPI, a few extra warm-up loops can be inserted before the counters start, so the result fits the expectation even better:

$ ./main2 
L3 accesses: 1318699729
L3 misses: 1250684880
L3 miss/access ratio: 0.948423
