Inconsistent LLC-loads values from perf stat

Question

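I'm trying to use perf stat to collect hardware counter information for benchmarking on an Intel Xeon processor (Skylake-based). When I pass -e LLC-loads together with the -d -d -d flags, perf stat prints LLC-loads twice: once because of -e LLC-loads, and once because the detailed flags are turned on. However, the two results are inconsistent: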

$ perf stat -e LLC-loads,LLC-stores,L1-dcache-loads,L1-dcache-stores -d -d -d <my benchmark executable>

Performance counter stats for '<my benchmark executable>':

        5145246847      LLC-loads                                                     (30.78%)
        8167130238      LLC-stores                                                    (30.80%)
      198057619358      L1-dcache-loads                                               (30.80%)
       83142567530      L1-dcache-stores                                              (30.80%)
      197792116698      L1-dcache-loads                                               (30.79%)
       27391515211      L1-dcache-load-misses     #   13.84% of all L1-dcache hits    (30.78%)
        5114059688      LLC-loads                                                     (30.78%)
        3025020254      LLC-load-misses           #   58.97% of all LL-cache hits     (30.76%)
   <not supported>      L1-icache-loads                                             
          58697135      L1-icache-load-misses                                         (30.75%)
      198322967573      dTLB-loads                                                    (30.74%)
         209105723      dTLB-load-misses          #    0.11% of all dTLB cache hits   (30.72%)
           2639992      iTLB-loads                                                    (30.74%)
           1368656      iTLB-load-misses          #   51.84% of all iTLB cache hits   (30.76%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   

      25.301480157 seconds time elapsed

     585.222699000 seconds user
       1.070800000 seconds sys

As the output shows, there are two LLC-loads entries with different values. What am I doing wrong?

I've tried several different benchmarks, on the assumption that this might be benchmark-specific, but the behavior shows up everywhere.

Answer 1

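Watch out for multiplexing, since you specified so many events: each one was counted for only (30.78%) of the total time, and the totals were extrapolated from that. Skylake has only 4 programmable counters per logical core that can count different hardware events at the same time.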
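To make the extrapolation concrete, here is a minimal Python sketch of the scaling arithmetic, assuming the usual rule that a multiplexed count is multiplied by time_enabled / time_running. The function name scale_count and all the numbers are hypothetical, chosen only for illustration; they are not taken from the run above.

```python
def scale_count(raw_count, time_enabled, time_running):
    """Extrapolate a multiplexed counter to the full measurement window.

    raw_count    : events actually counted while the event was on a HW counter
    time_enabled : how long the event was enabled (the whole run)
    time_running : how long it actually occupied a HW counter
    """
    if time_running == 0:
        raise ValueError("event was never scheduled on a counter")
    return raw_count * time_enabled / time_running


# Hypothetical numbers: an event that ran for 30.78% of a 25-second window.
time_enabled = 25.0
time_running = 0.3078 * time_enabled
raw = 1_000_000

estimate = scale_count(raw, time_enabled, time_running)
print(f"raw = {raw}, extrapolated = {estimate:.0f}")
```

The percentage perf stat prints in parentheses corresponds to time_running / time_enabled, so a value well below 100% means the printed count is an estimate rather than an exact total.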

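Your program isn't 100% uniform over time, and the sampling/extrapolation introduces noise, so the numbers are close but differ slightly (by about 0.6% here). (The multiplexing code doesn't merge an event that is specified twice; it just puts both instances into the rotation.)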
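A toy simulation can show why two instances of the same event disagree once they sit in different rotation slots. This is only a sketch under simplifying assumptions: equal-length time slices, strict round-robin between groups, and a made-up workload whose event rate ramps up over time. The real kernel scheduler is more involved, and multiplexed_estimates is a hypothetical helper.

```python
def multiplexed_estimates(rates, n_groups, slots):
    """Simulate round-robin multiplexing over equal time slices.

    rates[i] is the true number of events in slice i. Rotation slot g is
    counted only in slices where i % n_groups == g. Returns the
    extrapolated total for each slot in `slots`.
    """
    total_slices = len(rates)
    estimates = []
    for g in slots:
        active = [r for i, r in enumerate(rates) if i % n_groups == g]
        raw = sum(active)
        # Same scaling as above: raw * time_enabled / time_running
        estimates.append(raw * total_slices / len(active))
    return estimates


# Hypothetical non-uniform workload: the event rate grows during the run.
rates = [1000 + 10 * i for i in range(400)]
true_total = sum(rates)

# The same event placed in two different rotation slots out of four.
est_a, est_b = multiplexed_estimates(rates, n_groups=4, slots=(0, 2))
print(true_total, est_a, est_b)

# With a perfectly uniform workload the two estimates agree exactly.
flat_a, flat_b = multiplexed_estimates([1000] * 400, n_groups=4, slots=(0, 2))
print(flat_a == flat_b)
```

Both estimates land close to the true total but not on it, and not on each other, which is the same pattern as the two LLC-loads lines in the question.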

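If you counted only two instances of the event, without lots of other events, you'd expect the counts to be exactly equal, since both would be active at the same time on different HW counters. (Unless the first counter counts some events after being programmed, while the kernel is still programming the second. --all-user would avoid that by telling the hardware to count only while the logical core is in user space.) For example: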

$ perf stat -e LLC-loads,LLC-loads cmp /dev/zero /dev/full
^Ccmp: Interrupt

 Performance counter stats for 'cmp /dev/zero /dev/full':

            31,425      LLC-loads                                                          
            31,425      LLC-loads                                                          

       2.748813842 seconds time elapsed

       1.113722000 seconds user
       1.633880000 seconds sys

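(The count is low; I'd guess cmp uses a buffer small enough to fit in L3 cache. I used two different files that both read as all zeros, so it can't detect that they're the same file.)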

Related:

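Perf tool stat output: multiplex and scaling of "cycles" - instructions:D and cycles:D tell perf to always count those events; Intel CPUs have dedicated non-programmable counters for them, but the multiplexing code doesn't know that. You can do this for other events too, but it takes counter slots away from the events you didn't mark with :D.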
