
Measure L1 data cache miss with perf and papi

What is the difference between PAPI_L1_LDM in papi and L1-dcache-load-misses in perf?

I used the same setup as in the post linked here.

So, as a result I get for papi:

PAPI_L1_DCM: 515 <- L1 data cache miss (probably L1D_READ_MISSES_ALL + L1D_READ_MISSES_RETRIED?)
PAPI_L1_ICM: 300 <- L1 Instruction cache miss
PAPI_L1_LDM: 441 <- L1 Load data miss
PAPI_L1_TCM: 815 <- L1 Total cache miss

Unfortunately, PAPI_L1_DCA is not supported on this machine.

For perf I counted user space only, since PAPI also measures only user space and not kernel space. The call was:

    perf stat -B -e L1-dcache-load-misses:u,cache-misses:u ./perf

which printed:

    16,539      L1-dcache-load-misses
       128      cache-misses:u  

16,539 seems more reasonable for N=1000000. What is the difference between a load data miss (PAPI_L1_LDM in papi) and a data cache miss (PAPI_L1_DCM in papi), and why do these numbers differ between papi and perf? Is cache-misses:u in perf related to L2 cache misses?

Edit: the hardware is a Xeon E5-2600 v3 family CPU (Haswell-EP, 12 cores).

Some Explanation:

From the PAPI man page, you can see that PAPI_L1_LDM = "number of Load misses". In other words, PAPI_L1_LDM counts only the misses caused by loads (and sometimes prefetches).

A load is when your program executes a load instruction to read from memory.

A prefetch is when the processor guesses that you are going to load some memory in the near future and fetches it ahead of time.


In the event name L1-dcache-load-misses:

  • L1 is the Level-1 cache, the smallest and fastest one. LLC, on the other hand, refers to the last level of the cache hierarchy, that is, the largest but slowest cache.
  • i vs. d distinguishes the instruction cache from the data cache. Only L1 is split in this way; the other caches are shared between data and instructions.

You seem to think that cache-misses:u in perf is related to L2 cache misses. That is not the case.

The cache-misses event represents the number of memory accesses that could not be served by any of the caches. In practice it usually corresponds to last-level cache misses, so it is not an L2-specific count.

I admit that perf's documentation is not the best around.

However, one can learn quite a lot by reading the documentation of the perf_event_open() function. It assumes that you already have a good knowledge of how a CPU and a performance monitoring unit work; it is clearly not a computer architecture course.

For example, by reading it you can see that the cache-misses event shown by perf list corresponds to PERF_COUNT_HW_CACHE_MISSES.

  • Further, you can find that L1-dcache-load-misses is a Hardware cache event (PERF_TYPE_HW_CACHE), while cache-misses is a generic Hardware event (PERF_TYPE_HARDWARE).
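To make the distinction between the two event classes concrete, here is a minimal sketch of how each name ends up as a (type, config) pair in perf_event_attr, following the encoding described in the perf_event_open() documentation: for a hardware cache event, config = cache id | (op id << 8) | (result id << 16). The numeric constants are copied by hand from <linux/perf_event.h>, and the helper names hw_cache_config and describe are hypothetical, introduced only for illustration. This is not how the perf tool itself parses event names.

```python
# Constants copied by hand from <linux/perf_event.h> (assumed to match your kernel headers).
PERF_TYPE_HARDWARE = 0
PERF_TYPE_HW_CACHE = 3

PERF_COUNT_HW_CACHE_MISSES = 3   # the generic "cache-misses" event

PERF_COUNT_HW_CACHE_L1D = 0      # L1 data cache
PERF_COUNT_HW_CACHE_OP_READ = 0  # "load"
PERF_COUNT_HW_CACHE_RESULT_MISS = 1


def hw_cache_config(cache_id, op_id, result_id):
    """Pack the three fields of a hardware cache event into one config value."""
    return cache_id | (op_id << 8) | (result_id << 16)


def describe(event_name):
    """Hypothetical helper: map the two event names from the question to (type, config)."""
    if event_name == "cache-misses":
        return (PERF_TYPE_HARDWARE, PERF_COUNT_HW_CACHE_MISSES)
    if event_name == "L1-dcache-load-misses":
        config = hw_cache_config(PERF_COUNT_HW_CACHE_L1D,
                                 PERF_COUNT_HW_CACHE_OP_READ,
                                 PERF_COUNT_HW_CACHE_RESULT_MISS)
        return (PERF_TYPE_HW_CACHE, config)
    raise ValueError("event not covered by this sketch: " + event_name)


if __name__ == "__main__":
    for name in ("L1-dcache-load-misses", "cache-misses"):
        ev_type, config = describe(name)
        print(name, "type =", ev_type, "config =", hex(config))
```

The two events differ in type as well as in config, which is why they cannot be compared as if one were simply the L2 version of the other.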

As for the difference in your numbers, see this answer for the reason. It recommends increasing the size of your array by a factor of 100 or even 10000, explaining: "I noticed large fluctuations in timing results otherwise and with length of 1,000,000 the array almost fits into your L3 cache still."
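The point about array size can be checked with some quick arithmetic. The sketch below estimates the working set and the number of cache lines one sequential pass touches. It assumes 4-byte elements (the question does not state the element type) and takes the L3 capacity as a parameter, with 30 MiB used only as a placeholder; look up the real value for your CPU. The 64-byte line size is the standard one on Haswell.

```python
CACHE_LINE_BYTES = 64  # cache line size on Haswell


def working_set_bytes(n_elements, element_size):
    """Total bytes occupied by the array."""
    return n_elements * element_size


def cache_lines_touched(n_elements, element_size, line_size=CACHE_LINE_BYTES):
    """Number of distinct cache lines covered by the array (ceiling division)."""
    total = working_set_bytes(n_elements, element_size)
    return -(-total // line_size)


def fits_in_cache(n_elements, element_size, cache_bytes):
    """True if the whole array fits in a cache of the given capacity."""
    return working_set_bytes(n_elements, element_size) <= cache_bytes


if __name__ == "__main__":
    n = 1_000_000
    elem = 4                        # assumption: 4-byte elements
    assumed_l3 = 30 * 1024 * 1024   # assumption: placeholder L3 size, not taken from the question
    print("working set:", working_set_bytes(n, elem), "bytes")
    print("cache lines:", cache_lines_touched(n, elem))
    print("fits in assumed L3:", fits_in_cache(n, elem, assumed_l3))
    print("fits after x100:", fits_in_cache(100 * n, elem, assumed_l3))
```

Under these assumptions the array covers 62,500 cache lines. That makes a three-digit L1 miss count hard to explain by demand loads alone, while a count in the tens of thousands is at least the right order of magnitude once prefetching is taken into account.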
