
Measure cache access time/cycles for the ARM Cortex-A15

I measured the cycles for accessing the L2 cache of the ARM Cortex-A15. I did this by allocating one byte and then doing the following:

  • invalidate the address
  • read the PMCCNTR register
  • access the memory location of the allocated byte with ldr
  • read the PMCCNTR register again
  • subtract first measurement from second

I got about 240 cycles for cached access and about 350 for uncached access. I also used ISB, DMB and DSB. Do these numbers sound accurate to you? I can't seem to find official resources to compare with. Maybe you can point me in the right direction.

With your approach you are not measuring the latency; you are mostly measuring the overhead of the measurement itself.

A standard approach to measuring latency is a pointer-chasing test. You initialize a chain of pointers so that every access depends on the previous one, and you control their placement so that the chain fits (or does not fit) in a cache of a given size. The rest of the procedure is the same, except that you don't invalidate anything.

Something like this (for illustration, not tested)

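// Prepare a chain of N pointers in a buffer.
// Each slot holds the address of another slot, so every load
// depends on the result of the previous one.
void *Buffer[N];

// Chain them, here in a simple direct fashion.
// You can also use a randomized sequence if you work in main memory.
for (i = 1; i < N; i++) { Buffer[i] = &Buffer[i-1]; }

// Close the chain.
Buffer[0] = &Buffer[N-1];

// Measure M dependent accesses.
// PMCCNTR() stands for "read the cycle counter register".
Start = PMCCNTR();
p = (void **) &Buffer[0];
for (i = M; i > 0; i--) {
  p = (void **) *p;
}
Stop = PMCCNTR();

// Use p afterwards (e.g. print it) so the compiler cannot drop the loop.
// The amortized latency per access is roughly (Stop - Start) / M.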
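For reference, the same idea as a minimal, portable sketch that can be compiled on any host. It is not the code above made official: the names build_chain, chase and ns_per_access are made up for this illustration, the chain is randomized with rand(), and timing uses the standard clock() instead of PMCCNTR, so it reports nanoseconds rather than cycles. On the Cortex-A15 you would replace the two clock() calls with reads of the cycle counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Link the n slots of buf into one cycle that visits the slots in a
   shuffled order. Following the pointers touches every slot exactly
   once per lap. (Hypothetical helper, not from the original answer.) */
static void build_chain(void **buf, size_t n, unsigned seed)
{
    assert(n >= 2);
    size_t *order = malloc(n * sizeof *order);
    assert(order != NULL);
    for (size_t i = 0; i < n; i++)
        order[i] = i;

    /* Fisher-Yates shuffle of the visiting order. */
    srand(seed);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i];
        order[i] = order[j];
        order[j] = t;
    }

    /* Slot order[k] points to slot order[k+1]; the last one closes the chain. */
    for (size_t k = 0; k < n; k++)
        buf[order[k]] = &buf[order[(k + 1) % n]];
    free(order);
}

/* Follow m links. Each load needs the result of the previous load. */
static void **chase(void **start, size_t m)
{
    void **p = start;
    for (size_t i = 0; i < m; i++)
        p = (void **)*p;
    return p;
}

/* Keeps the result of the chase alive so the loop is not optimized away. */
static void *volatile chase_sink;

/* Amortized time per access in nanoseconds, measured with clock(). */
static double ns_per_access(void **buf, size_t m)
{
    clock_t t0 = clock();
    chase_sink = chase(buf, m);
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / (double)m;
}
```

To target one cache level, pick n so that n * sizeof(void *) fits in that level but not in the one above it, call build_chain once, run one untimed lap to warm the cache, and then call ns_per_access with m much larger than n.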

Measuring a single access is subject to inaccuracy from measurement overhead and random interference. You should time a large number of accesses to get an amortized latency that better reflects what you want. To measure the average access time, you also need to make sure these accesses do not run in parallel (that would measure throughput, not latency). So add a false dependency, for example by adding the content of the previously accessed byte to the next address (after initializing all these bytes to zero).
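A minimal sketch of that false-dependency trick, under a few assumptions of mine: the function name dependent_walk is hypothetical, the buffer size must be a power of two so the wrap-around can be a cheap mask, and the buffer is zero-filled before the walk. The timing itself is left out; you would wrap the call in two cycle-counter reads and divide by m.

```c
#include <stddef.h>
#include <stdint.h>

/* Walk a zero-filled byte buffer with a fixed stride.
   The next index is computed from the byte that was just loaded, so
   the loads form a dependency chain even though the byte is always 0.
   Assumption: size is a power of two (wrap-around uses a mask). */
static size_t dependent_walk(const volatile uint8_t *buf, size_t size,
                             size_t stride, size_t m)
{
    size_t idx = 0;
    for (size_t i = 0; i < m; i++) {
        uint8_t v = buf[idx];                    /* 0 for a zeroed buffer */
        idx = (idx + stride + v) & (size - 1);   /* next address needs v  */
    }
    return idx;   /* returning idx keeps the chain observable */
}
```

The volatile qualifier forces the compiler to perform every load. A stride of one cache line (for example 64 bytes, which you should check against your core's documentation) makes each access touch a new line.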

Also, you didn't say how you were invalidating the address, but I'm guessing that you also evicted it from the L2, and so you are actually measuring only memory latency.
