
Measure cache access time/cycles for the ARM Cortex-A15

I measured the cycles needed to access the L2 cache of the ARM Cortex-A15. I did this by allocating one byte and then:

  • invalidate the address
  • read the PMCCNTR register
  • access the memory location of the allocated byte with ldr
  • read the PMCCNTR register again
  • subtract the first measurement from the second
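For reference, the counter-read part of these steps could be sketched as below. This is only an illustration under stated assumptions, not the code used in the question: `read_cycle_counter` and `time_one_load` are hypothetical names, the invalidate step is left out because cache maintenance by address is a privileged operation, and the ARM path assumes the cycle counter has already been enabled (PMCR, PMCNTENSET) and that user-mode access has been granted through PMUSERENR from privileged code. On non-ARM builds the helper falls back to a nanosecond clock so the sketch still compiles.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: sample a cycle-like counter. */
static inline uint32_t read_cycle_counter(void)
{
#if defined(__arm__) && defined(__ARM_ARCH) && (__ARM_ARCH >= 7)
    uint32_t cycles;
    /* DSB waits for outstanding memory accesses, ISB flushes the pipeline,
       then PMCCNTR is read through CP15 (c9, c13, 0). Assumes the counter
       is enabled and accessible from the current privilege level. */
    __asm__ volatile("dsb\n\t"
                     "isb\n\t"
                     "mrc p15, 0, %0, c9, c13, 0"
                     : "=r"(cycles) : : "memory");
    return cycles;
#else
    /* Fallback for non-ARM builds: nanoseconds, NOT cycles. */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint32_t)((uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec);
#endif
}

/* Time a single load. The result includes the cost of the two counter
   reads and their barriers, not just the load itself. */
static uint32_t time_one_load(const volatile uint8_t *addr)
{
    assert(addr != 0);
    uint32_t start = read_cycle_counter();
    (void)*addr;                       /* the access being timed */
    uint32_t stop = read_cycle_counter();
    return stop - start;               /* unsigned subtraction tolerates one wrap */
}
```

Note that a difference taken this way contains the barrier and counter-read cost on both sides of the load.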

I got about 240 cycles for cached access and about 350 for uncached access. I also used ISB, DMB and DSB. Do these numbers sound accurate to you? I can't seem to find official resources to compare with. Maybe you can point me in the right direction.

With your approach you are not measuring the latency; you are measuring the overhead.

A standard approach to measuring latencies is a pointer-chasing test: you initialize a chain of pointers so that each access depends on the previous one, and you control their placement so that they fit (or do not fit) in caches of specified sizes. The rest of the procedure is the same, except that you don't invalidate anything.

Something like this (for illustration, not tested):

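// prepare a chain of N pointers in a buffer
// uintptr_t (from <stdint.h>) is an integer type wide enough to hold a pointer
uintptr_t Buffer[N];
uintptr_t *p;
unsigned int i, Start, Stop;

// chain them, here in a simple direct fashion: each element points to the previous one.
// You can also use a randomized sequence if you work in main memory
for (i = 1; i < N; i++) { Buffer[i] = (uintptr_t) &Buffer[i-1]; }

// close the chain
Buffer[0] = (uintptr_t) &Buffer[N-1];

// measure M dependent accesses
// PMCCNTR() stands for whatever routine reads the cycle counter
Start = PMCCNTR();
p = &Buffer[0];
for (i = M; i > 0; i--) {
  p = (uintptr_t *) *p;   // the next address is the value just loaded
}
Stop = PMCCNTR();
// average latency per access is roughly (Stop - Start) / M;
// use p afterwards (e.g. print it) so the compiler keeps the loop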
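
The chain above is linked in address order, which a hardware prefetcher may be able to follow. A randomized variant could look like the sketch below. It is an illustration only: the names `build_random_chain`, `chase` and `ns_per_access` are made up here, the shuffle is Sattolo's algorithm (chosen because it always produces a single cycle through all slots), and `clock_gettime` is used in place of PMCCNTR so the code is portable, which means it reports nanoseconds rather than cycles.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Link n slots into one random cycle (Sattolo's algorithm): following the
   pointers visits every slot exactly once before returning to the start. */
static void build_random_chain(void **slots, size_t n)
{
    assert(n >= 2);
    for (size_t i = 0; i < n; i++)
        slots[i] = (void *)&slots[i];          /* start with self-pointers */
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;         /* j in [0, i), never i itself */
        void *tmp = slots[i];
        slots[i] = slots[j];
        slots[j] = tmp;
    }
}

/* Follow the chain for `steps` dependent loads and return the final slot. */
static void *chase(void **start, size_t steps)
{
    void **p = start;
    for (size_t i = 0; i < steps; i++)
        p = (void **)*p;                       /* next address = loaded value */
    return (void *)p;
}

static void *volatile chase_sink;              /* keeps the loop from being removed */

/* Average time per access in nanoseconds over `steps` dependent loads. */
static double ns_per_access(void **start, size_t steps)
{
    assert(steps > 0);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    chase_sink = chase(start, steps);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)steps;
}
```

The footprint of the chain is `n * sizeof(void *)` bytes, so choosing `n` decides which cache level the accesses should hit. Several slots share a cache line in this layout; padding each slot to a full line would be a stricter test.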

Measuring a single access is inaccurate because of measurement overhead and random interference. You should measure the time over a large number of accesses to get an amortized latency that better reflects what you want. To measure the average access time you also need to make sure these accesses do not run in parallel (that would measure throughput, not latency), so add a false dependency, for example by adding the content of the previously accessed byte to the next address (after initializing all these bytes to zero).
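
A minimal sketch of such a false dependency is shown below. The function name `dependent_walk` and its parameters are hypothetical, and the buffer size is assumed to be a power of two so that wrapping is a cheap mask. Every byte in the buffer is zero, so the address sequence is fixed, but the processor cannot compute the next address until the previous load has returned.

```c
#include <assert.h>
#include <stddef.h>

/* Walk a zero-filled buffer with a fixed stride. The loaded byte (always 0)
   is added to the next index, which serializes the loads. */
static size_t dependent_walk(const volatile unsigned char *buf, size_t size,
                             size_t stride, size_t accesses)
{
    /* size must be a non-zero power of two so the mask below wraps correctly */
    assert(size != 0 && (size & (size - 1)) == 0);
    size_t idx = 0;
    for (size_t i = 0; i < accesses; i++) {
        unsigned char v = buf[idx];                  /* always reads 0 */
        idx = (idx + stride + v) & (size - 1);       /* depends on the load */
    }
    return idx;                                      /* returned so the loop is kept */
}
```

Timing this loop and dividing by `accesses` gives an average that includes the add and mask on the dependency path, so it slightly overstates the raw load latency.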

Also, you didn't say how you were invalidating the address, but I'm guessing that you also evicted it from the L2 and are actually measuring only memory latency.
