
Memory latency measurement with time stamp counter

I have written the following code, which first flushes two array elements and then tries to read the elements in order to measure the hit/miss latencies.

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>
#include <time.h>
int main()
{
    /* create array */
    int array[ 100 ];
    int i;
    for ( i = 0; i < 100; i++ )
        array[ i ] = i;   // bring array to the cache

    uint64_t t1, t2, ov, diff1, diff2, diff3;

    /* flush the cache lines holding array[30] and array[70] */
    _mm_lfence();
    _mm_clflush( &array[ 30 ] );
    _mm_clflush( &array[ 70 ] );
    _mm_lfence();

    /* READ MISS 1 */
    _mm_lfence();           // fence to keep load order
    t1 = __rdtsc();         // set start time
    _mm_lfence();
    int tmp = array[ 30 ];   // read the first element => cache miss
    _mm_lfence();
    t2 = __rdtsc();         // set stop time
    _mm_lfence();

    diff1 = t2 - t1;        // two fence statements are overhead
    printf( "tmp is %d\ndiff1 is %lu\n", tmp, diff1 );

    /* READ MISS 2 */
    _mm_lfence();           // fence to keep load order
    t1 = __rdtsc();         // set start time
    _mm_lfence();
    tmp = array[ 70 ];      // read the second element => cache miss (or hit due to prefetching?!)
    _mm_lfence();
    t2 = __rdtsc();         // set stop time
    _mm_lfence();

    diff2 = t2 - t1;        // two fence statements are overhead
    printf( "tmp is %d\ndiff2 is %lu\n", tmp, diff2 );


    /* READ HIT*/
    _mm_lfence();           // fence to keep load order
    t1 = __rdtsc();         // set start time
    _mm_lfence();
    tmp = array[ 30 ];   // read the first element again => cache hit
    _mm_lfence();
    t2 = __rdtsc();         // set stop time
    _mm_lfence();

    diff3 = t2 - t1;        // two fence statements are overhead
    printf( "tmp is %d\ndiff3 is %lu\n", tmp, diff3 );


    /* measuring fence overhead */
    _mm_lfence();
    t1 = __rdtsc();
    _mm_lfence();
    _mm_lfence();
    t2 = __rdtsc();
    _mm_lfence();
    ov = t2 - t1;

    printf( "lfence overhead is %lu\n", ov );
    printf( "cache miss1 TSC is %lu\n", diff1-ov );
    printf( "cache miss2 (or hit due to prefetching) TSC is %lu\n", diff2-ov );
    printf( "cache hit TSC is %lu\n", diff3-ov );


    return 0;
}

And the output is:

# gcc -O3 -o simple_flush simple_flush.c
# taskset -c 0 ./simple_flush
tmp is 30
diff1 is 529
tmp is 70
diff2 is 222
tmp is 30
diff3 is 46
lfence overhead is 32
cache miss1 TSC is 497
cache miss2 (or hit due to prefetching) TSC is 190
cache hit TSC is 14
# taskset -c 0 ./simple_flush
tmp is 30
diff1 is 486
tmp is 70
diff2 is 276
tmp is 30
diff3 is 46
lfence overhead is 32
cache miss1 TSC is 454
cache miss2 (or hit due to prefetching) TSC is 244
cache hit TSC is 14
# taskset -c 0 ./simple_flush
tmp is 30
diff1 is 848
tmp is 70
diff2 is 222
tmp is 30
diff3 is 46
lfence overhead is 34
cache miss1 TSC is 814
cache miss2 (or hit due to prefetching) TSC is 188
cache hit TSC is 12

There are some problems with the output for reading array[70]. The TSC is neither a hit nor a miss. I had flushed that item just like array[30]. One possibility is that when array[40] is accessed, the HW prefetcher brings in array[70]. So, that should be a hit. However, the TSC is much more than a hit. You can verify that the hit TSC is about 20 when I try to read array[30] for the second time.

Even if array[70] is not prefetched, the TSC should be similar to a cache miss.

Is there any reason for that?
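One way to test the prefetching hypothesis (not part of the original code, just a sketch reusing the variables declared above) is to flush both lines again and then time array[70] before any other timed access that could trigger a prefetch of its line:

/* Sketch: flush both lines again, then time array[70] FIRST. If this
   timing now looks like a full miss, a prefetch triggered by the earlier
   access to array[30]'s line is the likely explanation. */
_mm_lfence();
_mm_clflush( &array[ 30 ] );
_mm_clflush( &array[ 70 ] );
_mm_lfence();

_mm_lfence();
t1 = __rdtsc();
_mm_lfence();
tmp = array[ 70 ];      // no earlier timed load could have prefetched this line
_mm_lfence();
t2 = __rdtsc();
_mm_lfence();
printf( "array[70] first: %lu\n", t2 - t1 );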

UPDATE1:

In order to perform the array read, I tried (void) *((int*)array+i) as suggested by Peter and Hadi.

In the output I see many negative results. I mean the overhead seems to be larger than the measured time for (void) *((int*)array+i).

UPDATE2:

I forgot to add volatile. The results are now meaningful.
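For reference, the timed access then looks roughly like this (a sketch that drops into the measurement blocks of the program above; only the load line changes):

_mm_lfence();
t1 = __rdtsc();
_mm_lfence();
(void) *((volatile int*)array + 30);   // volatile forces the load to happen; the value is discarded
_mm_lfence();
t2 = __rdtsc();
_mm_lfence();
diff1 = t2 - t1;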

First, note that the two calls to printf after measuring diff1 and diff2 may perturb the state of the L1D and even the L2. On my system, with printf, the reported values for diff3-ov range between 4 and 48 cycles (I've configured my system so that the TSC frequency is about equal to the core frequency). The most common values are those of the L2 and L3 latencies. If the reported value is 8, then we've got our L1D cache hit. If it is larger than 8, then most probably the preceding call to printf has kicked out the target cache line from the L1D and possibly the L2 (and in some rare cases, the L3!), which would explain the measured latencies that are higher than 8. @PeterCordes has suggested to use (void) *((volatile int*)array + i) instead of temp = array[i]; printf(temp). After making this change, my experiments show that most reported measurements for diff3-ov are exactly 8 cycles (which suggests that the measurement error is about 4 cycles), and the only other values that get reported are 0, 4, and 12. So Peter's approach is strongly recommended.
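One way to avoid the printf perturbation altogether (a sketch, not the answerer's exact code) is to store the measured differences and print them only after all the timed accesses are done, combined with the volatile-cast load suggested above:

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

int main()
{
    int array[ 100 ];
    int idx[ 3 ] = { 30, 70, 30 };   /* miss, miss (or prefetch hit), hit */
    uint64_t t1, t2, results[ 3 ];
    int i;

    for ( i = 0; i < 100; i++ )
        array[ i ] = i;              /* bring the array into the cache */

    _mm_lfence();
    _mm_clflush( &array[ 30 ] );
    _mm_clflush( &array[ 70 ] );
    _mm_lfence();

    for ( i = 0; i < 3; i++ )
    {
        _mm_lfence();
        t1 = __rdtsc();
        _mm_lfence();
        (void) *((volatile int*)array + idx[ i ]);
        _mm_lfence();
        t2 = __rdtsc();
        _mm_lfence();
        results[ i ] = t2 - t1;      /* store now, print later */
    }

    /* printing only here keeps printf from evicting the lines between measurements */
    for ( i = 0; i < 3; i++ )
        printf( "diff%d is %lu\n", i + 1, results[ i ] );

    return 0;
}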

In general, the main memory access latency depends on many factors, including the state of the MMU caches and the impact of the page table walkers on the data caches, the core frequency, the uncore frequency, the state and configuration of the memory controller and the memory chips with respect to the target physical address, uncore contention, and on-core contention due to hyperthreading. array[70] might be in a different virtual page (and physical page) than array[30], and the IPs of the load instructions and the addresses of the target memory locations may interact with the prefetchers in complex ways. So there can be many reasons why cache miss1 is different from cache miss2. A thorough investigation is possible, but it would require a lot of effort as you might imagine. Generally, if your core frequency is larger than 1.5 GHz (which is smaller than the TSC frequency on high-perf Intel processors), then an L3 load miss will take at least 60 core cycles. In your case, both miss latencies are over 100 cycles, so these are most likely L3 misses. In some extremely rare cases though, cache miss2 seems to be close to the L3 or L2 latency ranges, which would be due to prefetching.
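As a side note, comparing raw TSC differences against core-cycle latency figures requires knowing both the TSC frequency and the current core frequency. A small sketch of the conversion (the frequency values below are placeholders, not measured ones):

#include <stdio.h>
#include <stdint.h>

int main()
{
    /* Placeholder frequencies -- substitute the values for your own machine,
       e.g. derived from CPUID leaf 0x15 or reported by turbostat. */
    double tsc_ghz  = 2.6;    /* TSC (reference) frequency in GHz */
    double core_ghz = 3.2;    /* current core frequency in GHz */

    uint64_t ticks = 100;     /* a measured diff in TSC counts */

    double ns          = ticks / tsc_ghz;             /* elapsed wall-clock time in ns */
    double core_cycles = ticks * core_ghz / tsc_ghz;  /* the same interval in core cycles */

    printf( "%lu TSC ticks = %.1f ns = %.1f core cycles\n",
            (unsigned long) ticks, ns, core_cycles );
    return 0;
}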


I've determined that the following code gives a statistically more accurate measurement on Haswell:

/* assumes the variables of the question's program, plus:
   unsigned int dummy;  uint64_t loadlatency; */

t1 = __rdtscp(&dummy);
tmp = *((volatile int*)array + 30);
asm volatile ("add $1, %0\n\t"
              "add $1, %0\n\t"
              "add $1, %0\n\t"
              "add $1, %0\n\t"
              "add $1, %0\n\t"
              "add $1, %0\n\t"
              "add $1, %0\n\t"
              "add $1, %0\n\t"
              "add $1, %0\n\t"
              "add $1, %0\n\t"
              "add $1, %0\n\t"
          : "+r" (tmp));          // dependent add chain keeps the loaded value on the critical path before the closing rdtscp
t2 = __rdtscp(&dummy);
loadlatency = t2 - t1 - 60; // 60 is the measurement overhead

The probability that loadlatency is 4 cycles is 97%. The probability that loadlatency is 8 cycles is 1.7%. The probability that loadlatency takes other values is 1.3%. All of the other values are larger than 8 and multiples of 4. I'll try to add an explanation later.
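A rough sketch of how such a distribution can be collected (simplified: it histograms the raw rdtscp differences for a repeated L1D-hit load, without subtracting the overhead or adding the dependency chain used above):

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

#define SAMPLES  100000
#define MAXTICKS 64

int array[ 100 ];

int main()
{
    unsigned int dummy;
    uint64_t t1, t2, d;
    uint64_t hist[ MAXTICKS + 1 ] = { 0 };   /* hist[MAXTICKS] collects everything larger */
    volatile int tmp;
    long i;

    for ( i = 0; i < 100; i++ )
        array[ i ] = (int) i;

    for ( i = 0; i < SAMPLES; i++ )
    {
        t1 = __rdtscp( &dummy );
        tmp = *((volatile int*)array + 30);  /* an L1D hit after the first iteration */
        t2 = __rdtscp( &dummy );
        d  = t2 - t1;                        /* raw difference, overhead included */
        hist[ d < MAXTICKS ? d : MAXTICKS ]++;
    }

    for ( i = 0; i <= MAXTICKS; i++ )
        if ( hist[ i ] )
            printf( "%3ld ticks: %6.2f%%\n", i, 100.0 * hist[ i ] / SAMPLES );

    return 0;
}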

Some ideas:

  • Perhaps a[70] was prefetched into some level of cache besides L1?
  • Perhaps some optimization in DRAM causes this access to be fast, for instance maybe the row buffer is left open after accessing a[30].

You should investigate other accesses besides a[30] and a[70] to see if you get different numbers. E.g., do you get the same timings for a hit on a[30] followed by a[31] (which should be fetched in the same line as a[30], if you use aligned_alloc with 64-byte alignment)? And do other elements like a[69] and a[71] give the same timings as a[70]?
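A sketch of that experiment (assuming a C11 toolchain for aligned_alloc; the probe indices follow the suggestion above, and the timings are stored first and printed only afterwards):

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <x86intrin.h>

int main()
{
    /* 64-byte-aligned buffer so that a[0] starts a cache line; with 4-byte ints,
       a[30] and a[31] then share a line, and a[69], a[70], a[71] share another. */
    int *a = aligned_alloc( 64, 128 * sizeof *a );
    int probes[ 5 ] = { 30, 31, 69, 70, 71 };
    uint64_t t1, t2, ticks[ 5 ];
    int i;

    if ( a == NULL )
        return 1;
    for ( i = 0; i < 128; i++ )
        a[ i ] = i;

    _mm_lfence();
    _mm_clflush( &a[ 30 ] );
    _mm_clflush( &a[ 70 ] );
    _mm_lfence();

    for ( i = 0; i < 5; i++ )
    {
        _mm_lfence();
        t1 = __rdtsc();
        _mm_lfence();
        (void) *((volatile int*)a + probes[ i ]);
        _mm_lfence();
        t2 = __rdtsc();
        _mm_lfence();
        ticks[ i ] = t2 - t1;        /* store now, print after all probes */
    }

    for ( i = 0; i < 5; i++ )
        printf( "a[%d]: %lu ticks\n", probes[ i ], ticks[ i ] );

    free( a );
    return 0;
}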
