Cache line alignment optimization not reducing cache miss

I got this piece of code from http://blog.kongfy.com/2016/10/cache-coherence-sequential-consistency-and-memory-barrier/ that demonstrates how cache line alignment optimization works by reducing "false sharing".

Code:

/*
 * Demo program for showing the drawback of "false sharing"
 *
 * Use it with perf!
 *
 * Compile: g++ -O2 -o false_share false_share.cpp -lpthread
 * Usage: perf stat -e cache-misses ./false_share <loopcount> <is_aligned>
 */

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <sys/resource.h>

#define CACHE_ALIGN_SIZE 64
#define CACHE_ALIGNED __attribute__((aligned(CACHE_ALIGN_SIZE)))

int gLoopCount;

inline int64_t current_time()
{
  struct timeval t;
  if (gettimeofday(&t, NULL) < 0) {
    perror("gettimeofday");
  }
  return (static_cast<int64_t>(t.tv_sec) * static_cast<int64_t>(1000000) + static_cast<int64_t>(t.tv_usec));
}

struct value {
  int64_t val;
};
value data[2] CACHE_ALIGNED;

struct aligned_value {
  int64_t val;
} CACHE_ALIGNED;
aligned_value aligned_data[2] CACHE_ALIGNED;

void* worker1(int64_t *val)
{
  printf("worker1 start...\n");

  volatile int64_t &v = *val;
  for (int i = 0; i < gLoopCount; ++i) {
    v += 1;
  }

  printf("worker1 exit...\n");
  return NULL;
}

// duplicate worker function for perf report
void* worker2(int64_t *val)
{
  printf("worker2 start...\n");

  volatile int64_t &v = *val;
  for (int i = 0; i < gLoopCount; ++i) {
    v += 1;
  }

  printf("worker2 exit...\n");
  return NULL;
}

int main(int argc, char *argv[])
{
  pthread_t race_thread_1;
  pthread_t race_thread_2;

  bool is_aligned;

  /* Check arguments to program*/
  if(argc != 3) {
    fprintf(stderr, "USAGE: %s <loopcount> <is_aligned>\n", argv[0]);
    exit(1);
  }

  /* Parse argument */
  gLoopCount = atoi(argv[1]); /* Don't bother with format checking */
  is_aligned = atoi(argv[2]); /* Don't bother with format checking */

  printf("size of unaligned data : %zu\n", sizeof(data));
  printf("size of aligned data   : %zu\n", sizeof(aligned_data));

  void *val_0, *val_1;
  if (is_aligned) {
    val_0 = (void *)&aligned_data[0].val;
    val_1 = (void *)&aligned_data[1].val;
  } else {
    val_0 = (void *)&data[0].val;
    val_1 = (void *)&data[1].val;
  }

  int64_t start_time = current_time();

  /* Start the threads */
  pthread_create(&race_thread_1, NULL, (void* (*)(void*))worker1, val_0);
  pthread_create(&race_thread_2, NULL, (void* (*)(void*))worker2, val_1);

  /* Wait for the threads to end */
  pthread_join(race_thread_1, NULL);
  pthread_join(race_thread_2, NULL);

  int64_t end_time = current_time();

  printf("time : %lld us\n", (long long)(end_time - start_time));

  return 0;
}

Expected perf results:

[jingyan.kfy@OceanBase224006 work]$ perf stat -e cache-misses ./false_share 100000000 0
size of unaligned data : 16
size of aligned data   : 128
worker2 start...
worker1 start...
worker1 exit...
worker2 exit...
time : 452451 us

 Performance counter stats for './false_share 100000000 0':

         3,105,245 cache-misses

       0.455033803 seconds time elapsed

[jingyan.kfy@OceanBase224006 work]$ perf stat -e cache-misses ./false_share 100000000 1
size of unaligned data : 16
size of aligned data   : 128
worker1 start...
worker2 start...
worker1 exit...
worker2 exit...
time : 326994 us

 Performance counter stats for './false_share 100000000 1':

            27,735 cache-misses

       0.329737667 seconds time elapsed

However, I ran the code myself and got very close run times, and the cache miss count is even lower when unaligned:

My results:

$ perf stat -e cache-misses ./false_share 100000000 0
size of unaligned data : 16
size of aligned data   : 128
worker1 start...
worker2 start...
worker2 exit...
worker1 exit...
time : 169465 us

 Performance counter stats for './false_share 100000000 0':

            37,698      cache-misses:u                                              

       0.171625603 seconds time elapsed

       0.334919000 seconds user
       0.001988000 seconds sys


$ perf stat -e cache-misses ./false_share 100000000 1
size of unaligned data : 16
size of aligned data   : 128
worker2 start...
worker1 start...
worker2 exit...
worker1 exit...
time : 118798 us

 Performance counter stats for './false_share 100000000 1':

            38,375      cache-misses:u                                              

       0.121072715 seconds time elapsed

       0.230043000 seconds user
       0.001973000 seconds sys

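How should I understand this inconsistency?

It's hard to help since the blog you reference is in Chinese. Still, I noticed that the first figure seems to show a multi-socket architecture. So I ran a few experiments.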

a) My PC, Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz, single socket, two cores, two threads per core:

0:

time : 195389 us

 Performance counter stats for './a.out 100000000 0':

             8 980      cache-misses:u                                              

       0,198584628 seconds time elapsed

       0,391694000 seconds user
       0,000000000 seconds sys

and 1:

time : 191413 us

 Performance counter stats for './a.out 100000000 1':

             9 020      cache-misses:u                                              

       0,192953853 seconds time elapsed

       0,378434000 seconds user
       0,000000000 seconds sys

Not much of a difference.

b) Now a 2-socket workstation

Thread(s) per core:  2
Core(s) per socket:  12
Socket(s):           2
NUMA node(s):        2
Model name:          Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz

0:

time : 454679 us

 Performance counter stats for './a.out 100000000 0':

         5,644,133      cache-misses                                                

       0.456665966 seconds time elapsed

       0.738173000 seconds user

1:

time : 346871 us

 Performance counter stats for './a.out 100000000 1':

            42,217      cache-misses                                                

       0.348814583 seconds time elapsed

       0.539676000 seconds user
       0.000000000 seconds sys

The difference is huge.


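A final remark. You write:

the cache miss count is even lower when unaligned

No, it isn't. Your processor is running all sorts of tasks besides your program. Also, you are running 2 threads, which may access the cache in a different order from run to run. All of this can affect cache utilization. You need to repeat the measurement several times and compare. Personally, when I see a difference of less than 10% in any performance results, I consider them indistinguishable.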
