
Different cache miss count for the same program between multiple runs

I am using Cachegrind to retrieve the number of cache misses of a static program compiled without libc (just a _start that calls my main function and an exit syscall in asm). The program is fully deterministic: the instructions and the memory references do not change from one run to another. The cache is fully associative with LRU as the replacement policy.

However, I noticed that the number of misses sometimes changes. More specifically, the number of misses is always the same until I go to a different directory:

 % cache=8 && valgrind --tool=cachegrind --I1=$((cache * 64)),$cache,64 --D1=$((cache * 64)),$cache,64 --L2=262144,4096,64 ./adpcm        
 ...
 ==31352== I   refs:      216,145,010
 ...
 ==31352== D   refs:      130,481,003  (95,186,001 rd   + 35,295,002 wr)
 ==31352== D1  misses:        240,004  (   150,000 rd   +     90,004 wr)
 ==31352== LLd misses:             31  (        11 rd   +         20 wr)

And if I execute the same command again and again, I keep getting the same results. But if I run the program from a different directory:

 % cd ..
 % cache=8 && valgrind --tool=cachegrind --I1=$((cache * 64)),$cache,64 --D1=$((cache * 64)),$cache,64 --L2=262144,4096,64 ./malardalen2/adpcm
 ...
 ==31531== I   refs:      216,145,010
 ...
 ==31531== D   refs:      130,481,003  (95,186,001 rd   + 35,295,002 wr)
 ==31531== D1  misses:        250,004  (   160,000 rd   +     90,004 wr)
 ==31531== LLd misses:             31  (        11 rd   +         20 wr)

And I get yet another result from a different directory.

I've also done some experiments with a Pin tool, and with that one I don't need to change the directory to get different values. But the set of possible values seems to be very limited, and it is exactly the same as with Cachegrind.

My question is: what could be the sources of such differences?

My first hint is that my program is not aligned the same way in memory, and as a consequence, some variables that were stored in the same cache line in a previous run no longer are. That could also explain the limited number of combinations. But I thought that Cachegrind (and Pin) used virtual addresses, and I would assume that the OS (Linux) always gives the same virtual addresses. Any other ideas?

Edit: As you can guess from reading the LLd misses, the program only uses 31 different cache lines, while the cache can only hold 8 lines. So even on real hardware, the difference can't be explained by the cache already being populated on a second run (at most, only 8 lines could stay in the L1).

Edit 2: Cachegrind's report is not based on actual cache misses (as given by performance counters) but is the result of a simulation: it simulates the behavior of a cache in order to count the number of misses. Since the consequences are only temporal, that's totally fine, and it allows changing the cache properties (size, associativity).

Edit 3: The hardware I am using is an Intel Core i7 on Linux 3.2 x86_64. The compile flags are -static and, for some programs, -nostdlib (IIRC, I'm not at home right now).

Linux implements the "Address space layout randomization" technique ( http://en.wikipedia.org/wiki/Address_space_layout_randomization ) for security reasons. You can deactivate this behavior like this:

echo -n "0" > /proc/sys/kernel/randomize_va_space

You can test that through this example:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    char a;
    /* the cast avoids passing a pointer where %u expects unsigned int */
    printf("%u\n", (unsigned)(uintptr_t)&a);
    return 0;
}

You should always have the same value printed.

Before:

 % ./a.out
4006500239
 % ./a.out
819175583
 % ./a.out
2443759599
 % ./a.out
2432498159

After:

 % ./a.out
4294960207
 % ./a.out
4294960207
 % ./a.out
4294960207
 % ./a.out
4294960207

That also explains the different cache miss counts, since two variables that were in the same line can now be in two different lines.

Edit: Apparently this does not entirely solve the problem, but I think it was one of the reasons. I'll give the bounty to anyone who can help me resolve this issue.

It seems this is a known behavior in valgrind:

I used the example above that prints the address of a stack variable, and I also disabled the layout randomization.

I ran the executable twice getting the same results in both runs:

==15016== D   refs:       40,649  (28,565 rd   + 12,084 wr)
==15016== D1  misses:     11,465  ( 8,412 rd   +  3,053 wr)
==15016== LLd misses:      1,516  ( 1,052 rd   +    464 wr)
==15016== D1  miss rate:    28.2% (  29.4%     +   25.2%  )
==15016== LLd miss rate:     3.7% (   3.6%     +    3.8%  )

villar@localhost ~ $ cache=8 && valgrind --tool=cachegrind --I1=$((cache * 64)),$cache,64 --D1=$((cache * 64)),$cache,64 --L2=262144,4096,64 ./a.out 

==15019== D   refs:       40,649  (28,565 rd   + 12,084 wr)
==15019== D1  misses:     11,465  ( 8,412 rd   +  3,053 wr)
==15019== LLd misses:      1,516  ( 1,052 rd   +    464 wr)
==15019== D1  miss rate:    28.2% (  29.4%     +   25.2%  )
==15019== LLd miss rate:     3.7% (   3.6%     +    3.8%  )

According to the cachegrind documentation ( http://www.cs.washington.edu/education/courses/cse326/05wi/valgrind-doc/cg_main.html ):

Another thing worth noting is that results are very sensitive. Changing the size of the valgrind.so file, the size of the program being profiled, or even the length of its name can perturb the results. Variations will be small, but don't expect perfectly repeatable results if your program changes at all. While these factors mean you shouldn't trust the results to be super-accurate, hopefully they should be close enough to be useful.

After reading this, I changed the file name and got the following:

villar@localhost ~ $ mv a.out a.out2345345345
villar@localhost ~ $ cache=8 && valgrind --tool=cachegrind --I1=$((cache * 64)),$cache,64 --D1=$((cache * 64)),$cache,64 --L2=262144,4096,64 ./a.out2345345345 

==15022== D   refs:       40,652  (28,567 rd   + 12,085 wr)
==15022== D1  misses:     10,737  ( 8,201 rd   +  2,536 wr)
==15022== LLd misses:      1,517  ( 1,054 rd   +    463 wr)
==15022== D1  miss rate:    26.4% (  28.7%     +   20.9%  )
==15022== LLd miss rate:     3.7% (   3.6%     +    3.8%  )

Changing the name back to "a.out" gave me exactly the same result as before.

Notice that changing the file name, or the path to it, changes the base of the stack! This may be the cause, given what Mr. Evgeny said in an earlier comment:

When you change the current working directory, you also change the corresponding environment variable (and its length). Since a copy of all environment variables is usually stored just above the stack, you get a different allocation for stack variables and a different number of cache misses. (And the shell could change some other variables besides PWD.)

EDIT: The documentation also says:

Program start-up/shut-down calls a lot of functions that aren't interesting and just complicate the output. Would be nice to exclude these somehow.

The simulated cache may also be tracking the program's start-up and shut-down code, which could be the source of the variations.
