

How do I know on which cache level my array is stored?

I understand that spatial and temporal locality have an enormous impact on performance. What I don't understand is how my data structures are stored in these caches. For simplicity, assume the L1 cache holds 8 bytes, the L2 16 bytes, and the L3 32 bytes. Does that mean that if we have:

std::array<double, 1> x = {1.}; 
std::array<double, 2> y = {1.,2.}; 
std::array<double, 4> z = {1.,2.,3.,4.};

and some function accesses these arrays, will x be loaded into the L1 cache, y into L2, and z into L3? Or will y, for example, be split across the L1 and L2 caches?

Will splitting these arrays manually yield better cache locality? If I do, for example, something like:

std::array<std::array<double,2>,2> z;

will z be split across the cache levels when a function accesses it?

What about cache lines? These are usually 64 bytes long. Will splitting my arrays into arrays of 64-byte arrays yield better access speed?

std::array<std::array<double,8>,2> u;

I find this subject quite confusing and would appreciate any help.

You are thinking about the caches the wrong way.

You can only see which cache holds your data with special tools (the Intel debugger comes to mind), and the results will be specific to your particular run and architecture. Changing the processor can break your setup fairly easily.

That said, you can try to use solutions that are cache friendly.

The way the caches work is this: say you want to read x[0]. Your program issues a request for the memory location associated with it. That request is intercepted by L1. If L1 can supply the value (because it is in a block L1 already stores), it will. If not, the request is intercepted by L2, and so on. If no cache level has that block, it is requested from RAM.

Now, it is inefficient to read just the 8 bytes of one double from RAM, because each request has overhead. So you actually read an L3-block-sized chunk from RAM that includes the bytes you want. It can happen that you have to read two blocks because your data is split between them (compilers try to avoid this). An L2-block-sized chunk is then sent to the L2 cache to be stored, and an L1-sized chunk to L1, all of them containing the bytes you want (which might sit somewhere in the middle). For the next request (say x[1]) the same thing happens. If the next access is close to the last one, you will likely get the result from L1. I say likely because your program might have been suspended and resumed on a different core or processor, which has a different L1.

Trying to design for a specific setup is usually a bad idea (unless you really need those last few percent of performance and you have already tried everything else).

The rule of thumb is to keep accessing memory locations that are next to each other. The thing to avoid is accessing a few bytes that are far apart. Going through an array is very fast. Try implementing a linear search and a binary search over the same sorted array, and see how long the array needs to be before the binary search gives you significantly better performance (last time I tried, it was somewhere above 100 ints).

In your example, if you first access all elements of x, then move on to y, and so on, the setup is good. If instead you are accessing x[i], y[i], z[i], then x[i+1], y[i+1], z[i+1], then having a struct with {x, y, z} and putting that in an array would probably be better (you need to benchmark to know for sure).

And some function accesses these arrays, will x be loaded into the L1 cache, y into L2 and z into L3? Or will y, for example, be split across the L1 and L2 caches?

They will all be loaded into all of the L1, L2, and L3 caches as you access them. If you access them often enough, you will get them from a closer (faster) cache level.

Will splitting these arrays manually yield better cache locality?

No. The processor's memory subsystem handles the splits. Cache locality depends on how often you access a particular part of memory. It is better to have all the accesses bunched up in time rather than spread out.

What about cache lines? These are usually 64 bytes long. Will splitting my arrays into arrays of 64-byte arrays yield better access speed?

No, you likely won't see any difference. The arrays are split across cache lines automatically by the processor's memory subsystem. And again, don't over-optimize for your current processor architecture; the CPU you buy tomorrow might have cache lines twice as long out of the box.

