
Cache prefetching scenario - Power Architecture

Original post 2013-06-14 14:08:50 4 1 c++/ performance/ caching/ powerpc/ prefetch

I'm using the dcbt instruction (via inline asm) to touch a range of memory that I know will be needed for certain computations. My profiler shows a pattern of cache misses caused by sporadic access to elements inside this range (4 touched, 5 skipped, and so on, producing a cache miss on every 5th operation).

There is a function A() that has access to the exact range and its size. A() is called before another section, B(), that will also touch and use data from the range A() utilizes. Can I just use dcbt inside A() and then expect an improvement in B(), or do I have to use dcbt on the range in the same function that uses that data?
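For concreteness, here is a rough sketch of the setup described above. It is not my actual code: touch_range, the bodies of A() and B(), and the 128-byte line size are all assumptions for illustration, and it uses the GCC/Clang __builtin_prefetch builtin in place of hand-written inline asm (as far as I know it lowers to dcbt for read hints on PowerPC, but check the generated assembly).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed cache-line size in bytes; the real value depends on the core.
constexpr std::uintptr_t kCacheLine = 128;
static_assert((kCacheLine & (kCacheLine - 1)) == 0, "line size must be a power of two");

// Hypothetical helper: issue one read-prefetch hint per cache line that
// overlaps [p, p + bytes). Returns the number of hints issued.
inline std::size_t touch_range(const void* p, std::size_t bytes) {
    assert(p != nullptr || bytes == 0);
    if (bytes == 0) return 0;
    const auto addr = reinterpret_cast<std::uintptr_t>(p);
    const std::uintptr_t first = addr & ~(kCacheLine - 1);  // align start down
    const std::uintptr_t end = addr + bytes;
    std::size_t hints = 0;
    for (std::uintptr_t a = first; a < end; a += kCacheLine) {
        __builtin_prefetch(reinterpret_cast<const void*>(a), /*rw=*/0, /*locality=*/3);
        ++hints;
    }
    return hints;
}

// A(): knows the exact range and its size, so it touches the range here.
std::size_t A(const std::vector<float>& data) {
    return touch_range(data.data(), data.size() * sizeof(float));
}

// B(): runs later and actually consumes the data (a plain sum as a stand-in).
float B(const std::vector<float>& data) {
    float sum = 0.0f;
    for (float v : data) sum += v;
    return sum;
}
```

The question is whether the hints issued inside A() still pay off by the time B() walks the same range.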

1 answer

Assuming ALL the data used in A() fits in the cache (and so is still resident when B() runs), you should see an improvement in B() too. However, if your pattern is as sporadic as you say, you can also end up reading data into the cache that is never used. That serves no purpose and just keeps the memory bus busy when it could be loading data that is actually needed. By all means give it a try, but don't expect it to work like magic: it often takes a bit of tuning, particularly with regard to how far ahead of your current position you prefetch the data.
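To make the "how far ahead" tuning concrete, here is a minimal sketch. The function sum_selected, the index list idx and the distance parameter are all made up for illustration, and I'm using the GCC/Clang __builtin_prefetch builtin rather than raw dcbt. It only hints the elements that will actually be used, so skipped elements don't cost bus traffic, and distance is the knob you would tune with your profiler.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums data[idx[i]] for every i, hinting the element that will be needed
// `distance` iterations from now. Only elements listed in idx are hinted.
float sum_selected(const std::vector<float>& data,
                   const std::vector<std::size_t>& idx,
                   std::size_t distance) {
    float sum = 0.0f;
    const std::size_t n = idx.size();
    for (std::size_t i = 0; i < n; ++i) {
        // Same as i + distance < n, written so it cannot overflow.
        if (distance < n - i) {
            __builtin_prefetch(&data[idx[i + distance]], /*rw=*/0, /*locality=*/3);
        }
        assert(idx[i] < data.size());
        sum += data[idx[i]];
    }
    return sum;
}
```

Too small a distance and the data has not arrived when you need it; too large and it may already have been evicted. The result is the same for any distance, so you can sweep the value and compare timings.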

Depending on the exact behaviour of A() and B(), batching can also help. For example, if you are switching between reads and writes, reading from one section and writing to a completely different section, it is often a good plan to batch the writes into a "holding area" that is then copied out to RAM. Make the holding area something like 1/8 to 1/4 of the L1 cache.
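A rough sketch of the holding-area idea, under stated assumptions: the 32 KiB L1 size, the name transform_batched and the "multiply by 2" computation are placeholders, so substitute your own cache size and work.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kL1Bytes = 32 * 1024;  // assumed L1 data cache size
constexpr std::size_t kHoldFloats = (kL1Bytes / 8) / sizeof(float);  // 1/8 of L1

// Reads src, computes into a small holding area, then copies each full
// batch to dst in one go instead of interleaving reads and scattered writes.
void transform_batched(const std::vector<float>& src, std::vector<float>& dst) {
    dst.resize(src.size());
    std::array<float, kHoldFloats> hold;  // the "holding area"
    for (std::size_t base = 0; base < src.size(); base += kHoldFloats) {
        const std::size_t count = std::min(kHoldFloats, src.size() - base);
        for (std::size_t i = 0; i < count; ++i) {
            hold[i] = src[base + i] * 2.0f;  // placeholder computation
        }
        std::copy_n(hold.begin(), count, dst.data() + base);
    }
    assert(dst.size() == src.size());
}
```

Whether 1/8 or 1/4 of L1 works better is something to measure; the point is that the holding area stays small enough to sit in cache alongside the data being read.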

[Caveat: I have no experience at all with the PowerPC architecture, but I have used cache prefetching and other memory optimisation techniques in my work with x86 processors, with some success at times and not so much at others.]