
How to clear L1, L2 and L3 caches?

I am doing some cache performance measurements and I need to ensure the caches are empty of "useful" data before timing.

Assuming an L3 cache of 10MB, would it suffice to create a vector of 10MB / 4 bytes per float = 2,500,000 floats, iterate through the whole of this vector, sum the numbers, and thereby evict from the cache any data that was in it prior to the traversal?

Yes, that should be sufficient for flushing the L3 cache of useful data.
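For concreteness, here is a minimal sketch of that approach (the 10MiB size, the evict_l3 name, and the 64-byte line size are my assumptions for illustration; query the actual L3 size on your machine):

    #include <cstddef>
    #include <vector>

    volatile float sink; // volatile sink keeps the compiler from eliding the reads

    // Read every element of a buffer at least as large as the L3 so its
    // lines displace whatever the cache held before.
    void evict_l3(std::size_t l3_bytes = 10 * 1024 * 1024) // assumed L3 size
    {
        std::vector<float> buf(l3_bytes / sizeof(float), 1.0f);
        float sum = 0.0f;
        for (float v : buf) // each 4-byte float pulls in its 64-byte line
            sum += v;
        sink = sum;         // publish the result so the loop isn't dead code
    }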

I have done similar measurements and cross-checked them using Intel's cache counters, verifying that I incur the expected number of L3 cache misses during my tests.

If you want to be absolutely sure, you should also use the counters. In particular, you can measure last-level cache misses using event select 2EH, umask 41H on most Intel architectures.

See the Intel Manual for details on these counters.
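On Linux, one way to read such a counter programmatically is perf_event_open. A rough sketch, assuming the kernel's generic PERF_COUNT_HW_CACHE_MISSES event (which on Intel is mapped to the architectural LLC-miss event, i.e. 2EH/41H):

    #include <cstdio>
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    // Thin wrapper: glibc provides no perf_event_open() function.
    static long perf_open(perf_event_attr* attr, pid_t pid, int cpu,
                          int group_fd, unsigned long flags)
    {
        return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main()
    {
        perf_event_attr attr{};
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CACHE_MISSES; // last-level cache misses
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        int fd = static_cast<int>(perf_open(&attr, 0, -1, -1, 0));
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        // ... run the timed code here ...

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        long long misses = 0;
        read(fd, &misses, sizeof(misses));
        printf("LLC misses: %lld\n", misses);
        close(fd);
    }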

It depends on how insane you are trying to be to get your guarantee.

x86_64 L3 caches are physically indexed, and while a 10MiB chunk that is linear in virtual address space is almost certainly going to be physically contiguous on a lightly memory-loaded machine, that is not guaranteed.

Sandy and Ivy Bridge, for example, have L3 cache in 2MiB slices with 16-way set associativity (128kiB stride), so you could guarantee physical coverage by doing a MAP_HUGETLB mmap() call, assuming standard 2-4MiB huge pages.

Also, since each slice (on Sandy/Ivy Bridge at least) is attached to a different core, and the slice a given physical address maps to is determined by a hash of some low/middle-order address bits, you might have to make the array slightly larger than the size of the L3 to compensate for slightly uneven slice coverage.

At this point, scrubbing your array a few times linearly should do the trick.
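A sketch of that more paranoid variant, under a few stated assumptions (2MiB huge pages reserved beforehand, e.g. via /proc/sys/vm/nr_hugepages; the 1.5x over-allocation factor and the three passes are arbitrary choices of mine):

    #include <cstddef>
    #include <cstdio>
    #include <sys/mman.h>

    volatile char sink_byte;

    int main()
    {
        const std::size_t huge = 2u << 20;  // 2MiB huge page
        const std::size_t l3   = 10u << 20; // assumed L3 size
        // Over-allocate ~1.5x and round up to a whole number of huge pages.
        const std::size_t len  = (l3 * 3 / 2 + huge - 1) & ~(huge - 1);

        char* buf = static_cast<char*>(
            mmap(nullptr, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0));
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        // First touch with writes to fault every huge page in.
        for (std::size_t i = 0; i < len; i += 64)
            buf[i] = 1;
        // Then scrub linearly a few times, one touch per 64-byte line.
        char acc = 0;
        for (int pass = 0; pass < 3; ++pass)
            for (std::size_t i = 0; i < len; i += 64)
                acc += buf[i];
        sink_byte = acc;

        munmap(buf, len);
    }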

Another option is to use the dedicated cache-invalidation instructions that some ISAs provide. x86, for example, has wbinvd for this purpose (or clflush for a single line).

http://x86.renejeschke.de/html/file_module_x86_id_325.html

One problem is that wbinvd requires ring-0 permissions. Another is that it doesn't guarantee the flush is completed prior to any serialization point, so it's not good enough for guaranteeing persistence, but it may be enough for benchmarking, as long as you can prevent the ensuing write-backs from eating up your memory bandwidth.

If you can overcome these issues, it may in some cases be a better solution than walking over some large data structure just to make sure the cache is flushed: some CPUs may decide not to cache fetches they believe will not be reused (there are several papers about such policies, and at least some claims that they are implemented in real CPUs), so a plain traversal isn't guaranteed to displace everything.
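For an unprivileged middle ground, here is a sketch that flushes each line of a buffer you own with the SSE2 clflush intrinsic (unlike wbinvd this needs no ring 0, but it only evicts data you can name, not the whole cache; 64-byte lines assumed):

    #include <cstddef>
    #include <emmintrin.h>

    void flush_buffer(const void* p, std::size_t bytes)
    {
        const char* c = static_cast<const char*>(p);
        for (std::size_t i = 0; i < bytes; i += 64) // one clflush per line
            _mm_clflush(c + i);
        _mm_mfence(); // order the flushes before subsequent timed accesses
    }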
