When I have per-CPU data structures, does it improve performance to have them on different pages?
I have a small struct of per-CPU data in a Linux kernel module, where each CPU frequently writes and reads its own data. I know that I need to make sure these items of data aren't on the same cache line, because if they were, the cores would be forever dirtying each other's caches.

However, is there anything at the page level that I need to worry about from an SMP performance point of view? That is, would there be any performance impact from padding these per-CPU structures out to 4096 bytes and aligning them?

This is on Linux 2.6 on x86_64.

(Points about whether it's worth optimising, and suggestions that I go benchmark it, aren't needed -- what I'm looking for is whether there's any theoretical basis for worrying about page alignment.)
Within a single NUMA node, different pages are only helpful if you want to apply different permissions, or map them individually into processes. For performance purposes, being on different cache lines is sufficient.
On NUMA architectures, you may want to place a CPU's per-CPU structure on a page that is local to that CPU's node - but you still wouldn't pad the structure out to a page size to achieve that, because you can place the structures for multiple CPUs within the same NUMA node on the same page.
Even on a NUMA system, you probably won't benefit much by allocating memory pages local to each CPU (use kmalloc_node(), if you're curious). Node-local memory will be faster, but only in the case where it misses at all cache levels. For anything used with any frequency, you probably won't be able to tell the difference. If you're allocating megabytes of CPU-local data, then it probably makes sense to allocate pages local to each CPU.
percpu generally makes sure that they don't share a cache line. Otherwise, commits like 7489aec8eed4f2f1eb3b4d35763bd3ea30b32ef5 would have been pretty useless.