
std::vector reserve & resize NUMA locality

I'm currently looking into optimizing the NUMA locality of my application.

So far I think I understand that memory becomes resident on the NUMA node whose CPU first touches it after allocation.

My questions regarding std::vector (using the default allocator) are:

  • std::vector::reserve allocates new memory - but does it also touch it? If not, how can I force it to be touched after a call to reserve? (See the sketch after this list.)
  • does std::vector::resize touch the memory?
  • what about the constructor that takes a size_t?
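
For concreteness, here is the pattern I'm currently experimenting with - a minimal sketch, under my assumption that the value-initialising writes done by resize are what count as the first touch, so I do the resize inside the worker thread instead of reserving up front in the main thread:

    #include <cstddef>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<double> data;
        const std::size_t n = 100 * 1000 * 1000;

        std::thread worker([&] {
            // resize value-initialises the elements, i.e. writes to them -
            // my assumption is that this write is the "first touch", so doing
            // it here should place the pages on this thread's NUMA node.
            data.resize(n);

            for (double& x : data) x *= 2.0;   // ...then the real work, same thread
        });
        worker.join();
    }

If reserve on its own turns out not to touch the pages, I don't see a portable way to touch the reserved-but-unused capacity, which is why this sketch sizes the vector from inside the thread.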

And about NUMA in general:

  • If memory that has already been touched is paged out to disk and then accessed again, generating a hard fault, does that count as a new first touch, or is the page loaded back into memory that is resident on the NUMA node that originally touched it first?

  • I'm using C++11 threads. As long as I'm inside a thread and allocating/touching new memory, can I be sure that all of this memory will be resident on the same NUMA node, or is it possible that the OS switches the executing CPU underneath my thread while it runs, so that some of my allocations end up in one NUMA domain and others in another? (A pinning sketch I'm considering follows below.)
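
In case it helps to be concrete, this is roughly how I would pin the thread to one CPU to rule out migration - a sketch assuming Linux with pthreads underneath std::thread (pthread_setaffinity_np is a non-portable GNU extension):

    #include <pthread.h>   // pthread_setaffinity_np (g++ on glibc defines _GNU_SOURCE)
    #include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET
    #include <thread>
    #include <vector>

    int main() {
        std::thread worker([] {
            // Pin this thread to CPU 2 (an arbitrary choice for the sketch)
            // before doing any allocation or first touch.
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(2, &set);
            pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

            // From here on, allocations and first touches happen on CPU 2,
            // so they should all land on CPU 2's NUMA node.
            std::vector<double> data(50000000);
            for (double& x : data) x = 1.0;
        });
        worker.join();
    }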

Assuming we're talking about Intel CPUs: on their Nehalem-vintage CPUs, if you had two such CPUs there was a power-on option for telling them how to divide up physical memory between them. The physical architecture is two CPUs connected by QPI, with each CPU controlling its own set of memory DIMMs. The options are:

  1. first half of the physical address space on one CPU, second half on the other, or

  2. memory pages alternating between the two CPUs

For the first option, if you allocated a piece of memory it would be down to the OS where in the physical address space it took that from, and then I suppose a good scheduler would endeavour to run the threads accessing that memory on the CPU controlling it. For the second option, if you allocated several pages of memory they would be split between the two physical CPUs, and then it wouldn't really matter what the scheduler did with the threads accessing them. I actually played around with this briefly and couldn't really spot the difference; Intel had done a good job on QPI. I'm less familiar with newer Intel architectures, but I'm assuming it's more of the same.
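
If you want the same choice per allocation rather than as a power-on setting, libnuma on Linux exposes it directly. A minimal sketch (assuming the numactl development headers are installed and you link with -lnuma):

    #include <numa.h>      // numa_available, numa_alloc_onnode, numa_alloc_interleaved, numa_free
    #include <cstddef>
    #include <cstdio>
    #include <cstring>

    int main() {
        if (numa_available() < 0) {              // no NUMA support in the kernel/library
            std::puts("NUMA not available");
            return 1;
        }
        const std::size_t bytes = 64UL * 1024 * 1024;

        // Roughly option 1: keep the whole allocation on one node's memory.
        void* on_node0 = numa_alloc_onnode(bytes, 0);

        // Roughly option 2: interleave the pages round-robin across all nodes.
        void* spread = numa_alloc_interleaved(bytes);

        // The placement policy takes effect when the pages are actually (first) touched.
        std::memset(on_node0, 0, bytes);
        std::memset(spread, 0, bytes);

        numa_free(on_node0, bytes);
        numa_free(spread, bytes);
    }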

The other question really is what you mean by a NUMA node. If we are referring to modern Intel and AMD CPUs, these present a synthesized SMP environment to software, using things like QPI / HyperTransport (and now their modern equivalents) on top of what is really a NUMA hardware architecture. So when talking about NUMA locality, it's really a question of whether the OS scheduler will run a thread on a core of the CPU that controls the RAM the thread is accessing (SMP meaning that it can run on any core and still access the memory, perhaps with slight latency differences, no matter where in physical memory it was allocated). I don't know the answer, but I think some schedulers will do that. Certainly the efforts I've made to use core affinity for threads and memory have yielded only a tiny improvement over just letting the OS (Linux 2.6) do its thing. And the cache systems on modern CPUs and their interaction with inter-CPU interconnects like QPI are very clever.
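
If you want to see whether locality actually worked out in a given run, on Linux you can ask the kernel which node each page ended up on. A rough sketch using move_pages() in query mode (passing a null nodes array just reports placement; assumes <numaif.h> is available and you link with -lnuma):

    #include <numaif.h>    // move_pages()
    #include <unistd.h>    // sysconf()
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Report the NUMA node of each page backing [buf, buf + bytes).
    // Pages that have never been touched come back with a negative status.
    void print_page_nodes(void* buf, std::size_t bytes) {
        const long page = sysconf(_SC_PAGESIZE);
        const std::size_t n = (bytes + page - 1) / page;

        std::vector<void*> pages(n);
        std::vector<int> status(n);
        for (std::size_t i = 0; i < n; ++i)
            pages[i] = static_cast<char*>(buf) + i * page;

        // nodes == nullptr: don't move anything, just fill status[] with node ids.
        if (move_pages(0, n, pages.data(), nullptr, status.data(), 0) == 0)
            for (std::size_t i = 0; i < n; ++i)
                std::printf("page %zu -> node %d\n", i, status[i]);
    }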

Older OSes, dating back to when SMP really was pure hardware SMP, wouldn't have known to do that.

Small rabbit hole - if we are referring to a pure NUMA system (Transputers, or the Cell processor out of the PS3 with its SPEs), then a thread would run on a specific core and would be able to access only that core's memory; to access data allocated (by another thread) in another core's memory, the software has to sort that out itself by sending the data across some interconnect. This is much harder to code for until learned, but the results can be impressively fast. It took Intel about 10 years to match the Cell processor for raw maths speed.
