

Performance Impact of Nested Vectors vs. Contiguous Arrays

Have there been any reliable tests that clearly show the performance differences between accessing and writing to nested vectors versus C++'s built-in arrays? I've heard that using nested (multi-dimensional) vectors typically carries some performance overhead compared to accessing elements in a single array (where all elements are stored in contiguous memory), but this all seems hypothetical to me. I have yet to see any tests that actually demonstrate these differences. Are they significant? I'm sure it depends on the scenario, but as an inexperienced programmer, I'm not quite sure at what point these differences become significant.
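To make the comparison concrete, something like the following is what I have in mind; this is only a minimal sketch, and the names make_nested, make_flat and at are just illustrative:

    #include <cstddef>
    #include <vector>

    // Nested layout: every inner std::vector is a separate heap allocation,
    // so the innermost rows are generally not contiguous with one another.
    std::vector<std::vector<std::vector<int>>> make_nested(int Z, int Y, int X) {
        return std::vector<std::vector<std::vector<int>>>(
            Z, std::vector<std::vector<int>>(Y, std::vector<int>(X, 0)));
    }

    // Flat layout: one contiguous block; a 3D index (z, y, x) is folded into
    // a single offset.
    std::vector<int> make_flat(int Z, int Y, int X) {
        return std::vector<int>(static_cast<std::size_t>(Z) * Y * X, 0);
    }

    inline int& at(std::vector<int>& flat, int z, int y, int x, int Y, int X) {
        return flat[(static_cast<std::size_t>(z) * Y + y) * X + x];
    }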

It definitely depends on the scenario, to the extent that I don't think it's possible to answer in a general way which approach is fastest. The fastest approach is going to be the one where the access patterns have the best data locality, which depends highly on the access pattern as well as on how the structures are laid out in memory; in the case of nested vectors, the layout depends on the allocator and probably varies quite a bit between compilers.

I'd follow the general rule of optimization, which is to first write things in the most straightforward way and then attempt optimization when you can prove there is a bottleneck.

Two things contribute to the runtime differences between nested and flattened arrays: caching behaviour and indirection.

  • CPUs use a hierarchy of caches to avoid accessing RAM directly too frequently. This exploits the fact that most memory accesses are either contiguous or have a certain temporal locality, i.e. what was accessed recently will be accessed again soon.
    This means that if the innermost arrays of your nested array are rather large, you will notice little to no difference compared to a flat array, provided you iterate over the values in a contiguous fashion. In other words, when iterating over a range of values, the innermost loop should iterate over consecutive elements for a flat array, and over the innermost array for a nested array.
  • If, however, your access patterns are random, the most important difference in timing comes from indirection:
    For a flat array, you use something like A[(Z * M + Y) * N + X], so you do 4 arithmetic operations and then a single memory access.
    For a nested array, you use A[Z][Y][X], so there are actually three interdependent memory accesses: you need to know A[Z] before you can access A[Z][Y], and so on. Because of the superscalar architecture of modern CPUs, operations that can be executed in parallel are especially efficient; interdependent operations much less so. So you have a few arithmetic operations and one memory load on the one side, and three interdependent loads on the other, which is significantly slower. It is possible that for nested arrays the contents of A, and also of A[Z] for some values of Z, can be found in the cache hierarchy, but if your nested array is sufficiently large it will never fit completely into the cache, thus leading to multiple cache misses and memory loads (nested) instead of just a single cache miss and load (flat) for a single random access into the array. A small benchmark sketch illustrating this follows after this list.
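To make the random-access case concrete, here is a hypothetical micro-benchmark sketch. The sizes, seed and iteration count are arbitrary choices, this is not a measured result, and a real test would need warm-up and a proper benchmarking harness:

    #include <chrono>
    #include <cstdio>
    #include <random>
    #include <vector>

    int main() {
        const int Z = 64, Y = 64, X = 64;   // 64^3 ints, roughly 1 MiB
        std::vector<std::vector<std::vector<int>>> nested(
            Z, std::vector<std::vector<int>>(Y, std::vector<int>(X, 1)));
        std::vector<int> flat(static_cast<std::size_t>(Z) * Y * X, 1);

        // Pre-generate the random indices so both loops do identical work
        // apart from the element access itself.
        const int N = 1000000;
        std::mt19937 rng(42);
        std::uniform_int_distribution<int> dist(0, 63);
        std::vector<int> zs(N), ys(N), xs(N);
        for (int i = 0; i < N; ++i) { zs[i] = dist(rng); ys[i] = dist(rng); xs[i] = dist(rng); }

        auto t0 = std::chrono::steady_clock::now();
        long long sumNested = 0;
        for (int i = 0; i < N; ++i)
            sumNested += nested[zs[i]][ys[i]][xs[i]];   // three dependent loads per access
        auto t1 = std::chrono::steady_clock::now();

        long long sumFlat = 0;
        for (int i = 0; i < N; ++i)                     // index arithmetic + one load per access
            sumFlat += flat[(static_cast<std::size_t>(zs[i]) * Y + ys[i]) * X + xs[i]];
        auto t2 = std::chrono::steady_clock::now();

        long long nestedUs = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
        long long flatUs   = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
        std::printf("nested: %lld us, flat: %lld us (checksums: %lld %lld)\n",
                    nestedUs, flatUs, sumNested, sumFlat);
        return 0;
    }

On typical hardware you would expect the nested version to come out slower here, because each access has to chase two extra pointers before the final load can even start; the exact gap will vary by machine and compiler.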

Also see his question, especially the shorter answers there, for a more detailed discussion of caching (my answer) and indirection (Peter's answer); it also provides an example where there is no noticeable difference between nested and flat arrays (after fixing the indexing bug, of course ;) ).

So if you want to know whether there are significant runtime differences between them, my answer would be:

  • If you do random access, you will definitely notice the multiple indirections, leading to longer runtimes for nested arrays.

  • If you do contiguous access, use the correct ordering of the loops (innermost loop = innermost array for nested arrays / innermost index for flat arrays), and the innermost dimension of the multi-dimensional array is large enough, then the difference will be negligible, since the compiler will be able to move all the indirections out of the innermost loop (see the loop-ordering sketch below).
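For the contiguous case, the loop ordering described above might look like the following minimal sketch; the dimensions and function names are illustrative:

    #include <cstddef>
    #include <vector>

    // Nested: the innermost loop walks one innermost array at a time.
    void fill_nested(std::vector<std::vector<std::vector<int>>>& a) {
        for (std::size_t z = 0; z < a.size(); ++z)
            for (std::size_t y = 0; y < a[z].size(); ++y)
                for (std::size_t x = 0; x < a[z][y].size(); ++x)   // innermost array
                    a[z][y][x] = 1;
    }

    // Flat: the innermost loop walks consecutive elements of the single block.
    void fill_flat(std::vector<int>& a, int Z, int Y, int X) {
        for (int z = 0; z < Z; ++z)
            for (int y = 0; y < Y; ++y)
                for (int x = 0; x < X; ++x)                        // consecutive elements
                    a[(static_cast<std::size_t>(z) * Y + y) * X + x] = 1;
    }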
