C++ How does a vector of pointers affect performance?

I am wondering how a std::vector of pointers to objects affects the performance of a program, versus a std::vector that contains the objects directly. Specifically, I am referring to the speed of the program.

I was taught to use std::vector over other STL containers such as std::list for its speed, since all of its data is stored contiguously in memory rather than being fragmented. This means that iterating over the elements is fast. However, my thinking is that if my vector contains pointers to the objects, then the objects themselves can still be stored anywhere in memory and only the pointers are stored contiguously. I am wondering how this affects the performance of a program when iterating over the vector and accessing the objects.

My current project design uses a vector of pointers so that I can take advantage of virtual functions, but I'm unsure whether this is worth the speed hit I may encounter when the vector becomes very large. Thanks for your help!

If you need the polymorphism then, as people have said, you should store pointers to the base. If, later, you decide this code is hot and its CPU cache usage needs optimising, you can do that, for example, by making the objects fit cleanly into cache lines and/or by using a custom allocator to ensure locality of the dereferenced data.
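For illustration, here is a minimal sketch of that kind of polymorphic container, assuming a hypothetical Particle base class with a virtual update function (the names are made up for this example):

    #include <memory>
    #include <vector>

    struct Particle {
        virtual ~Particle() = default;
        virtual void update(float dt) = 0;
    };

    struct FireParticle : Particle {
        float heat = 1.0f;
        void update(float dt) override { heat -= dt; }
    };

    int main() {
        // Owning pointers to the base keep virtual dispatch working;
        // only the pointers are contiguous, the objects live on the heap.
        std::vector<std::unique_ptr<Particle>> particles;
        particles.push_back(std::make_unique<FireParticle>());

        for (auto& p : particles)
            p->update(0.016f);   // virtual call through the base pointer
    }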

Slicing is when you store objects by value as Base and copy-construct or assign a Derived into them: the Derived will be sliced. The copy constructor or assignment only takes a Base and will ignore any data in Derived; there isn't enough space allocated in the Base to hold the full size of Derived. I.e. if Base is 8 bytes and Derived is 16, there is only room for Base's 8 bytes in the destination value, even if you provided a copy constructor/assignment operator that explicitly took a Derived.
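A minimal sketch of that slicing, using hypothetical Base/Derived types sized roughly like the 8- and 16-byte example above (exact sizes depend on the platform):

    #include <iostream>

    struct Base            { int a = 1, b = 2; };   // 8 bytes on typical platforms
    struct Derived : Base  { int c = 3, d = 4; };   // 16 bytes on typical platforms

    int main() {
        Derived derived;

        Base by_value = derived;   // sliced: only the Base subobject is copied
        // by_value.c;             // would not compile - the Derived part is gone

        Base* by_pointer = &derived;   // no slicing: the full Derived object remains
        std::cout << sizeof(by_value) << ' ' << sizeof(*by_pointer) << ' '
                  << sizeof(derived) << '\n';        // e.g. 8 8 16
    }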

I should say it's really not worth thinking about data cache coherence if you're making heavy use of virtual dispatch that the optimiser can't elide. An instruction cache miss is far more devastating than a data cache miss, and virtual dispatch can cause instruction cache misses because the CPU has to look up the vtable pointer before it can load the function into the instruction cache, so it can't preemptively load it.

CPUs tend to preload as much data as they can into their caches: if you load an address, the entire cache line (~64 bytes) around it will be loaded into the cache, and often the cache lines before and after it will be loaded too, which is why people are so keen on data locality.

So in your vector-of-pointers scenario, loading the first pointer will pull lots of pointers into the cache at once, but loading through each pointer will trigger a cache miss that loads up the data around that object. If your actual particles are 16 bytes and local to each other, you won't lose much beyond this; if they're scattered all over the heap and large, every iteration will churn the cache heavily, though things will be relatively fine while you're working on an individual particle.
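To make the difference concrete, here is a sketch of the two layouts, using a hypothetical 16-byte Particle; only the access patterns matter, the names are invented:

    #include <vector>

    // Hypothetical 16-byte particle, used only to illustrate the two layouts.
    struct Particle { float x, y, dx, dy; };

    // Values: the particle data itself is contiguous, so iteration walks memory
    // linearly and the hardware prefetcher can stay ahead of the loop.
    float sum_x_values(const std::vector<Particle>& ps) {
        float s = 0.0f;
        for (const Particle& p : ps) s += p.x;
        return s;
    }

    // Pointers: only the pointers are contiguous; each dereference may jump to an
    // unrelated heap address, so every particle can cost an extra data cache miss.
    float sum_x_pointers(const std::vector<const Particle*>& ps) {
        float s = 0.0f;
        for (const Particle* p : ps) s += p->x;
        return s;
    }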

Traditionally, particle systems tend to be very hot and like to pack data tightly; it's common to see 16-byte plain-old-data particles which you iterate over linearly with very predictable branching. That means you can generally rely on four particles per cache line and have the prefetcher stay well ahead of your code.
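As a sketch of that kind of layout, again assuming a hypothetical 16-byte POD particle (the fields are invented; only the size and the linear pass matter):

    #include <vector>

    struct Particle {
        float x, y;    // position
        float dx, dy;  // velocity
    };
    static_assert(sizeof(Particle) == 16, "four particles per 64-byte cache line");

    // One linear, branch-predictable pass over contiguous data.
    void step(std::vector<Particle>& particles, float dt) {
        for (Particle& p : particles) {
            p.x += p.dx * dt;
            p.y += p.dy * dt;
        }
    }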

I should also say that CPU caches are CPU dependent, and I'm focusing on Intel x86 here. ARM, for example, tends to be quite a bit behind Intel: the pipeline is less complex and the prefetcher less capable, so cache misses can be less devastating.
