
C++ performance std::array vs std::vector

Good evening.

I know that C-style arrays and std::array aren't faster than vectors. I use vectors all the time (and I use them well). However, I have a situation in which std::array performs better than std::vector, and I have no clue why (tested with clang 7.0 and gcc 8.2).

Let me share some simple code:

#include <vector>
#include <array>

// some size constant
const size_t N = 100;

// some vectors and arrays
using vec = std::vector<double>;
using arr = std::array<double,3>;
// arrays are constructed faster here due to known size, but it is irrelevant
const vec v1 {1.0,-1.0,1.0};
const vec v2 {1.0,2.0,1.0};
const arr a1 {1.0,-1.0,1.0};
const arr a2 {1.0,2.0,1.0};

// vector to store combinations of vectors or arrays
std::vector<double> glob(N,0.0);

So far, so good. The initialization code above is not included in the benchmark. Now, let's write a function to combine elements (double) of v1 and v2, or of a1 and a2:

// some combination
auto comb(const double m, const double f)
{
  return m + f;
}

And the benchmark functions:

void assemble_vec()
{
    for (size_t i=0; i<N-2; ++i)
    {
        glob[i] += comb(v1[0],v2[0]);
        glob[i+1] += comb(v1[1],v2[1]);
        glob[i+2] += comb(v1[2],v2[2]);
    }  
}

void assemble_arr()
{
    for (size_t i=0; i<N-2; ++i)
    {
        glob[i] += comb(a1[0],a2[0]);
        glob[i+1] += comb(a1[1],a2[1]);
        glob[i+2] += comb(a1[2],a2[2]);
    }  
}

I've tried this with clang 7.0 and gcc 8.2. In both cases, the array version runs almost twice as fast as the vector version.

Does anyone know why? Thanks!

GCC (and probably Clang) is optimizing out the arrays, but not the vectors

Your base assumption, that arrays are necessarily slower than vectors, is incorrect. Because vectors require their data to be stored in allocated memory (which with a default allocator means dynamic memory), the values have to live in heap memory and be accessed repeatedly during the execution of this program. Conversely, the values used by the array can be optimized out entirely and referenced directly in the program's assembly.

Below is what GCC emits as assembly for the assemble_vec and assemble_arr functions once optimizations are turned on:

[-snip-]
//==============
//Vector Version
//==============
assemble_vec():
        mov     rax, QWORD PTR glob[rip]
        mov     rcx, QWORD PTR v2[rip]
        mov     rdx, QWORD PTR v1[rip]
        movsd   xmm1, QWORD PTR [rax+8]
        movsd   xmm0, QWORD PTR [rax]
        lea     rsi, [rax+784]
.L23:
        movsd   xmm2, QWORD PTR [rcx]
        addsd   xmm2, QWORD PTR [rdx]
        add     rax, 8
        addsd   xmm0, xmm2
        movsd   QWORD PTR [rax-8], xmm0
        movsd   xmm0, QWORD PTR [rcx+8]
        addsd   xmm0, QWORD PTR [rdx+8]
        addsd   xmm0, xmm1
        movsd   QWORD PTR [rax], xmm0
        movsd   xmm1, QWORD PTR [rcx+16]
        addsd   xmm1, QWORD PTR [rdx+16]
        addsd   xmm1, QWORD PTR [rax+8]
        movsd   QWORD PTR [rax+8], xmm1
        cmp     rax, rsi
        jne     .L23
        ret

//=============
//Array Version
//=============
assemble_arr():
        mov     rax, QWORD PTR glob[rip]
        movsd   xmm2, QWORD PTR .LC1[rip]
        movsd   xmm3, QWORD PTR .LC2[rip]
        movsd   xmm1, QWORD PTR [rax+8]
        movsd   xmm0, QWORD PTR [rax]
        lea     rdx, [rax+784]
.L26:
        addsd   xmm1, xmm3
        addsd   xmm0, xmm2
        add     rax, 8
        movsd   QWORD PTR [rax-8], xmm0
        movapd  xmm0, xmm1
        movsd   QWORD PTR [rax], xmm1
        movsd   xmm1, QWORD PTR [rax+8]
        addsd   xmm1, xmm2
        movsd   QWORD PTR [rax+8], xmm1
        cmp     rax, rdx
        jne     .L26
        ret
[-snip-]

There are several differences between these sections of code, but the critical difference is after the .L23 and .L26 labels respectively: in the vector version, the numbers are added together through less efficient opcodes, compared to the array version, which uses (more) SSE instructions. The vector version also involves more memory lookups than the array version. These factors in combination result in code that executes faster for the std::array version than for the std::vector version.

C++ aliasing rules don't let the compiler prove that glob[i] += stuff doesn't modify one of the elements of const vec v1 {1.0,-1.0,1.0}; or v2.

const on a std::vector means the "control block" pointers can be assumed not to be modified after it's constructed, but the memory is still dynamically allocated, and all the compiler knows is that it effectively has a const double * in static storage.

Nothing in the std::vector implementation lets the compiler rule out some other non-const pointer pointing into that storage. For example, the double *data in the control block of glob.

C++ doesn't provide a way for library implementers to tell the compiler that the storage for different std::vectors doesn't overlap. They can't use __restrict (even on compilers that support that extension), because that could break programs that take the address of a vector element. See the C99 documentation for restrict.


But with const arr a1 {1.0,-1.0,1.0}; and a2, the doubles themselves can go in read-only static storage, and the compiler knows this. Therefore it can evaluate comb(a1[0],a2[0]); and so on at compile time. In @Xirema's answer, you can see the asm output loads the constants .LC1 and .LC2. (Only two constants, because both a1[0]+a2[0] and a1[2]+a2[2] are 1.0+1.0. The loop body uses xmm2 as a source operand for addsd twice, and the other constant once.)


But couldn't the compiler still do the sums once, outside the loop, at runtime?

No, again because of potential aliasing. It doesn't know that stores into glob[i+0..3] won't modify the contents of v1[0..2], so it reloads from v1 and v2 every time through the loop, after the store into glob.

(It doesn't have to reload the vector<> control-block pointers, though, because type-based strict aliasing rules let it assume that storing a double doesn't modify a double*.)

The compiler could have checked that glob.data() + 0 .. N-3 didn't overlap with either of v1/v2.data() + 0 .. 2, and made a different version of the loop for that case, hoisting the three comb() results out of the loop.

This is a useful optimization that some compilers do when auto-vectorizing, if they can't prove lack of aliasing; it's clearly a missed optimization in your case that gcc doesn't check for overlap, because it would make the function run much faster. But the question is whether the compiler could reasonably guess that it was worth emitting asm that checks at runtime for overlap and has 2 different versions of the same loop. With profile-guided optimization, it would know the loop is hot (runs many iterations) and worth spending extra time on. But without that, the compiler might not want to risk bloating the code too much.

ICC19 (Intel's compiler) in fact does do something like that here, but it's weird: if you look at the beginning of assemble_vec (on the Godbolt compiler explorer), it loads the data pointer from glob, then adds 8 and subtracts the pointer again, producing a constant 8. Then it branches at runtime on 8 > 784 (not taken) and then -8 < 784 (taken). It looks like this was supposed to be an overlap check, but maybe it used the same pointer twice instead of v1 and v2? (784 = 8*100 - 16 = sizeof(double)*N - 16)

Anyway, it ends up running the ..B2.19 loop that hoists all 3 comb() calculations, and interestingly does 2 iterations of the loop at once, with 4 scalar loads and stores to glob[i+0..4] and 6 addsd (scalar double) add instructions.

Elsewhere in the function body, there's a vectorized version that uses 3x addpd (packed double), just storing/reloading 128-bit vectors that partially overlap. This will cause store-forwarding stalls, but out-of-order execution may be able to hide that. It's just really weird that it branches at runtime on a calculation that will produce the same result every time, and never uses that loop. Smells like a bug.


If glob[] had been a static array, you'd still have had a problem, because the compiler can't know that v1/v2.data() aren't pointing into that static array.

I thought that if you accessed it through double *__restrict g = &glob[0];, there wouldn't have been a problem at all. That promises the compiler that g[i] += ... won't affect any values that you access through other pointers, like v1[0].

In practice, that does not enable hoisting of comb() for gcc, clang, or ICC -O3. But it does for MSVC. (I've read that MSVC doesn't do type-based strict-aliasing optimizations, but it's not reloading glob.data() inside the loop, so it has somehow figured out that storing a double won't modify a pointer. MSVC does, however, define the behaviour of *(int*)my_float for type-punning, unlike other C++ implementations.)

For testing, I put this on Godbolt:

//__attribute__((noinline))
void assemble_vec()
{
     double *__restrict g = &glob[0];   // Helps MSVC, but not gcc/clang/ICC
    // std::vector<double> &g = glob;   // actually hurts ICC it seems?
    // #define g  glob                  // so use this as the alternative to __restrict
    for (size_t i=0; i<N-2; ++i)
    {
        g[i] += comb(v1[0],v2[0]);
        g[i+1] += comb(v1[1],v2[1]);
        g[i+2] += comb(v1[2],v2[2]);
    }  
}

We get this from MSVC, outside the loop:

    movsd   xmm2, QWORD PTR [rcx]       # v2[0]
    movsd   xmm3, QWORD PTR [rcx+8]
    movsd   xmm4, QWORD PTR [rcx+16]
    addsd   xmm2, QWORD PTR [rax]       # += v1[0]
    addsd   xmm3, QWORD PTR [rax+8]
    addsd   xmm4, QWORD PTR [rax+16]
    mov     eax, 98                             ; 00000062H

Then we get an efficient-looking loop.

So this is a missed optimization for gcc/clang/ICC.

I think the point is that you use too small a storage size (six doubles). This allows the compiler, in the std::array case, to eliminate RAM storage entirely by placing the values in registers; the compiler can keep stack variables in registers when that is more optimal. This cuts memory accesses in half (only the writes to glob remain). In the std::vector case, the compiler cannot perform such an optimization, since dynamic memory is used. Try using significantly larger sizes for a1, a2, v1, v2.
