Why does the placement of code affect performance in C++?

Question

I am running a performance test and found that changing the order of two blocks of code makes the program faster without changing the result.

Performance is measured as execution time using the chrono library.
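For readers who want to reproduce the measurement, here is a minimal sketch of timing a block with std::chrono. The helper name elapsed_ms is hypothetical and is not taken from the linked program; steady_clock is chosen here because it is monotonic, which is an assumption about what suits this kind of test rather than what the question actually used.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper: runs `work` once and returns the wall-clock time
// it took, in milliseconds, measured with a monotonic clock.
template <typename F>
double elapsed_ms(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

A single run is noisy, so it is usually worth calling such a helper several times per variant and comparing the fastest runs.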

vector< vector<float> > U(matrix_size, vector<float>(matrix_size,14));
vector< vector<float> > L(matrix_size, vector<float>(matrix_size,12));
vector< vector<float> > matrix_positive_definite(matrix_size, vector<float>(matrix_size,23));

for (i = 0; i < matrix_size; ++i) {
    for (j = 0; j < matrix_size; ++j) {
        // Part II: ________________________________________
        float sum2 = 0;
        for (k = 0; k <= (i - 1); ++k) {
            float sum2_temp = L[i][k] * U[k][j];
            sum2 += sum2_temp;
        }
        // Part I: _________________________________________
        float sum1 = 0;
        for (k = 0; k <= (j - 1); ++k) {
            float sum1_temp = L[i][k] * U[k][j];
            sum1 += sum1_temp;
        }
        // _________________________________________________
        if (i > j) {
            L[i][j] = (matrix_positive_definite[i][j] - sum1) / U[j][j];
        }
        else {
            U[i][j] = matrix_positive_definite[i][j] - sum2;
        }
    }
}

I compile with g++ -O3 (GCC 7.4.0 on an Intel i5, Win10). I swapped the order of Part I and Part II and got a faster result when Part II is executed before Part I. What's going on?

This is the link to the whole program.

Answer 1

I would try running both versions with perf stat -d <app> and see where the performance counters differ.

When benchmarking, you may want to fix the CPU frequency so that frequency scaling doesn't affect your scores.

Aligning loops on a 32-byte boundary often increases performance by 8-30%. See Causes of Performance Instability due to Code Placement in X86 - Zia Ansari, Intel for more details.

Try compiling your code with -O3 -falign-loops=32 -falign-functions=32 -march=native -mtune=native.

Answer 2

Running perf stat -ddd while experimenting with the provided program shows that the major difference between the two versions lies in prefetching.

part II -> part I   and   part I -> part II (original program)
   73,069,502      L1-dcache-prefetch-misses

part II -> part I   and   part II -> part I (only the efficient version)
   31,719,117      L1-dcache-prefetch-misses

part I -> part II   and   part I -> part II (only the less efficient version)
  114,520,949      L1-dcache-prefetch-misses

NB: according to Compiler Explorer, the code generated for part II -> part I is very similar to that for part I -> part II.

My guess is that, on the first iterations over i, part II does almost nothing, but the iterations over j make part I access U[k][j] in a pattern that eases prefetching for the following iterations over i.

Answer 3

The faster version performs similarly to what you get when you move each loop inside the if (i > j) test, so that only the sum that is actually used gets computed.

// Names follow the question's snippet (L, U).
if (i > j) {
    // Below the diagonal: only sum1 is needed.
    float sum1 = 0;
    for (std::size_t k = 0; k < j; ++k) {
        sum1 += L[i][k] * U[k][j];
    }
    L[i][j] = matrix_positive_definite[i][j] - sum1;
    L[i][j] /= U[j][j];
}
if (i <= j) {
    // On or above the diagonal: only sum2 is needed.
    float sum2 = 0;
    for (std::size_t k = 0; k < i; ++k) {
        sum2 += L[i][k] * U[k][j];
    }
    U[i][j] = matrix_positive_definite[i][j] - sum2;
}

So I would assume that in one case the compiler is able to do that transformation itself. It only happens at -O3 for me. (1950X, msys2/GCC 8.3.0, Win10)

I don't know exactly which optimization this is or what conditions have to be met for it to apply. It's none of the options explicitly listed for -O3 (-O2 plus all of them is not enough). Apparently it already stops happening when std::size_t instead of int is used for the loop counters.
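To check that moving the loops into the branches does not change the result, the two shapes can be put side by side. This is a sketch under stated assumptions: the function names, the helper max_abs_diff, and the diagonally dominant input from make_input are all hypothetical. That input is chosen so that no U[j][j] becomes zero, and it is not the constant-filled matrix from the question. The loop counters are int, as the question's code appears to use.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// Hypothetical input: diagonally dominant, so no U[j][j] becomes zero.
Matrix make_input(int n) {
    Matrix a(n, std::vector<float>(n, 1.0f));
    for (int i = 0; i < n; ++i) a[i][i] = static_cast<float>(n + 1);
    return a;
}

// Shape of the question's loop: both sums are computed for every (i, j),
// although only one of them is used.
void lu_both_sums(const Matrix& a, Matrix& L, Matrix& U) {
    const int n = static_cast<int>(a.size());
    assert(static_cast<int>(L.size()) == n && static_cast<int>(U.size()) == n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float sum2 = 0;
            for (int k = 0; k < i; ++k) sum2 += L[i][k] * U[k][j];
            float sum1 = 0;
            for (int k = 0; k < j; ++k) sum1 += L[i][k] * U[k][j];
            if (i > j) L[i][j] = (a[i][j] - sum1) / U[j][j];
            else       U[i][j] = a[i][j] - sum2;
        }
    }
}

// Shape of the rewrite: each sum is computed only in the branch that uses it.
void lu_one_sum(const Matrix& a, Matrix& L, Matrix& U) {
    const int n = static_cast<int>(a.size());
    assert(static_cast<int>(L.size()) == n && static_cast<int>(U.size()) == n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i > j) {
                float sum1 = 0;
                for (int k = 0; k < j; ++k) sum1 += L[i][k] * U[k][j];
                L[i][j] = (a[i][j] - sum1) / U[j][j];
            } else {
                float sum2 = 0;
                for (int k = 0; k < i; ++k) sum2 += L[i][k] * U[k][j];
                U[i][j] = a[i][j] - sum2;
            }
        }
    }
}

// Largest element-wise absolute difference between two equally sized matrices.
float max_abs_diff(const Matrix& x, const Matrix& y) {
    float worst = 0;
    for (std::size_t r = 0; r < x.size(); ++r)
        for (std::size_t c = 0; c < x[r].size(); ++c)
            worst = std::max(worst, std::fabs(x[r][c] - y[r][c]));
    return worst;
}
```

Running both functions on the same input with zero-initialized L and U should give matching factors up to rounding. Timing the two at -O2 and -O3 is one way to see whether the compiler performs this transformation on its own.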

3 answers

Answer 1
5 2019-05-25 20:58:32

Answer 2
2 2019-05-26 10:40:25

Answer 3
1 2019-05-25 23:46:11
