
Why does position of code affect performance in C++?

I am running a performance test and found that changing the order of the code makes it faster without changing the result.

Performance is measured as execution time using the chrono library.

vector< vector<float> > U(matrix_size, vector<float>(matrix_size, 14));
vector< vector<float> > L(matrix_size, vector<float>(matrix_size, 12));
vector< vector<float> > matrix_positive_definite(matrix_size, vector<float>(matrix_size, 23));

for (int i = 0; i < matrix_size; ++i) {
    for (int j = 0; j < matrix_size; ++j) {
//Part II : ________________________________________
        float sum2 = 0;
        for (int k = 0; k <= (i-1); ++k) {
            float sum2_temp = L[i][k] * U[k][j];
            sum2 += sum2_temp;
        }
//Part I : _____________________________________________
        float sum1 = 0;
        for (int k = 0; k <= (j-1); ++k) {
            float sum1_temp = L[i][k] * U[k][j];
            sum1 += sum1_temp;
        }
//__________________________________________
        if (i > j) {
            L[i][j] = (matrix_positive_definite[i][j] - sum1) / U[j][j];
        }
        else {
            U[i][j] = matrix_positive_definite[i][j] - sum2;
        }
    }
}

I compile with g++ -O3 (GCC 7.4.0 on an Intel i5, Win10). I swapped the order of Part I and Part II and got a faster result when Part II executes before Part I. What's going on?

This is the link to the whole program.

I would try running both versions with perf stat -d <app> and see where the performance counters differ.

When benchmarking, you may also want to fix the CPU frequency so that it doesn't affect your scores.


Aligning loops on a 32-byte boundary often increases performance by 8-30%. See Causes of Performance Instability due to Code Placement in X86 - Zia Ansari, Intel for more details.

Try compiling your code with -O3 -falign-loops=32 -falign-functions=32 -march=native -mtune=native.

Running perf stat -ddd while playing around with the provided program shows that the major difference between the two versions lies mainly in prefetching.

part II -> part I   and   part I -> part II (original program)
   73,069,502      L1-dcache-prefetch-misses

part II -> part I   and   part II -> part I (only the efficient version)
   31,719,117      L1-dcache-prefetch-misses

part I -> part II   and   part I -> part II (only the less efficient version)
  114,520,949      L1-dcache-prefetch-misses

NB: according to Compiler Explorer, the generated code for part II -> part I is very similar to that for part I -> part II.

I guess that, on the first iterations over i, part II does almost nothing, while the iterations over j make part I access U[k][j] in a pattern that eases prefetching for the subsequent iterations over i.

The faster version performs similarly to what you get when you move the loops inside the if (i > j):

if (i > j) {
    float sum1 = 0;
    for (int k = 0; k < j; ++k) {
        sum1 += L[i][k] * U[k][j];
    }
    L[i][j] = (matrix_positive_definite[i][j] - sum1) / U[j][j];
}
else {
    float sum2 = 0;
    for (int k = 0; k < i; ++k) {
        sum2 += L[i][k] * U[k][j];
    }
    U[i][j] = matrix_positive_definite[i][j] - sum2;
}

So I would assume that in one case the compiler is able to do that transformation itself. It only happens at -O3 for me. (1950X, msys2/GCC 8.3.0, Win10)

I don't know which optimization this is exactly or what conditions must be met for it to apply. It's none of the options explicitly listed for -O3 (-O2 plus all of them is not enough). It also apparently stops happening when std::size_t is used instead of int for the loop counters.
