I am running a performance test and found that changing the order of the code makes it faster without changing the result.
Performance is measured as execution time using the chrono library.
vector< vector<float> > U(matrix_size, vector<float>(matrix_size, 14));
vector< vector<float> > L(matrix_size, vector<float>(matrix_size, 12));
vector< vector<float> > matrix_positive_definite(matrix_size, vector<float>(matrix_size, 23));

for (i = 0; i < matrix_size; ++i) {
    for (j = 0; j < matrix_size; ++j) {
        //Part II : ________________________________________
        float sum2 = 0;
        for (k = 0; k <= (i - 1); ++k) {
            float sum2_temp = L[i][k] * U[k][j];
            sum2 += sum2_temp;
        }
        //Part I : _____________________________________________
        float sum1 = 0;
        for (k = 0; k <= (j - 1); ++k) {
            float sum1_temp = L[i][k] * U[k][j];
            sum1 += sum1_temp;
        }
        //__________________________________________
        if (i > j) {
            L[i][j] = (matrix_positive_definite[i][j] - sum1) / U[j][j];
        }
        else {
            U[i][j] = matrix_positive_definite[i][j] - sum2;
        }
    }
}
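Timing is done roughly like this (a minimal sketch with std::chrono; the exact calls in the full program may differ):

#include <chrono>
#include <iostream>

auto start = std::chrono::steady_clock::now();
// ... the two nested loops shown above ...
auto stop = std::chrono::steady_clock::now();
auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
std::cout << "elapsed: " << elapsed.count() << " us\n";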
I compile with g++ -O3 (GCC 7.4.0 on an Intel i5 / Win10). I changed the order of Part I & Part II and got a faster result when Part II is executed before Part I. What's going on?
This is the link to the whole program.
I would try running both versions with perf stat -d <app>
and see where the performance counters differ.
When benchmarking, you may want to fix the CPU frequency so it doesn't affect your scores.
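On Linux, for example, something along these lines (assuming the cpupower tool is available; ./app is a placeholder for the compiled binary):

sudo cpupower frequency-set --governor performance
perf stat -d ./app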
Aligning loops on a 32-byte boundary often increases performance by 8-30%. See "Causes of Performance Instability due to Code Placement in X86" by Zia Ansari (Intel) for more details.
Try compiling your code with -O3 -falign-loops=32 -falign-functions=32 -march=native -mtune=native.
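For example (main.cpp and app are just placeholders for your source and output files):

g++ -O3 -falign-loops=32 -falign-functions=32 -march=native -mtune=native main.cpp -o app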
Running perf stat -ddd while playing around with the provided program shows that the major difference between the two versions lies mainly in prefetching.
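For instance (./app is a placeholder for the compiled program):

perf stat -ddd ./app

Among the reported counters, the L1 data-cache prefetch misses show the largest difference: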
part II -> part I and part I -> part II (original program):
    73,069,502      L1-dcache-prefetch-misses

part II -> part I and part II -> part I (only the efficient version):
    31,719,117      L1-dcache-prefetch-misses

part I -> part II and part I -> part II (only the less efficient version):
    114,520,949     L1-dcache-prefetch-misses
NB: according to Compiler Explorer, the generated code for part II -> part I is very similar to that for part I -> part II.
My guess is that, on the first iterations over i, part II does almost nothing, but the iterations over j make part I access U[k][j] in a pattern that eases prefetching for the following iterations over i.
The faster version gives performance similar to what you get when you move the loops inside the if (i > j) / else branches:
if (i > j) {
    float sum1 = 0;
    for (std::size_t k = 0; k < j; ++k) {
        sum1 += L[i][k] * U[k][j];
    }
    L[i][j] = matrix_positive_definite[i][j] - sum1;
    L[i][j] /= U[j][j];
}
if (i <= j) {
    float sum2 = 0;
    for (std::size_t k = 0; k < i; ++k) {
        sum2 += L[i][k] * U[k][j];
    }
    U[i][j] = matrix_positive_definite[i][j] - sum2;
}
So I would assume that in one case the compiler is able to do that transformation itself. It only happens at -O3 for me (1950X, msys2/GCC 8.3.0, Win10).
I don't know which optimization this is exactly or what conditions have to be met for it to apply. It's none of the options explicitly listed for -O3 (-O2 plus all of them is not enough). Apparently it already stops doing it when std::size_t instead of int is used for the loop counters.
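One way to narrow it down (just a suggestion, I have not verified that it pinpoints this particular transformation) is to ask GCC to report the loop optimizations it performs and diff the output between the int and std::size_t variants (main.cpp and app are placeholders):

g++ -O3 -fopt-info-loop-optimized -fopt-info-loop-missed main.cpp -o app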