Efficiency of STL algorithms with fixed size arrays

Question

In general, I assume that the STL implementation of any algorithm is at least as efficient as anything I can come up with (with the additional benefit of being error-free). However, I came to wonder whether the STL's focus on iterators might be harmful in some situations.

Let's assume I want to calculate the inner product of two fixed-size arrays. My naive implementation would look like this:

std::array<double, 100000> v1;
std::array<double, 100000> v2;
//fill with arbitrary numbers

double sum = 0.0;
for (size_t i = 0; i < v1.size(); ++i) {
    sum += v1[i] * v2[i];
}

As the number of iterations and the memory layout are known at compile time and all operations map directly to native processor instructions, the compiler should easily be able to generate the "optimal" machine code from this (loop unrolling, vectorization / FMA instructions ...).
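A caveat to the reasoning above, as a hedged sketch: floating-point addition is not associative, so under strict IEEE semantics a compiler generally may not split the single `sum` accumulator into several partial sums (which a vectorized reduction requires) unless a flag such as -ffast-math permits it. The version below does that reordering by hand. `dot_unrolled4` is a hypothetical name and the unroll factor of 4 is an arbitrary assumption; the result can differ from the sequential loop in the last bits.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper (not from the original post): inner product with four
// independent partial sums, so the additions no longer form one serial chain.
template <std::size_t N>
double dot_unrolled4(const std::array<double, N>& a,
                     const std::array<double, N>& b) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    // Main loop: process four elements per iteration.
    for (; i + 4 <= N; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    // Combine the partial sums, then handle the remaining 0..3 elements.
    double sum = (s0 + s1) + (s2 + s3);
    for (; i < N; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Whether this is faster than the plain loop depends on the compiler, the flags, and whether the arrays fit in cache, so it should be measured rather than assumed.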

The STL version

double sum = std::inner_product(cbegin(v1), cend(v1), cbegin(v2), 0.0);

on the other hand adds some additional indirections, and even if everything is inlined, the compiler still has to deduce that it is working on a contiguous memory region and where this region lies. While this is certainly possible in principle, I wonder whether a typical C++ compiler will actually do it.

So my question is: do you think there can be a performance benefit in implementing standard algorithms that operate on fixed-size arrays on my own, or will the STL version always outperform a manual implementation?
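To make the comparison concrete, here is a minimal sketch of three ways to write the same inner product. The names `dot_indexed`, `dot_iterators` and `dot_pointers` are hypothetical. The third variant passes raw pointers obtained from `data()`, which sidesteps the iterator type entirely; in some standard library implementations the `std::array` iterator is a class type rather than a plain pointer, so this may or may not change the generated code.

```cpp
#include <array>
#include <cstddef>
#include <numeric>

// Index-based loop, as in the naive version.
template <std::size_t N>
double dot_indexed(const std::array<double, N>& a,
                   const std::array<double, N>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < N; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// std::inner_product over the array's own iterators.
template <std::size_t N>
double dot_iterators(const std::array<double, N>& a,
                     const std::array<double, N>& b) {
    return std::inner_product(a.cbegin(), a.cend(), b.cbegin(), 0.0);
}

// std::inner_product over raw pointers into the contiguous storage.
template <std::size_t N>
double dot_pointers(const std::array<double, N>& a,
                    const std::array<double, N>& b) {
    const double* first = a.data();
    return std::inner_product(first, first + N, b.data(), 0.0);
}
```

All three add the products in the same order, so they return identical values; only the generated machine code, and hence the timing, can differ.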

Answer 1

As suggested, I did some measurements. For the code below, compiled with VS2013 for x64 in release mode and executed on a Win8.1 machine with an i7-2640M, the algorithm version is consistently slower by about 20% (15.6-15.7s vs 12.9-13.1s). The relative difference also stays roughly constant over two orders of magnitude for N and REPS.

So I guess the answer is: Using standard library algorithms CAN hurt performance.

It would still be interesting to know whether this is a general problem or whether it is specific to my platform, compiler and benchmark. You are welcome to post your own results or comment on the benchmark.

#include <iostream>
#include <numeric>
#include <array>
#include <chrono>
#include <cstdlib>

#define USE_STD_ALGORITHM // comment out to time the hand-written loop instead

using namespace std;
using namespace std::chrono;

static const size_t N = 10000000; // size of the arrays
static const size_t REPS = 1000;  // number of repetitions

array<double, N> a1;
array<double, N> a2;

int main(){
    srand(10);
    for (size_t i = 0; i < N; ++i) {
        a1[i] = static_cast<double>(rand())*0.01;
        a2[i] = static_cast<double>(rand())*0.01;
    }

    double res = 0.0;
    auto start = high_resolution_clock::now();
    for (size_t z = 0; z < REPS; z++) {
#ifdef USE_STD_ALGORITHM
        res = std::inner_product(a1.begin(), a1.end(), a2.begin(), res);
#else
        for (size_t t = 0; t < N; ++t) {
            res += a1[t] * a2[t];
        }
#endif
    }
    auto end = high_resolution_clock::now();

    std::cout << res << "  "; // <-- necessary, so that loop isn't optimized away
    std::cout << duration_cast<milliseconds>(end - start).count() <<" ms"<< std::endl;

}
/* 
 * Update: Results (Ubuntu 14.04, Haswell); columns: printed result, elapsed time
 * STL algorithm:
 * g++-4.8-2    -march=native -std=c++11 -O3 main.cpp               : 1.15287e+24  3551 ms
 * g++-4.8-2    -march=native -std=c++11 -ffast-math -O3 main.cpp   : 1.15287e+24  3567 ms
 * clang++-3.5  -march=native -std=c++11 -O3 main.cpp               : 1.15287e+24  9378 ms
 * clang++-3.5  -march=native -std=c++11 -ffast-math -O3 main.cpp   : 1.15287e+24  8505 ms
 *
 * loop:
 * g++-4.8-2    -march=native -std=c++11 -O3 main.cpp               : 1.15287e+24  3543 ms
 * g++-4.8-2    -march=native -std=c++11 -ffast-math -O3 main.cpp   : 1.15287e+24  3551 ms
 * clang++-3.5  -march=native -std=c++11 -O3 main.cpp               : 1.15287e+24  9613 ms
 * clang++-3.5  -march=native -std=c++11 -ffast-math -O3 main.cpp   : 1.15287e+24  8642 ms
 */

EDIT:
I did a quick check with g++-4.9.2 and clang++-3.5 with -O3 and -std=c++11 on a Fedora 21 VirtualBox VM on the same machine, and apparently those compilers don't have the same problem (the time is almost the same for both versions). However, gcc's version is about twice as fast as clang's (7.5s vs 14s).
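For anyone who wants to repeat the measurement, the following is a hedged sketch of a harness that times both variants in one binary instead of toggling a macro. `time_reps`, `loop_dot` and `algo_dot` are hypothetical names, and `std::chrono::steady_clock` is swapped in for `high_resolution_clock` because it is guaranteed to be monotonic. As in the benchmark above, each repetition feeds the previous result back in as the initial value so that the work cannot be hoisted out of the loop.

```cpp
#include <array>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <utility>

// Hand-written loop, continuing from an initial value.
template <std::size_t N>
double loop_dot(const std::array<double, N>& a,
                const std::array<double, N>& b, double init) {
    for (std::size_t i = 0; i < N; ++i) {
        init += a[i] * b[i];
    }
    return init;
}

// Standard algorithm, continuing from an initial value.
template <std::size_t N>
double algo_dot(const std::array<double, N>& a,
                const std::array<double, N>& b, double init) {
    return std::inner_product(a.begin(), a.end(), b.begin(), init);
}

// Hypothetical helper: calls res = f(res) `reps` times and returns
// the final result together with the elapsed time in milliseconds.
template <typename F>
std::pair<double, long long> time_reps(F f, std::size_t reps) {
    double res = 0.0;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t r = 0; r < reps; ++r) {
        res = f(res);
    }
    const auto stop = std::chrono::steady_clock::now();
    const auto ms =
        std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
    return std::make_pair(res, static_cast<long long>(ms.count()));
}
```

A call would look like `time_reps([&](double r) { return algo_dot(a1, a2, r); }, REPS)`, and printing the returned result keeps the compiler from discarding the computation. Running each variant more than once and in both orders helps to separate real differences from cache and warm-up effects.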

solution1
1 ACCEPTED 2015-04-13 16:42:36
