Benchmarking adding elements to vector when size is known

Question

I have made a tiny benchmark for adding new elements to vector which I know its size.

Code:

struct foo{
    foo() = default;
    foo(double x, double y, double z) :x(x), y(y), z(y){

    }
    double x;
    double y;
    double z;
};

void resize_and_index(){
    std::vector<foo> bar(1000);
    for (auto& item : bar){
        item.x = 5;
        item.y = 5;
        item.z = 5;
    }
}

void reserve_and_push(){
    std::vector<foo> bar;
    bar.reserve(1000);
    for (size_t i = 0; i < 1000; i++)
    {
        bar.push_back(foo(5, 5, 5));
    }
}

void reserve_and_push_move(){
    std::vector<foo> bar;
    bar.reserve(1000);
    for (size_t i = 0; i < 1000; i++)
    {
        bar.push_back(std::move(foo(5, 5, 5)));
    }
}

void reserve_and_embalce(){
    std::vector<foo> bar;
    bar.reserve(1000);
    for (size_t i = 0; i < 1000; i++)
    {
        bar.emplace_back(5, 5, 5);
    }
}

I have then call each method 100000 times.

results:

resize_and_index: 176 mSec 
reserve_and_push: 560 mSec
reserve_and_push_move: 574 mSec 
reserve_and_embalce: 143 mSec

Calling code :

const size_t repeate = 100000;
auto start_time = clock();
for (size_t i = 0; i < repeate; i++)
{
    resize_and_index();
}
auto stop_time = clock();
std::cout << "resize_and_index: " << (stop_time - start_time) / double(CLOCKS_PER_SEC) * 1000 << " mSec" << std::endl;


start_time = clock();
for (size_t i = 0; i < repeate; i++)
{
    reserve_and_push();
}
stop_time = clock();
std::cout << "reserve_and_push: " << (stop_time - start_time) / double(CLOCKS_PER_SEC) * 1000 << " mSec" << std::endl;


start_time = clock();
for (size_t i = 0; i < repeate; i++)
{
    reserve_and_push_move();
}
stop_time = clock();
std::cout << "reserve_and_push_move: " << (stop_time - start_time) / double(CLOCKS_PER_SEC) * 1000 << " mSec" << std::endl;


start_time = clock();
for (size_t i = 0; i < repeate; i++)
{
    reserve_and_embalce();
}
stop_time = clock();
std::cout << "reserve_and_embalce: " << (stop_time - start_time) / double(CLOCKS_PER_SEC) * 1000 << " mSec" << std::endl;

My questions:

Why did I get these results? what make emplace_back superior to others?
Why does std::move make the performance slightly worse ?

Benchmarking conditions:

Compiler: VS.NET 2013 C++ compiler (/O2 Max speed Optimization)
OS : Windows 8
Processor: Intel Core i7-410U CPU @ 2.00 GHZ

Another Machine (By horstling ):

VS2013, Win7, Xeon 1241 @ 3.5 Ghz

resize_and_index: 144 mSec
reserve_and_push: 199 mSec
reserve_and_push_move: 201 mSec
reserve_and_embalce: 111 mSec

Answer 1

Why did I get these results? what make emplace_back superior to others?

You got these results because you benchmarked it and you had to get some results :).
Emplace back in this case is doing better because its directly creating/constructing the object at the memory location reserved by the vector. So, it does not have to first create an object (temp maybe) outside then copy/move it to the vector's reserved location thereby saving some overhead.

Why does std::move make the performance slightly worse ?

If you are asking why its more costly than emplace then it would be because it has to 'move' the object. In this case the move operation could have been very well reduced to copy. So, it must be the copy operation that is taking more time, since this copy was not happening for the emplace case.
You can try digging the assembly code generated and see what exactly is happening.
Also, I dont think comparing the rest of the functions against 'resize_and_index' is fair. There is a possibility that objects are being instantiated more than once in other cases.

Answer 2

First, reserve_and_push and reserve_and_push_move are semantically equivalent. The temporary foo you construct is already an rvalue (the rvalue reference overload of push_back is already used); wrapping it in a move does not change anything, except possibly obscure the code for the compiler, which could explain the slight performance loss. (Though I think it more likely to be noise.) Also, your class has identical copy and move semantics.

Second, the resize_and_index variant might be more optimal if you write the loop's body as

item = foo(5, 5, 5);

although only profiling will show that. The point is that the compiler might generate suboptimal code for the three separate assignments.

Third, you should also try this:

std::vector<foo> v(100, foo(5, 5, 5));

Fourth, this benchmark is extremely sensitive to the compiler realizing that none of these functions actually do anything and simply optimizing their complete bodies out.

Now for analysis. Note that if you really want to know what's going on, you'll have to inspect the assembly the compiler generates.

The first version does the following:

Allocate space for 1000 foos.
Loop and default-construct each one.
Loop over all elements and reassign the values.

The main question here is whether the compiler realizes that the constructor in the second step is a no-op and that it can omit the entire loop. Assembly inspection can show that.

The second and third versions do the following:

Allocate space for 1000 foos.
1000 times:
1. construct a temporary foo object
2. ensure there is still enough allocated space
3. move (for your type, equivalent to a copy, since your class doesn't have special move semantics) the temporary into the allocated space.
4. Increment the vector's size.

There is a lot of room for optimization here for the compiler. If it inlines all operations into the same function, it could realize that the size check is superfluous. It could then realize that your move constructor cannot throw, which means the entire loop is uninterruptible, which means it could merge all the increments into one assignment. If it doesn't inline the push_back, it has to place the temporary in memory and pass a reference to it; there's a number of ways it could special-case this to be more efficient, but it's unlikely that it will.

But unless the compiler does some of these, I would expect this version to be a lot slower than the others.

The fourth version does the following:

Allocate enough space for 1000 foos.
1000 times:
1. ensure there is still enough allocated space
2. create a new object in the allocated space, using the constructor with the three arguments
3. increment the size

This is similar to the previous, with two differences: first, the way the MS standard library implements push_back, it has to check whether the passed reference is a reference into the vector itself; this greatly increases complexity of the function, inhibiting inlining. emplace_back does not have this problem. Second, emplace_back gets three simple scalar arguments instead of a reference to a stack object; if the function doesn't get inlined, this is significantly more efficient to pass.

Unless you work exclusively with Microsoft's compiler, I would strongly recommend you compare with other compilers (and their standard libraries). I also think that my suggested version would beat all four of yours, but I haven't profiled this.

In the end, unless the code is really performance-sensitive, you should write the version that is most readable. (That's another place where my version wins, IMO.)

Answer 3

I am not sure if discrepancy between reserve_and_push and reserve_and_push_move is just noise. I did a simple test using g++ 4.8.4 and noticed increase in executable size/additional assembly instructions, even though theoretically in this case the std::move can be ignored by the complier.

Benchmarking adding elements to vector when size is known

Question

3 answers

solution1
1 2015-12-04 09:28:16

solution2
1 ACCPTED 2015-12-04 09:43:45

solution3
0 2015-12-04 13:31:40

Benchmarking adding elements to vector when size is known

Question

3 answers

solution1 1 2015-12-04 09:28:16

solution2 1 ACCPTED 2015-12-04 09:43:45

solution3 0 2015-12-04 13:31:40

solution1
1 2015-12-04 09:28:16

solution2
1 ACCPTED 2015-12-04 09:43:45

solution3
0 2015-12-04 13:31:40