The performance at iteration (cache miss)

Question

I have found that iterating over a vector is much faster with std::vector<T>::iterator than with an index variable (i) that counts up.

Thanks to a few comments, here is some additional information: (1) I use the Visual Studio C++ compiler; (2) I compiled in release mode with -O2 optimization.


If the variable i is incremented, the iteration takes

5875ms:

std::vector<Data> vec(MAX_DATA);
stopWatch.start();
for (unsigned i = 0U; i < MAX_DATA; ++i) {
    vec[i].x = 0;
    vec[i].y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data are stored in memory next to each other");

or 5723ms:

std::vector<Data*> vec2;
for (unsigned i = 0U; i < MAX_DATA; ++i)
    vec2.push_back(new Data());

stopWatch.start();
for (unsigned i = 0U; i < MAX_DATA; ++i) {
    vec2[i]->x = 0;
    vec2[i]->y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data is in memory at a random position");

If std::vector<Data>::iterator is used to iterate (through a range-based for loop), the iteration takes

29ms:

std::vector<Data> vec(MAX_DATA);

stopWatch.start();
for (auto& it : vec) {
    it.x = 0;
    it.y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data are stored in memory next to each other");

or 110ms:

std::vector<Data*> vec2;
for (unsigned i = 0U; i < MAX_DATA; ++i)
    vec2.push_back(new Data());

stopWatch.start();
for (auto& it : vec2) {
    it->x = 0;
    it->y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data is in memory at a random position");

Why is the other iteration so much faster?

What surprises me is that, when indexing with i, the loop over objects scattered across memory is just as fast as the loop over objects stored contiguously. Contiguous storage should reduce cache misses, and that seems to happen when iterating with std::vector<Data>::iterator, so why not with the index loop? Or am I mistaken, and the difference between 29 ms and 110 ms is not caused by cache misses at all?

The entire program looks like this:

#include <iostream>
#include <chrono>
#include <vector>
#include <string>
#include <cstdlib> // for system()

class StopWatch
{
public:
    void start() {
        this->t1 = std::chrono::high_resolution_clock::now();
    }

    void stop() {
        this->t2 = std::chrono::high_resolution_clock::now();
        this->diff = t2 - t1;
    }

    void printSpanAsMs(std::string startText = "time span") {
        long diffAsMs = std::chrono::duration_cast<std::chrono::milliseconds>
        (diff).count();
        std::cout << startText << ": " << diffAsMs << "ms" << std::endl;
    }
private:
    std::chrono::high_resolution_clock::time_point t1, t2;
    std::chrono::high_resolution_clock::duration   diff;
} stopWatch;

struct Data {
    int x, y;
};

const unsigned long MAX_DATA = 20000000;

void test1()
{
    std::cout << "1. Test \n Use i to iterate through the vector" << 
    std::endl;

    std::vector<Data> vec(MAX_DATA);
    stopWatch.start();
    for (unsigned i = 0U; i < MAX_DATA; ++i) {
        vec[i].x = 0;
        vec[i].y = 0;
    }
    stopWatch.stop();
    stopWatch.printSpanAsMs("The data are stored in memory next to each other");

    //////////////////////////////////////////////////

    std::vector<Data*> vec2;
    for (unsigned i = 0U; i < MAX_DATA; ++i)
        vec2.push_back(new Data());

    stopWatch.start();
    for (unsigned i = 0U; i < MAX_DATA; ++i) {
        vec2[i]->x = 0;
        vec2[i]->y = 0;
    }
    stopWatch.stop();
    stopWatch.printSpanAsMs("The data is in memory at a random position");

    for (unsigned i = 0U; i < MAX_DATA; ++i) {
        delete vec2[i];
        vec2[i] = nullptr;
    }
}

void test2()
{
    std::cout << "2. Test \n Use std::vector<T>::iterator to iterate through the vector" << std::endl;

    std::vector<Data> vec(MAX_DATA);

    stopWatch.start();
    for (auto& it : vec) {
        it.x = 0;
        it.y = 0;
    }
    stopWatch.stop();
    stopWatch.printSpanAsMs("The data are stored in memory next to each other");

    //////////////////////////////////////////////////

    std::vector<Data*> vec2;
    for (unsigned i = 0U; i < MAX_DATA; ++i)
        vec2.push_back(new Data());

    stopWatch.start();
    for (auto& it : vec2) {
        it->x = 0;
        it->y = 0;
    }
    stopWatch.stop();
    stopWatch.printSpanAsMs("The data is in memory at a random position");

    for (auto& it : vec2) {
        delete it;
        it = nullptr;
    }
}

int main()
{
    test1();
    test2();

    system("PAUSE");
    return 0;
}

Answer 1

Why is the other iteration so much faster?

The reason is that MSVC 2017 cannot optimize it properly.

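In the first case it fails to optimize the loop. The listing below shows it reloading a pointer from [rcx] (presumably the vector's data pointer) before every 4-byte store: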

for (unsigned i = 0U; i < MAX_DATA; ++i) {
    vec[i].x = 0;
    vec[i].y = 0;
}

Generated code (live demo):

        xor      r9d, r9d
        mov      eax, r9d
$LL4@test1:
        mov      rdx, QWORD PTR [rcx]
        lea      rax, QWORD PTR [rax+16]
        mov      DWORD PTR [rax+rdx-16], r9d
        mov      rdx, QWORD PTR [rcx]
        mov      DWORD PTR [rax+rdx-12], r9d
        mov      rdx, QWORD PTR [rcx]
        mov      DWORD PTR [rax+rdx-8], r9d
        mov      rdx, QWORD PTR [rcx]
        mov      DWORD PTR [rax+rdx-4], r9d
        sub      r8, 1
        jne      SHORT $LL4@test1

Replacing unsigned i with size_t i, or hoisting the indexed access into a reference, doesn't help (demo).
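For reference, the two variants mentioned above might look roughly like the following. This is only a sketch: the function names zeroWithSizeT and zeroWithHoistedRef are made up for illustration, and the loops are bounded by vec.size() rather than the MAX_DATA constant from the question.

```cpp
#include <cstddef>
#include <vector>

struct Data {
    int x, y;
};

// Variant A (hypothetical name): the loop from the question,
// but with a std::size_t index instead of unsigned.
void zeroWithSizeT(std::vector<Data>& vec) {
    for (std::size_t i = 0; i < vec.size(); ++i) {
        vec[i].x = 0;
        vec[i].y = 0;
    }
}

// Variant B (hypothetical name): the indexed access is hoisted
// into a reference, so operator[] is called once per element.
void zeroWithHoistedRef(std::vector<Data>& vec) {
    for (std::size_t i = 0; i < vec.size(); ++i) {
        Data& d = vec[i];
        d.x = 0;
        d.y = 0;
    }
}
```

Both functions do the same thing as the original loop; whether a given compiler version optimizes them any differently has to be checked in its generated code.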

The only thing that helps is using an iterator like you have already found out:

for (auto& it : vec) {
    it.x = 0;
    it.y = 0;
}

Generated code ( live demo ):

        xor      ecx, ecx
        npad     2
$LL4@test2:
        mov      QWORD PTR [rax], rcx
        add      rax, 8
        cmp      rax, rdx
        jne      SHORT $LL4@test2

Clang just calls memset in both cases.
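If you don't want to depend on the optimizer recognizing the loop, the same effect can be spelled out by hand. The sketch below is not taken from the question; the function names zeroWithMemset and zeroWithFill are made up, and the memset version relies on Data being trivially copyable (checked by the static_assert).

```cpp
#include <algorithm>
#include <cstring>
#include <type_traits>
#include <vector>

struct Data {
    int x, y;
};

// memset on an object is only safe for trivially copyable types.
static_assert(std::is_trivially_copyable<Data>::value,
              "Data must be trivially copyable for memset");

// Hypothetical helper: zero the whole contiguous buffer in one call.
void zeroWithMemset(std::vector<Data>& vec) {
    if (!vec.empty()) {
        std::memset(vec.data(), 0, vec.size() * sizeof(Data));
    }
}

// Hypothetical helper: the same result through a standard algorithm,
// which leaves the choice of implementation to the library and compiler.
void zeroWithFill(std::vector<Data>& vec) {
    std::fill(vec.begin(), vec.end(), Data{0, 0});
}
```

Neither version applies to the std::vector<Data*> case, since the objects there are not stored in one contiguous buffer.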

The moral of the story: look at the generated code if you care about performance. Report issues to the vendor.
