Why is access in a for loop faster than access in a ranged-for at -O0 but not at -O3?

Question

I'm learning about performance in C++ (and C++11). I need good performance in both Debug and Release mode, because I spend time both debugging and running the program.

I'm surprised by these two tests and by how much the results change with different compiler optimization flags.

Test iterator 1:

Optimization 0 (-O0): faster.
Optimization 3 (-O3): slower.

Test iterator 2:

Optimization 0 (-O0): slower.
Optimization 3 (-O3): faster.

P.S. I use the following clock code.

Test iterator 1:

void test_iterator_1()
{
    int z = 0;
    int nv = 1200000000;
    std::vector<int> v(nv);

    size_t count = v.size();

    for (unsigned int i = 0; i < count; ++i) {
        v[i] = 1;
    }
}

Test iterator 2:

void test_iterator_2()
{
    int z = 0;
    int nv = 1200000000;
    std::vector<int> v(nv);

    for (int& i : v) {
        i = 1;
    }
}

UPDATE: The problem is still the same, but for the ranged-for at -O3 the difference is small. So the for loop in test 1 is the best overall.

UPDATE 2 : Results:

With -O3:

t1: 80 units
t2: 74 units

With -O0:

t1: 287 units
t2: 538 units

UPDATE 3: The code! Compile with: g++ -std=c++11 test.cpp -O0 (and then with -O3)
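The clock code and full test file are not reproduced in this thread. As a rough stand-in, here is a minimal sketch of a timing harness, assuming std::chrono::steady_clock is an acceptable clock. The names time_ms, fill_indexed, fill_ranged and compare_loops are hypothetical (they are not from the original post), and the element count is left as a parameter so the test can run with far fewer than 1200000000 elements.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: runs f once and returns the elapsed time in milliseconds.
template <typename F>
double time_ms(F f)
{
    auto start = std::chrono::steady_clock::now();
    f();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Indexed loop, in the style of test 1. Returning the vector keeps the
// stores observable, so the optimizer cannot simply discard them.
std::vector<int> fill_indexed(std::size_t n)
{
    std::vector<int> v(n);
    for (std::size_t i = 0; i < v.size(); ++i) {
        v[i] = 1;
    }
    return v;
}

// Ranged-for loop, in the style of test 2.
std::vector<int> fill_ranged(std::size_t n)
{
    std::vector<int> v(n);
    for (int& x : v) {
        x = 1;
    }
    return v;
}

// Times both loops on n elements and returns (indexed_ms, ranged_ms).
std::pair<double, double> compare_loops(std::size_t n)
{
    std::vector<int> a, b;
    double t1 = time_ms([&] { a = fill_indexed(n); });
    double t2 = time_ms([&] { b = fill_ranged(n); });
    assert(a == b);  // both loops must produce the same contents
    return std::make_pair(t1, t2);
}
```

Each timing here includes the vector allocation, as in the original tests. A single run is noisy, so repeating compare_loops several times and comparing the fastest runs is likely to give steadier numbers.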

Answer 1

Your first test is actually setting the value of each element in the vector to 1.

Your second test (as originally posted, with int i instead of int& i) is setting the value of a copy of each element in the vector to 1 (the original vector is left unchanged).

When you optimize, the second loop is more than likely removed entirely, as it is basically doing nothing.

If you want the second loop to actually set the value:

for (int& i : v) // notice the & 
{
    i = 1;
}

Once you make that change, your loops are likely to produce assembly code that is almost identical.
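To make the copy-versus-reference difference concrete, here is a minimal sketch. The function names set_by_value, set_by_reference and check_copy_vs_reference are made up for illustration.

```cpp
#include <cassert>
#include <vector>

// Iterates by value: each x is a copy, so the vector is left untouched.
void set_by_value(std::vector<int>& v)
{
    for (int x : v) {
        x = 1;    // modifies only the local copy
        (void)x;  // silences "variable set but not used" warnings
    }
}

// Iterates by reference: each x aliases an element, so the store is visible.
void set_by_reference(std::vector<int>& v)
{
    for (int& x : v) {
        x = 1;
    }
}

// Small self-check of the difference.
void check_copy_vs_reference()
{
    std::vector<int> v(4);  // four value-initialized (zero) ints
    set_by_value(v);
    assert(v[0] == 0 && v[3] == 0);  // unchanged
    set_by_reference(v);
    assert(v[0] == 1 && v[3] == 1);  // now modified
}
```

Because the by-value loop has no observable effect, an optimizing compiler is free to drop it, which would make it look faster than it really is.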

As a side note, if you wanted to initialize the entire vector to a single value, the better way to do it is:

std::vector<int> v(SIZE, 1);
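For completeness, a short sketch of the fill-on-construction form next to two standard alternatives for a vector that already exists (std::fill and vector::assign). The wrapper names make_ones, reset_to_ones and resize_to_ones are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Constructs a vector of n elements, each initialized to 1.
std::vector<int> make_ones(std::size_t n)
{
    return std::vector<int>(n, 1);
}

// Overwrites every element of an existing vector with 1; the size is unchanged.
void reset_to_ones(std::vector<int>& v)
{
    std::fill(v.begin(), v.end(), 1);
}

// Replaces the contents with n elements equal to 1.
void resize_to_ones(std::vector<int>& v, std::size_t n)
{
    v.assign(n, 1);
    assert(v.size() == n);  // postcondition of assign
}
```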

EDIT

The assembly is fairly long (100+ lines), so I won't post it all, but a couple of things to note:

Version 1 stores a value for count and increments i, testing against count on each iteration. Version 2 uses iterators (basically the same as std::for_each(v.begin(), v.end(), ...)). So the loop-maintenance code is very different: version 2 has more setup, but less work per iteration.
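As a rough illustration of what "uses iterators" means here, the sketch below spells out approximately what a ranged-for over a vector expands to, alongside the std::for_each form. The function names are hypothetical, and the expansion is simplified.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Approximately what `for (int& x : v) { x = 1; }` expands to:
// begin() and end() are evaluated once, then the loop compares,
// dereferences and increments the iterator.
void fill_with_iterators(std::vector<int>& v)
{
    for (std::vector<int>::iterator it = v.begin(), last = v.end();
         it != last; ++it) {
        int& x = *it;
        x = 1;
    }
}

// The same traversal expressed with std::for_each and a lambda.
void fill_with_for_each(std::vector<int>& v)
{
    std::for_each(v.begin(), v.end(), [](int& x) { x = 1; });
}

// Self-check: all three forms leave the vector in the same state.
void check_equivalence()
{
    std::vector<int> a(8), b(8), c(8);
    for (int& x : a) { x = 1; }
    fill_with_iterators(b);
    fill_with_for_each(c);
    assert(a == b && b == c);
}
```

Without optimization, the iterator comparison, dereference and increment may each remain a real function call, which is one plausible reason an unoptimized ranged-for can be slower on some standard library implementations.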

Version 1 (just the meat of the loop)

mov eax, DWORD PTR _i$2[ebp]
push    eax
lea ecx, DWORD PTR _v$[ebp]
call    ??A?$vector@HV?$allocator@H@std@@@std@@QAEAAHI@Z ; std::vector<int,std::allocator<int> >::operator[]
mov DWORD PTR [eax], 1

Version 2 (just the meat of the loop)

mov eax, DWORD PTR _i$2[ebp]
mov DWORD PTR [eax], 1

When they get optimized, this all changes and, other than the ordering of a few instructions, the output is almost identical.

Version 1 (optimized)

    push    ebp
    mov ebp, esp
    sub esp, 12                 ; 0000000cH
    push    ecx
    lea ecx, DWORD PTR _v$[ebp]
    mov DWORD PTR _v$[ebp], 0
    mov DWORD PTR _v$[ebp+4], 0
    mov DWORD PTR _v$[ebp+8], 0
    call    ?resize@?$vector@HV?$allocator@H@std@@@std@@QAEXI@Z ; std::vector<int,std::allocator<int> >::resize
    mov ecx, DWORD PTR _v$[ebp+4]
    mov edx, DWORD PTR _v$[ebp]
    sub ecx, edx
    sar ecx, 2 ; this is the only differing instruction
    test    ecx, ecx
    je  SHORT $LN3@test_itera
    push    edi
    mov eax, 1
    mov edi, edx
    rep stosd
    pop edi
$LN3@test_itera:
    test    edx, edx
    je  SHORT $LN21@test_itera
    push    edx
    call    DWORD PTR __imp_??3@YAXPAX@Z
    add esp, 4
$LN21@test_itera:
    mov esp, ebp
    pop ebp
    ret 0

Version 2 (optimized)

    push    ebp
    mov ebp, esp
    sub esp, 12                 ; 0000000cH
    push    ecx
    lea ecx, DWORD PTR _v$[ebp]
    mov DWORD PTR _v$[ebp], 0
    mov DWORD PTR _v$[ebp+4], 0
    mov DWORD PTR _v$[ebp+8], 0
    call    ?resize@?$vector@HV?$allocator@H@std@@@std@@QAEXI@Z ; std::vector<int,std::allocator<int> >::resize
    mov edx, DWORD PTR _v$[ebp]
    mov ecx, DWORD PTR _v$[ebp+4]
    mov eax, edx
    cmp edx, ecx
    je  SHORT $LN1@test_itera
$LL33@test_itera:
    mov DWORD PTR [eax], 1
    add eax, 4
    cmp eax, ecx
    jne SHORT $LL33@test_itera
$LN1@test_itera:
    test    edx, edx
    je  SHORT $LN47@test_itera
    push    edx
    call    DWORD PTR __imp_??3@YAXPAX@Z
    add esp, 4
$LN47@test_itera:
    mov esp, ebp
    pop ebp
    ret 0

Answer 2

Do not worry about how much time each operation takes; that falls squarely under Donald Knuth's "premature optimization is the root of all evil" quote. Write simple, easy-to-understand programs: the time you spend writing the program (and reading it next week to tweak it, or to find out why the &%$# it is giving crazy results) is much more valuable than any computer time wasted. Just compare your weekly income to the price of an off-the-shelf machine, and think about how much of your time it would take to shave a few minutes off the compute time.

Do worry when you have measurements showing that the performance isn't adequate. Then you must measure where your runtime (or memory, or whatever other resource is critical) is being spent, and see how to improve that. The (sadly out of print) book "Writing Efficient Programs" by Jon Bentley (much of it also appears in his "Programming Pearls") is an eye-opener, and a must-read for any budding programmer.

Answer 3

Optimization is pattern matching: the compiler has a number of different situations it can recognize and optimize. If you change the code in a way that makes the pattern unrecognizable to the compiler, the effect of the optimization suddenly vanishes.

So what you are witnessing is nothing more or less than this: the ranged-for loop produces more bloated code without optimization, but in that form the optimizer is able to recognize a pattern that it cannot recognize in the iterator-free case.

In any case, if you are curious, you should take a look at the generated assembly code (compile with the -S option).
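A minimal sketch of a file set up for that kind of inspection follows. The file name loops.cpp and the function names are hypothetical. Taking the vector by reference keeps allocation code out of the listing, so the two loops are easier to find.

```cpp
// loops.cpp (hypothetical file name)
// Generate assembly listings with, for example:
//   g++ -std=c++11 -O0 -S loops.cpp -o loops_O0.s
//   g++ -std=c++11 -O3 -S loops.cpp -o loops_O3.s
#include <cassert>
#include <cstddef>
#include <vector>

// Indexed loop: look for the call to operator[] in the -O0 listing.
void indexed(std::vector<int>& v)
{
    std::size_t count = v.size();
    for (std::size_t i = 0; i < count; ++i) {
        v[i] = 1;
    }
}

// Ranged-for loop: look for the iterator calls in the -O0 listing.
void ranged(std::vector<int>& v)
{
    for (int& x : v) {
        x = 1;
    }
}

// Optional sanity check; remove it to keep the listings short.
void self_check()
{
    std::vector<int> a(16), b(16);
    indexed(a);
    ranged(b);
    assert(a == b);
}
```

Comparing the two .s files side by side should show whether the loops still differ once optimization is enabled for your compiler version.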

3 answers

solution1
7 2014-02-11 19:20:33

solution2
0 2014-02-11 19:44:06

solution3
0 2014-02-11 20:19:18
