Branch prediction optimizations

Question

I am trying to understand what kind of magic optimizations gcc/clang does with this code.

#include <random>
#include <iostream>

int main()
{
    std::random_device rd;
    std::mt19937 mt(rd());
    const unsigned arraySize = 100000;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = mt() % 256;

    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }
    std::cout << sum << std::endl;
}

and this code

#include <random>
#include <iostream>
#include <algorithm>
int main()
{
    std::random_device rd;
    std::mt19937 mt(rd());
    const unsigned arraySize = 100000;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = mt() % 256;

    std::sort(data, data + arraySize);
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }
    std::cout << sum << std::endl;
}

Basicly when I compiled it and run about 3 years ago, the second code was like 4x faster because of much better branch prediction. When I compile it and run now, it works in almost the same time, and I have no idea what kind of sorcery gcc/clang does.

Answer 1

Here is the output from gcc (using gcc.godbolt.org, with -O3)

.L4: //Inner loop
    movslq  (%rax), %rdx
    movq    %rdx, %rcx
    addq    %rsi, %rdx
    cmpl    $127, %ecx
    cmovg   %rdx, %rsi
    addq    $4, %rax
    cmpq    %rdi, %rax
    jne .L4

You can see it does the comparison "cmpl $127,$ecx", however after the comparison it does not branch. Instead it always adds (using "addq", in the line above the comparison) and then uses the result of the add depending on the comparison (thanks to the "cmovg" "conditional move" instruction).

It avoids branching in the inner loop so the performance is not depending on branch prediction. Because of this, it makes no difference to sort the input (as you do in the second example).

Branch prediction optimizations

Question

1 answers

solution1
2 ACCPTED 2015-03-16 12:28:15

Branch prediction optimizations

Question

1 answers

solution1 2 ACCPTED 2015-03-16 12:28:15

solution1
2 ACCPTED 2015-03-16 12:28:15