
strange behavior of x86 “cmp” instruction

Here is the code:

#include <iostream>
#include <time.h>
#include <stdlib.h> // for srand() and rand()

using namespace std;

#define ARR_LENGTH 1000000
#define TEST_NUM 0
typedef unsigned int uint;

uint arr[ARR_LENGTH];

uint inc_time(uint x) {
    uint y = 0, tm = clock();
    for (uint i = 0; i < x; i++) y++;
    return clock() - tm;
}

int main() {
    uint div = 0, mod = 0, tm = 0, overall = 0, inc_tm;
    srand(time(NULL));
    for (uint i = 0; i < ARR_LENGTH; i++) arr[i] = (uint)rand() + 2;

    tm = clock();
    for (uint i = 0; i < ARR_LENGTH - 1; i++)
        if (arr[i] % arr[i+1] != TEST_NUM) mod++;
    overall = clock() - tm;
    inc_tm = inc_time(mod);
    cout << "mods - " << mod << endl;
    cout << "Overall time - " << overall<< endl;
    cout << "   wasted on increment - " << inc_tm << endl;
    cout << "   wasted on condition - " << overall - inc_tm << endl << endl;

    tm = clock();
    for (uint i = 0; i < ARR_LENGTH - 1; i++)
        if (arr[i]/arr[i+1] != TEST_NUM) div++;
    overall = clock()-tm;
    inc_tm = inc_time(div);
    cout << "divs - " << div << endl;
    cout << "Overall time - " << overall << endl;
    cout << "   wasted on increment - " << inc_tm << endl;
    cout << "   wasted on condition - " << overall - inc_tm << endl << endl;

    return 0;
}

If you're using Visual Studio, just compile in DEBUG (not RELEASE) mode, and if you're using GCC, disable dead code elimination ( -fno-dce ); otherwise some parts of the code will not work.

So the question is: when you set the TEST_NUM constant to a non-zero value (say 5), both conditions (modulo and division) perform in approximately the same time, but when you set TEST_NUM to 0, the second condition performs slower (up to 3 times!). Why?

Here is the disassembly listing (image): http://img213.imageshack.us/slideshow/webplayer.php?id=wp000076.jpg

In the case of 0, the test instruction is used instead of cmp X, 0, but even if you patch cmp X, 5 (in the case of 5) to cmp X, 0, you'll see that it doesn't affect the modulo operation, yet it does affect the division operation.

Carefully watch how the operation counts and times change as you change the TEST_NUM constant.

Can anybody explain how this can happen?
Thanks.

In the case of TEST_NUM == 0, the first condition is rarely true. The branch predictor will recognize this and predict the condition as always false. This prediction will be correct in most cases, so the expensive mispredicted branch rarely needs to be executed.

Almost the same goes for the case TEST_NUM == 5: the first condition will rarely be true.

For the second condition and TEST_NUM == 0, the result of the division is zero whenever arr[i] < arr[i+1], which happens with a probability of about 0.5. This is the worst case for a branch predictor: the branch will be mispredicted in every second case on average. So, on average, each iteration costs half of the penalty of a mispredicted branch (depending on the architecture, this may be between 10 and 20 cycles).
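To make the "about 0.5" claim concrete, here is a minimal sketch (my addition, not part of the original answer) that simply counts how often each condition is true for the same kind of random data as in the question. It reuses the question's ARR_LENGTH, TEST_NUM and uint names; with TEST_NUM == 0 the modulo branch is taken almost every iteration (easy to predict), while the division branch is taken only when arr[i] >= arr[i+1], i.e. roughly half the time (hard to predict).

#include <iostream>
#include <stdlib.h> // srand(), rand()
#include <time.h>   // time()

using namespace std;

#define ARR_LENGTH 1000000
#define TEST_NUM 0
typedef unsigned int uint;

uint arr[ARR_LENGTH];

int main() {
    srand(time(NULL));
    for (uint i = 0; i < ARR_LENGTH; i++) arr[i] = (uint)rand() + 2;

    uint mod_taken = 0, div_taken = 0;
    for (uint i = 0; i < ARR_LENGTH - 1; i++) {
        // arr[i+1] rarely divides arr[i] evenly, so this is almost always true
        if (arr[i] % arr[i+1] != TEST_NUM) mod_taken++;
        // true only when arr[i] >= arr[i+1], i.e. about half the time
        if (arr[i] / arr[i+1] != TEST_NUM) div_taken++;
    }

    cout << "modulo condition taken:   " << (double)mod_taken / (ARR_LENGTH - 1) << endl;
    cout << "division condition taken: " << (double)div_taken / (ARR_LENGTH - 1) << endl;
    return 0;
}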

If you have TEST_NUM == 5, the second condition is now rarely true; the probability is about 0.1 (not quite sure here). This is much better "predictable". Typically the predictor will predict it as (almost) always false, with some random trues in between, but that depends on the innards of the processor. In any case, you pay the additional cycles for a mispredicted branch much less often, at worst in every fifth case.
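One way to check this diagnosis (again my addition, not something the answer proposes) is to count without a data-dependent branch. In the sketch below, the comparison result (0 or 1) is added directly, so there is no conditional jump whose outcome depends on the array contents; if misprediction is the cause, this version should take roughly the same time for TEST_NUM == 0 and TEST_NUM == 5. The function name is hypothetical, and the uint typedef comes from the question's code.

// Branchless counting of divisions whose result differs from test_num.
uint count_div_branchless(const uint *a, uint n, uint test_num) {
    uint cnt = 0;
    for (uint i = 0; i < n - 1; i++)
        cnt += (a[i] / a[i+1] != test_num); // adds 0 or 1, no branch on the data
    return cnt;
}

Timing count_div_branchless(arr, ARR_LENGTH, TEST_NUM) in place of the second loop should show the difference between the two TEST_NUM values shrinking; it is worth checking the DEBUG/-O0 disassembly, though, since an unoptimized build may still emit a conditional jump for the comparison.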
