
Does a CMP+JE consume more clock cycles than a single MUL?

I'm running on an x86 processor, but I believe my question is fairly general. I'm curious about the theoretical difference in clock cycles consumed by a CMP + JE sequence versus a single MUL operation.

In C pseudocode:

unsigned foo = 1;    /* must be 0 or 1 */
unsigned num = 0;

/* Method 1: CMP + JE */
if(foo == 1){
    num = 5;
}

/* Method 2: MUL */
num = foo*5;    /* num = 0 if foo = 0 */

Don't read too much into the pseudocode; it's there purely to illustrate the logic behind the two methods.

What I'm actually comparing are the following two sequences of instructions:

Method 1: CMP + JE

    MOV EAX, 1    ; FOO = 1 here, but can be set to 0
    MOV EBX, 0    ; NUM = 0

    CMP EAX, 1    ; if(foo == 1)
    JE  SUCCESS   ; enter branch
    JMP FINISH    ; skip the assignment

SUCCESS:
    MOV EBX, 5    ; num = 5

FINISH:

Method 2: MUL

    MOV EAX, 1    ; FOO = 1 here, but can be set to 0

    MOV  ECX, EAX     ; save a copy of FOO in ECX
    IMUL ECX, ECX, 5  ; result = foo*5 (MUL takes no immediate operand; IMUL does)
    MOV  EBX, ECX     ; num = result = foo*5

It seems that the MUL version (4 instructions in total) is more compact than the CMP + JE version (6 instructions in total), but are clock cycles consumed equally across instructions -- i.e., does every instruction take the same number of clock cycles to complete?

If the actual number of clock cycles consumed depends on the machine, is the single MUL typically faster than the branching approach on most processors, since it requires fewer instructions in total?

Modern CPU performance is much more complicated than just counting the number of cycles for each instruction. You need to take all of the following into account (at least):

  • Branch prediction
  • Instruction reordering
  • Register renaming
  • Instruction cache hits/misses
  • Data cache hits/misses
  • TLB misses/page faults

All of these will be heavily influenced by the surrounding code.

So essentially, it's almost impossible to perform a micro-benchmark like this and obtain a useful result!
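
To see just how fragile such a measurement is, here is a minimal C sketch of one (the function names, the iteration count, and the LCG constants are all arbitrary choices of mine, and an optimizing compiler may well turn the if into a branchless CMOV, in which case the "branch" loop doesn't contain a branch at all):

#include <stdio.h>
#include <time.h>

/* The two methods under test. */
static unsigned method_branch(unsigned foo)
{
    unsigned num = 0;
    if (foo == 1)        /* typically compiles to CMP + Jcc (or a branchless CMOV) */
        num = 5;
    return num;
}

static unsigned method_mul(unsigned foo)
{
    return foo * 5;      /* typically compiles to IMUL or LEA -- no branch */
}

int main(void)
{
    enum { N = 100000000 };
    volatile unsigned sink = 0;  /* volatile: keep the loops from being optimized away */
    unsigned state = 12345u;
    clock_t t0, t1, t2;

    t0 = clock();
    for (unsigned i = 0; i < N; i++) {
        state = state * 1664525u + 1013904223u;    /* cheap LCG */
        sink = method_branch((state >> 16) & 1u);  /* pseudo-random 0/1 defeats the predictor */
    }
    t1 = clock();

    state = 12345u;
    for (unsigned i = 0; i < N; i++) {
        state = state * 1664525u + 1013904223u;    /* same generator, same overhead */
        sink = method_mul((state >> 16) & 1u);
    }
    t2 = clock();

    printf("branch: %.3f s   mul: %.3f s\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}

Even then, inspect the generated assembly before trusting the numbers: if the compiler removed the branch, or inlined and hoisted the calls, the two loops may be measuring the same thing.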

However, if I had to guess, I'd say that the code without the JE will be more efficient in general, as it eliminates the branch, which simplifies the branch-prediction behaviour.

Typically, on a modern x86 processor, both the CMP and the MUL instruction will occupy an integer execution unit for one cycle (CMP is essentially a SUB that throws away the result and only updates the flags register). However, modern x86 processors are also pipelined, superscalar, and out-of-order, which means that performance depends on much more than this underlying cycle cost alone.

If the branch cannot be predicted well, then the branch misprediction penalty will swamp other factors and the MUL version will perform significantly better.

On the other hand, if the branch can be predicted well and you immediately use num in a subsequent calculation, then it's possible for the branching version to perform better in the average case. When the branch is predicted correctly, the CPU can start speculatively executing the next instruction using the predicted value of num before the result of the compare is even available; in the MUL case, any subsequent use of num has a true data dependency on the multiply, so it cannot execute until that result is available.
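
As a concrete (and deliberately simplified) sketch of that trade-off, consider an immediate consumer of num (these functions are hypothetical, and the comments describe typical behaviour, not a guarantee):

/* Illustrative only: which version wins depends on how predictable foo is
   and on what consumes num next. */

unsigned consume_branch(unsigned foo, unsigned other)
{
    unsigned num = 0;
    if (foo == 1)             /* predicted correctly => num is known early,  */
        num = 5;              /* so the add below can start speculatively    */
    return num + other;
}

unsigned consume_mul(unsigned foo, unsigned other)
{
    unsigned num = foo * 5;   /* true data dependency: the add below cannot  */
    return num + other;       /* begin until the multiplier produces num     */
}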
