I'm running an x86 processor, but I believe my question is pretty general. I'm curious about the theoretical difference in clock cycles consumed by a CMP + JE sequence versus a single MUL operation.
In C pseudocode:
unsigned foo = 1; /* must be 0 or 1 */
unsigned num = 0;
/* Method 1: CMP + JE*/
if (foo == 1) {
    num = 5;
}
/* Method 2: MUL */
num = foo*5; /* num = 0 if foo = 0 */
Don't read too much into the pseudocode; it's purely there to illustrate the logic behind the two methods.
What I'm actually comparing are the following two sequences of instructions:
Method 1: CMP + JE
MOV EAX, 1 ; FOO = 1 here, but can be set to 0
MOV EBX, 0 ; NUM = 0
CMP EAX, 1 ; if(foo == 1)
JE SUCCESS ; enter branch
JMP FINISH ; end program
SUCCESS:
MOV EBX, 5 ; num = 5
FINISH:
Method 2: MUL
MOV EAX, 1 ; FOO = 1 here, but can be set to 0
MOV ECX, EAX ; save copy of FOO to ECX
IMUL ECX, ECX, 5 ; result = foo*5 (IMUL, since one-operand MUL cannot take an immediate)
MOV EBX, ECX ; num = result = foo*5
It seems that a single MUL (4 total instructions) is more efficient than a CMP + JE (6 total instructions), but are clock cycles consumed equally for all instructions, i.e. is the number of clock cycles it takes to complete an instruction the same for any other instruction?
If the actual number of clock cycles consumed depends on the machine, is a single MUL typically faster than the branching approach on most processors, since it requires fewer total instructions?
Modern CPU performance is much more complicated than just counting the number of cycles for each instruction. You need to take all of the following into account (at least):
All of these will be heavily influenced by the surrounding code.
So essentially, it's almost impossible to perform a micro-benchmark like this and obtain a useful result!
However, if I had to guess, I'd say that the code without the JE will be more efficient in general, as it eliminates the branch, which simplifies the branch-prediction behaviour.
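To make the comparison concrete, here is a minimal C sketch of the two methods from the question plus a mask-based variant. The function names are hypothetical, and this is only an illustration: an optimizing compiler is free to turn any of the three into the same machine code (for example a conditional move), so the source form does not guarantee that a branch is or isn't emitted.

```c
/* Method 1: compare and branch. The compiler may still emit branchless code. */
unsigned select_branch(unsigned foo)
{
    unsigned num = 0;
    if (foo == 1) {
        num = 5;
    }
    return num;
}

/* Method 2: multiply. Only equivalent to Method 1 when foo is 0 or 1. */
unsigned select_mul(unsigned foo)
{
    return foo * 5u;
}

/* Variant: mask. 0u - foo is 0 when foo == 0 and all ones when foo == 1,
   so the AND yields 0 or 5 without a branch or a multiply. */
unsigned select_mask(unsigned foo)
{
    return (0u - foo) & 5u;
}
```

All three return the same value for foo equal to 0 or 1; which one is fastest depends on the compiler output and the surrounding code, as described above.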
Typically, on a modern x86 processor, both the CMP and the MUL instruction will occupy an integer execution unit for one cycle (CMP is essentially a SUB that throws away the result and just modifies the flags register), although the multiplier is pipelined and its result typically takes a few cycles to become available, versus one for CMP. However, modern x86 processors are also pipelined, superscalar and out-of-order, which means that the performance depends on more than just this underlying cycle cost alone.
If the branch cannot be predicted well, then the branch misprediction penalty will swamp other factors and the MUL version will perform significantly better.
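One rough way to observe this is to run both methods over an array of flags that is either all ones (easy to predict) or pseudo-random (hard to predict). The sketch below is an assumption-laden illustration, not a rigorous benchmark: all names are hypothetical, clock() is coarse, and at higher optimization levels the compiler may remove the branch from sum_branch entirely or vectorize both loops, in which case the two versions time the same.

```c
#include <stddef.h>
#include <time.h>

/* Fill flags[] with 0/1 values. A nonzero 'predictable' gives all ones;
   otherwise a small linear congruential generator supplies a pattern
   that a branch predictor is unlikely to learn. */
void fill_flags(unsigned *flags, size_t n, int predictable)
{
    unsigned state = 12345u;
    for (size_t i = 0; i < n; i++) {
        state = state * 1664525u + 1013904223u;
        flags[i] = predictable ? 1u : (state >> 16) & 1u;
    }
}

/* Method 1 applied to every element: compare and (possibly) branch. */
unsigned long sum_branch(const unsigned *flags, size_t n)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (flags[i] == 1) {
            total += 5;
        }
    }
    return total;
}

/* Method 2 applied to every element: multiply, no branch on the data. */
unsigned long sum_mul(const unsigned *flags, size_t n)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += flags[i] * 5u;
    }
    return total;
}

typedef unsigned long (*sum_fn)(const unsigned *, size_t);

/* CPU seconds spent in one call of fn. The sum is stored through *out
   so that the call's result is actually used. */
double time_sum(sum_fn fn, const unsigned *flags, size_t n, unsigned long *out)
{
    clock_t start = clock();
    *out = fn(flags, n);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

A caller would fill a large array once with predictable set to 1 and once with it set to 0, then compare time_sum(sum_branch, ...) against time_sum(sum_mul, ...) for each case. It is worth inspecting the generated assembly to confirm that a branch is really present before drawing conclusions.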
On the other hand, if the branch can be well predicted and you immediately use num in a subsequent calculation, then it's possible for the branching version to perform better in the average case. That's because when it correctly predicts the branch, it can start speculatively executing the next instruction along the predicted path, using the value num has on that path, before the result of the compare is available (whereas in the MUL case, subsequent use of num will have a data dependency on the result of the MUL: it won't be able to execute until that result has been computed).
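The difference between the control dependency and the data dependency can be sketched in C as follows. The function names and the table lookup are hypothetical, chosen only so that num is used immediately; the comments describe what can happen if the compiler keeps the branch in the first function, which is not guaranteed (if it emits a conditional move instead, both versions carry a data dependency).

```c
/* Branching version: on either path num is a constant (0 or 5), so once
   the branch is predicted the lookup can be issued speculatively, before
   the load of *foo_ptr and the compare have completed. */
unsigned use_branch(const unsigned *foo_ptr, const unsigned *table)
{
    unsigned num = 0;
    if (*foo_ptr == 1) {
        num = 5;
    }
    return table[num];
}

/* Multiply version: num is computed from the loaded value, so the lookup
   has to wait for both the load and the multiply to produce their results. */
unsigned use_mul(const unsigned *foo_ptr, const unsigned *table)
{
    unsigned num = *foo_ptr * 5u;
    return table[num];
}
```

Both functions return table[5] when *foo_ptr is 1 and table[0] when it is 0; the table passed in needs at least six elements.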