I've come across this piece of code in some forum:
if ( a * b * c * d == 0 ) ....
and the owner claims this is a tad faster than
if (a == 0 || b == 0 || c == 0 || d == 0)
These variables are defined as:
int a, b, c, d;
And their absolute values are guaranteed to be less than or equal to 100. (So we could ignore the possibility of overflowing)
If we just ignore the readability and focus only on the performance, is the claim really correct?
It seems to me that the second approach might actually be faster since you could take advantage of 'short-circuit' sometimes. But then, what-do-I-know?!
The C standard says nothing about performance. The question of whether
if ( a * b * c * d == 0 )
is faster than
if (a == 0 || b == 0 || c == 0 || d == 0)
is meaningful only in the context of a particular compiler generating code running on a particular machine. The only real way to compare them is to measure the performance on your own system, or on whatever system you're interested in.
Still, we can speculate about what the performance is likely to be.
As you said, a, b, c, and d are objects of type int. You also said they're in the range [-100,+100] -- but the compiler doesn't necessarily know that.
A compiler is free to replace any expression with code that does the same thing.
Multiplication is a relatively complex operation, and is likely to be slower than, say, addition or comparison. A compiler could recognize that the first condition will be true if any of the four variables has the value 0, and replace the multiplications with whatever happens to be faster. But each optimization a compiler performs has to be explicitly programmed by the compiler's developers, and this particular pattern isn't likely to be common enough for it to be worth the effort of recognizing it.
You say the values are small enough that overflow isn't an issue. In fact, you can't portably make that assumption; INT_MAX can be as small as 32767. But the compiler knows how big an int is on the system for which it's generating code. Still, unless it has information about the values of a, b, c, and d, it can't assume that there will be no overflow.
Except that yes, actually, it can make that assumption. The behavior of signed integer overflow is undefined. That gives an optimizing compiler permission to assume that overflow can't occur (if it does, whatever behavior the program exhibits is valid anyway).
So yes, a compiler could replace the multiplications with something simpler, but it's not likely to do so.
As for the other expression, a == 0 || b == 0 || c == 0 || d == 0, the || operator has short-circuit semantics; if the left operand is true (non-zero), then the right operand isn't evaluated. And that kind of conditional code can create performance problems due to CPU pipeline stalls. Since none of the subexpressions have side effects (assuming none of the variables are declared volatile), the compiler can evaluate all four subexpressions, perhaps in parallel, if that's faster.
A quick experiment shows that gcc -O3 for x86 doesn't perform either optimization. For the first expression, it generates code that performs three multiplications. For the second, it generates conditional branches, implementing the canonical short-circuit evaluation (I don't know whether avoiding that would be faster or not).
Your best bet is to write reasonable code that's as straightforward as possible, both because it makes your source code easier to read and maintain, and because it's likely to give the compiler a better chance to recognize patterns and perform optimizations. If you try to do fancy micro-optimizations in your source code, you're as likely to hinder the compiler's optimizations as you are to help.
Don't worry too much about how fast your code is unless you've measured it and found it to be too slow. If you need your code to be faster, first concentrate on improved algorithms and data structures. And only if that fails, consider source-level micro-optimizations.
The First Rule of Program Optimization: Don't do it. The Second Rule of Program Optimization (for experts only!): Don't do it yet.
The two are not equivalent. For example, on my machine (32-bit x86 MSVC), if a, b, c and d are all equal to 0x100, then the first test will pass (the product 0x100000000 wraps to 0 in 32 bits) but the second condition will not.
Also note that multiplication is a costly operation, so the first version won't necessarily be faster.
EDIT: Code generated for the first version:
00401000 8B 44 24 04 mov eax,dword ptr [esp+4]
00401004 0F AF 44 24 08 imul eax,dword ptr [esp+8]
00401009 0F AF 44 24 0C imul eax,dword ptr [esp+0Ch]
0040100E 0F AF 44 24 10 imul eax,dword ptr [esp+10h]
00401013 85 C0 test eax,eax
00401015 75 07 jne f1+1Eh (40101Eh)
00401017 ...
Code generated for the second version:
00401020 83 7C 24 04 00 cmp dword ptr [esp+4],0
00401025 74 15 je f2+1Ch (40103Ch)
00401027 83 7C 24 08 00 cmp dword ptr [esp+8],0
0040102C 74 0E je f2+1Ch (40103Ch)
0040102E 83 7C 24 0C 00 cmp dword ptr [esp+0Ch],0
00401033 74 07 je f2+1Ch (40103Ch)
00401035 83 7C 24 10 00 cmp dword ptr [esp+10h],0
0040103A 75 07 jne f2+23h (401043h)
0040103C ...
Benchmarks on my machine (in nanoseconds): the first version runs in about 1.83 ns and the second in about 1.39 ns. The values of a, b, c and d didn't change during each run, so apparently the branch predictor could predict 100% of the branches.
So, as usual with which-is-faster questions: what have you tried so far? Did you compile, disassemble, and see what happens?
unsigned int mfun ( unsigned int a, unsigned int b, unsigned int c, unsigned int d )
{
    if ( a * b * c * d == 0 ) return(7);
    else return(11);
}
unsigned int ofun ( unsigned int a, unsigned int b, unsigned int c, unsigned int d )
{
    if (a == 0 || b == 0 || c == 0 || d == 0) return(7);
    else return(11);
}
For ARM, one compiler gives this:
00000000 <mfun>:
0: e0010190 mul r1, r0, r1
4: e0020291 mul r2, r1, r2
8: e0110293 muls r1, r3, r2
c: 13a0000b movne r0, #11
10: 03a00007 moveq r0, #7
14: e12fff1e bx lr
00000018 <ofun>:
18: e3500000 cmp r0, #0
1c: 13510000 cmpne r1, #0
20: 0a000004 beq 38 <ofun+0x20>
24: e3520000 cmp r2, #0
28: 13530000 cmpne r3, #0
2c: 13a0000b movne r0, #11
30: 03a00007 moveq r0, #7
34: e12fff1e bx lr
38: e3a00007 mov r0, #7
3c: e12fff1e bx lr
so the equals-and-ors version has short circuits (which are themselves costly), but its worst-case path takes longer, so its performance is erratic; the multiply version's performance is more deterministic. By inspection, the multiply solution should be faster for the above code.
MIPS gave me this:
00000000 <mfun>:
0: 00a40018 mult a1,a0
4: 00002012 mflo a0
...
10: 00860018 mult a0,a2
14: 00002012 mflo a0
...
20: 00870018 mult a0,a3
24: 00002012 mflo a0
28: 10800003 beqz a0,38 <mfun+0x38>
2c: 00000000 nop
30: 03e00008 jr ra
34: 2402000b li v0,11
38: 03e00008 jr ra
3c: 24020007 li v0,7
00000040 <ofun>:
40: 10800009 beqz a0,68 <ofun+0x28>
44: 00000000 nop
48: 10a00007 beqz a1,68 <ofun+0x28>
4c: 00000000 nop
50: 10c00005 beqz a2,68 <ofun+0x28>
54: 00000000 nop
58: 10e00003 beqz a3,68 <ofun+0x28>
5c: 00000000 nop
60: 03e00008 jr ra
64: 2402000b li v0,11
68: 03e00008 jr ra
6c: 24020007 li v0,7
Unless the branches are too costly, the equals-and-ors version looks faster.
OpenRISC 32:
00000000 <mfun>:
0: e0 64 1b 06 l.mul r3,r4,r3
4: e0 a3 2b 06 l.mul r5,r3,r5
8: e0 c5 33 06 l.mul r6,r5,r6
c: bc 26 00 00 l.sfnei r6,0x0
10: 0c 00 00 04 l.bnf 20 <mfun+0x20>
14: 9d 60 00 0b l.addi r11,r0,0xb
18: 44 00 48 00 l.jr r9
1c: 15 00 00 00 l.nop 0x0
20: 44 00 48 00 l.jr r9
24: 9d 60 00 07 l.addi r11,r0,0x7
00000028 <ofun>:
28: e0 e0 20 02 l.sub r7,r0,r4
2c: e0 87 20 04 l.or r4,r7,r4
30: bd 64 00 00 l.sfgesi r4,0x0
34: 10 00 00 10 l.bf 74 <ofun+0x4c>
38: e0 80 18 02 l.sub r4,r0,r3
3c: e0 64 18 04 l.or r3,r4,r3
40: bd 63 00 00 l.sfgesi r3,0x0
44: 10 00 00 0c l.bf 74 <ofun+0x4c>
48: e0 60 30 02 l.sub r3,r0,r6
4c: e0 c3 30 04 l.or r6,r3,r6
50: bd 66 00 00 l.sfgesi r6,0x0
54: 10 00 00 08 l.bf 74 <ofun+0x4c>
58: e0 60 28 02 l.sub r3,r0,r5
5c: e0 a3 28 04 l.or r5,r3,r5
60: bd 85 00 00 l.sfltsi r5,0x0
64: 0c 00 00 04 l.bnf 74 <ofun+0x4c>
68: 9d 60 00 0b l.addi r11,r0,0xb
6c: 44 00 48 00 l.jr r9
70: 15 00 00 00 l.nop 0x0
74: 44 00 48 00 l.jr r9
78: 9d 60 00 07 l.addi r11,r0,0x7
This depends on the implementation of multiply; if it is one clock, then the multiplies have it.
If your hardware doesn't support multiply, then you have to make a call to have it simulated:
00000000 <mfun>:
0: 0b 12 push r11
2: 0a 12 push r10
4: 09 12 push r9
6: 09 4d mov r13, r9
8: 0b 4c mov r12, r11
a: 0a 4e mov r14, r10
c: 0c 4f mov r15, r12
e: b0 12 00 00 call #0x0000
12: 0a 4e mov r14, r10
14: 0c 49 mov r9, r12
16: b0 12 00 00 call #0x0000
1a: 0a 4e mov r14, r10
1c: 0c 4b mov r11, r12
1e: b0 12 00 00 call #0x0000
22: 0e 93 tst r14
24: 06 24 jz $+14 ;abs 0x32
26: 3f 40 0b 00 mov #11, r15 ;#0x000b
2a: 39 41 pop r9
2c: 3a 41 pop r10
2e: 3b 41 pop r11
30: 30 41 ret
32: 3f 40 07 00 mov #7, r15 ;#0x0007
36: 39 41 pop r9
38: 3a 41 pop r10
3a: 3b 41 pop r11
3c: 30 41 ret
0000003e <ofun>:
3e: 0f 93 tst r15
40: 09 24 jz $+20 ;abs 0x54
42: 0e 93 tst r14
44: 07 24 jz $+16 ;abs 0x54
46: 0d 93 tst r13
48: 05 24 jz $+12 ;abs 0x54
4a: 0c 93 tst r12
4c: 03 24 jz $+8 ;abs 0x54
4e: 3f 40 0b 00 mov #11, r15 ;#0x000b
52: 30 41 ret
54: 3f 40 07 00 mov #7, r15 ;#0x0007
58: 30 41 ret
You would hope that the two are equivalent, and in a pure mathematical sense they should be: for the product to be zero, one operand needs to be zero. The problem is that this is software running on a processor. You can easily overflow on a multiply, get zero from non-zero operands, and so to implement the code properly the multiplies have to happen.
Because of the cost of mul and divide in particular, you should avoid them as much as possible in your software. For your multiply solution to be truly equivalent to the other one, it would require even more code to detect or prevent the overflow cases that can lead to a false positive. Yes, many processors perform mul (and divide) in one clock. The reason you don't see divide, and sometimes don't see mul, implemented in an instruction set is the chip real estate required; the expense is power, heat, the cost of the part, and so on. So mul and divide remain expensive. They are among the long poles in the tent for the performance of the part and its clock rate: folks want single-clock operation without realizing that one such instruction may slow the whole chip down, while allowing it to be multi-clock might bring the overall clock rate up. Many things are long poles in the tent, so removing mul might not change performance; it all depends...
Yes: the second if statement does at most 4 comparisons, while the first one always does 4 operations.
Edit : Explanation
The second if statement never does more work than the first one:
Suppose that a = 1, b = 2, c = 0 and d = 4. In this case:
For the first statement: 3 multiplications and a comparison = 4 operations.
For the second if statement: we compare a to 0 (no match), then b to 0 (again no match), then c to 0 (match) = 3 operations.
This is a simple program that outputs the execution time for these 2 statements; you can modify a, b, c and d, and pass the number of the statement as an argument.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* This is a test program to compare the execution time of the two if statements */
int main(int argc, char **argv)
{
    int i;
    int a = 1;
    int b = 2;
    int c = 0;
    int d = 4;
    int instruction_number;
    clock_t begin, end;
    double time_spent;

    if (argc != 2)
    {
        fprintf(stderr, "Usage : ./a.out if_instruction_number (1 or 2)\n");
        exit(EXIT_FAILURE);
    }
    instruction_number = atoi(argv[1]);

    begin = clock(); /* start timing after argument parsing */
    for (i = 1; i < 100000; i++)
    {
        switch (instruction_number)
        {
        case 1:
            fprintf(stdout, "First if instruction : \n");
            if (a * b * c * d == 0)
                fprintf(stdout, "1st instruction\n");
            break;
        case 2:
            fprintf(stdout, "Second if instruction : \n");
            if (a == 0 || b == 0 || c == 0 || d == 0)
                fprintf(stdout, "2nd instruction\n");
            break;
        default:
            break;
        }
    }
    end = clock();
    time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    fprintf(stdout, "Time to accomplish %d instruction ---> %f\n", instruction_number, time_spent);
    return 0;
}
Hope this helps.
Regards.
if ( a * b * c * d == 0 )
compiles to (without optimizations)
movl 16(%esp), %eax
imull 20(%esp), %eax
imull 24(%esp), %eax
imull 28(%esp), %eax
testl %eax, %eax
jne .L3
and if (a == 0 || b == 0 || c == 0 || d == 0)
compiles to
cmpl $0, 16(%esp)
je .L2
cmpl $0, 20(%esp)
je .L2
cmpl $0, 24(%esp)
je .L2
cmpl $0, 28(%esp)
jne .L4