C++: a*a-b*b vs (a+b)*(a-b): which is faster to compute?
Which way of computing a difference of squares in C++ is faster: a*a-b*b or (a+b)*(a-b)?
The first expression uses two multiplications and one subtraction, while the second needs one addition, one subtraction, and one multiplication. So the second approach seems faster. On the other hand, the first approach loads fewer values into registers, and this might compensate for the extra multiplication.
If you run this code
#include <iostream>
int main()
{
int a = 6, b = 7;
int c1 = a*a-b*b;
int c2 = (a-b)*(a+b);
return 0;
}
without optimization flags (-O), then the number of assembler instructions is the same for both lines.
For the line int c1 = a*a-b*b; the result is:
mov eax,DWORD PTR [rbp-0x4]
imul eax,eax
mov edx,eax
mov eax,DWORD PTR [rbp-0x8]
imul eax,eax
sub edx,eax
mov DWORD PTR [rbp-0xc],edx
For the line int c2 = (a-b)*(a+b); the result is:
mov eax,DWORD PTR [rbp-0x4]
sub eax,DWORD PTR [rbp-0x8]
mov ecx,DWORD PTR [rbp-0x4]
mov edx,DWORD PTR [rbp-0x8]
add edx,ecx
imul eax,edx
mov DWORD PTR [rbp-0x10],eax
On the other hand, the first sequence contains 4 instructions that operate only on registers, while the second contains only 2 such instructions; the others involve both memory and registers.
So the question is also whether it is possible to estimate which of these instruction sequences is faster.
Added after answers.
Thank you for responding; I found the answer. Look at the following code:
#include <iostream>
int dsq1(int a, int b)
{
return a*a-b*b;
}
int dsq2(int a, int b)
{
return (a+b)*(a-b);
}
int main()
{
int a,b;
// just to be sure that the compiler does not know
// precise values of a and b and will not optimize them
std::cin >> a;
std::cin >> b;
volatile int c1 = dsq1(a,b);
volatile int c2 = dsq2(a,b);
return 0;
}
Now the first function, for a*a-b*b, takes the following 5 assembler instructions with two multiplications:
mov esi,eax
mov ecx,edx
imul esi,eax
imul ecx,edx
sub ecx,esi
while (a-b)*(a+b) takes only 4 instructions and only one multiplication:
mov ecx,edx
sub ecx,eax
add eax,edx
imul eax,ecx
It seems that (a-b)*(a+b) should be faster than a*a-b*b.
Now this really depends on the compiler and architecture. Let's look at these two functions:
int f1(int a, int b) {
return a*a-b*b;
}
int f2(int a, int b) {
return (a-b)*(a+b);
}
Let's look at what that produces on x86_64:
MSVC
a$ = 8
b$ = 16
int f1(int,int) PROC ; f1, COMDAT
imul ecx, ecx
imul edx, edx
sub ecx, edx
mov eax, ecx
ret 0
int f1(int,int) ENDP ; f1
a$ = 8
b$ = 16
int f2(int,int) PROC ; f2, COMDAT
mov eax, ecx
add ecx, edx
sub eax, edx
imul eax, ecx
ret 0
int f2(int,int) ENDP ; f2
gcc 12.1
f1(int, int):
imul edi, edi
imul esi, esi
mov eax, edi
sub eax, esi
ret
f2(int, int):
mov eax, edi
add edi, esi
sub eax, esi
imul eax, edi
ret
clang 14.0
f1(int, int): # @f1(int, int)
mov eax, edi
imul eax, edi
imul esi, esi
sub eax, esi
ret
f2(int, int): # @f2(int, int)
lea eax, [rsi + rdi]
mov ecx, edi
sub ecx, esi
imul eax, ecx
ret
Each is just a permutation of the same 4 opcodes. You are trading an imul for an add, which might be faster, or rather might leave more execution units free to run in parallel.
The clang f2 I find most interesting because it uses the address calculation unit (lea) instead of the arithmetic adder. So all 4 opcodes use different execution units.
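Whether trading an imul for an add pays off on a given CPU can only be settled by measuring. The following is a minimal timing sketch, not a rigorous benchmark. The names f1u, f2u and ns_per_call are made up for this illustration, unsigned operands are used so that wraparound is well defined, and each result is fed into the next call so that the loop measures a dependent chain. An optimizer may still rewrite one form into the other, so the disassembly should be checked before trusting the numbers.

```cpp
#include <chrono>

// Hypothetical unsigned variants of f1 and f2; unsigned arithmetic wraps
// around instead of invoking undefined behaviour on overflow.
inline unsigned f1u(unsigned a, unsigned b) { return a * a - b * b; }
inline unsigned f2u(unsigned a, unsigned b) { return (a - b) * (a + b); }

// Calls f n times in a dependent chain and returns the average time per
// call in nanoseconds (n must be greater than zero). The final value is
// written to checksum so that the compiler cannot discard the loop.
template <class F>
double ns_per_call(F f, unsigned n, unsigned& checksum) {
    unsigned acc = 1;
    const auto start = std::chrono::steady_clock::now();
    for (unsigned i = 0; i < n; ++i) {
        acc = f(acc + i, i | 1u);  // next input depends on the previous result
    }
    const auto stop = std::chrono::steady_clock::now();
    checksum = acc;
    return std::chrono::duration<double, std::nano>(stop - start).count() / n;
}
```

Calling ns_per_call(f1u, n, c1) and ns_per_call(f2u, n, c2) with a large n gives two numbers to compare. The checksums c1 and c2 should come out equal, because for unsigned operands the two expressions agree modulo 2^N.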
Now contrast that with ARM/ARM64:
ARM MSVC
|int f1(int,int)| PROC ; f1
mul r2,r0,r0
mul r3,r1,r1
subs r0,r2,r3
|$M4|
bx lr
ENDP ; |int f1(int,int)|, f1
|int f2(int,int)| PROC ; f2
subs r2,r0,r1
adds r3,r0,r1
mul r0,r2,r3
|$M4|
bx lr
ENDP ; |int f2(int,int)|, f2
ARM64 MSVC
|int f1(int,int)| PROC ; f1
mul w8,w0,w0
msub w0,w1,w1,w8
ret
ENDP ; |int f1(int,int)|, f1
|int f2(int,int)| PROC ; f2
sub w9,w0,w1
add w8,w0,w1
mul w0,w9,w8
ret
ENDP ; |int f2(int,int)|, f2
ARM gcc 12.1
f1(int, int):
mul r0, r0, r0
mls r0, r1, r1, r0
bx lr
f2(int, int):
subs r3, r0, r1
add r0, r0, r1
mul r0, r3, r0
bx lr
ARM64 gcc 12.1
f1(int, int):
mul w0, w0, w0
msub w0, w1, w1, w0
ret
f2(int, int):
sub w2, w0, w1
add w0, w0, w1
mul w0, w2, w0
ret
ARM clang 11.0.1
f1(int, int):
mul r2, r1, r1
mul r1, r0, r0
sub r0, r1, r2
bx lr
f2(int, int):
add r2, r1, r0
sub r1, r0, r1
mul r0, r1, r2
bx lr
ARM64 clang 11.0.1
f1(int, int): // @f1(int, int)
mul w8, w1, w1
neg w8, w8
madd w0, w0, w0, w8
ret
f2(int, int): // @f2(int, int)
sub w8, w0, w1
add w9, w1, w0
mul w0, w8, w9
ret
All compilers have eliminated the mov instruction, since there is more choice of which input and output registers to use. But there is a big difference in the generated code. Not all compilers seem to know that ARM/ARM64 has a multiply-and-subtract opcode. clang seems to know about multiply-and-add, though.
Now the question becomes: is an mls faster or slower than add + sub? With gcc, f1 seems to be better; with MSVC, only for ARM64; and for clang I think it is undecided.
And now for something completely different:
AVR gcc 11.1.0
f1(int, int):
mov r19,r22
mov r18,r23
mov r22,r24
mov r23,r25
rcall __mulhi3
mov r31,r25
mov r30,r24
mov r24,r19
mov r25,r18
mov r22,r19
mov r23,r18
rcall __mulhi3
mov r19,r31
mov r18,r30
sub r18,r24
sbc r19,r25
mov r25,r19
mov r24,r18
ret
f2(int, int):
mov r18,r22
mov r19,r23
mov r23,r25
mov r22,r24
add r22,r18
adc r23,r19
sub r24,r18
sbc r25,r19
rcall __mulhi3
ret
I think there is no argument that f2 is worlds better.
PS: Beware that the 2 functions are not equivalent. Their behavior differs with overflows. Or rather, they differ in when they overflow.
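The difference can be made concrete by redoing each intermediate step in a wider type. The sketch below assumes int is narrower than 64 bits (typically 32 bits); the helper names fits_int, f1_overflows and f2_overflows are invented for this illustration.

```cpp
#include <cstdint>
#include <limits>

// True if v is representable as an int.
inline bool fits_int(std::int64_t v) {
    return v >= std::numeric_limits<int>::min() &&
           v <= std::numeric_limits<int>::max();
}

// True if any intermediate result of a*a - b*b would overflow int.
// The subtraction is only evaluated once both squares are known to fit.
inline bool f1_overflows(int a, int b) {
    const std::int64_t aa = static_cast<std::int64_t>(a) * a;
    const std::int64_t bb = static_cast<std::int64_t>(b) * b;
    return !fits_int(aa) || !fits_int(bb) || !fits_int(aa - bb);
}

// True if any intermediate result of (a-b)*(a+b) would overflow int.
// The product is only formed once both factors are known to fit in int,
// so it cannot overflow the 64-bit type (assuming int is 32 bits or less).
inline bool f2_overflows(int a, int b) {
    const std::int64_t d = static_cast<std::int64_t>(a) - b;
    const std::int64_t s = static_cast<std::int64_t>(a) + b;
    return !fits_int(d) || !fits_int(s) || !fits_int(d * s);
}
```

With a 32-bit int, f1_overflows(46341, 46340) is true because 46341*46341 = 2147488281 exceeds INT_MAX, while f2_overflows(46341, 46340) is false because (a-b)*(a+b) is just 1*92681. Signed overflow is undefined behaviour in C++, so for such inputs only the second form has a defined result.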