C++: a*a-b*b vs (a+b)*(a-b) what is faster to compute?

Which way of computing the difference of squares in C++ is faster: a*a-b*b or (a+b)*(a-b)? The first expression uses two multiplications and one subtraction, while the second one needs one addition, one subtraction and one multiplication. So the second approach seems faster. On the other hand, the number of data loads into registers is smaller in the first approach, and this might compensate for trading a multiplication for an addition.

If you compile this code

#include <iostream>
int main()
{
    int a = 6, b = 7;
    int c1 = a*a-b*b;
    int c2 = (a-b)*(a+b);
    return 0;
}

say here, without optimization flags (-O), then the number of assembler instructions will be the same:

for the line int c1 = a*a-b*b;:

 mov    eax,DWORD PTR [rbp-0x4]
 imul   eax,eax
 mov    edx,eax
 mov    eax,DWORD PTR [rbp-0x8]
 imul   eax,eax
 sub    edx,eax
 mov    DWORD PTR [rbp-0xc],edx

for the line int c2 = (a-b)*(a+b);:

 mov    eax,DWORD PTR [rbp-0x4]
 sub    eax,DWORD PTR [rbp-0x8]
 mov    ecx,DWORD PTR [rbp-0x4]
 mov    edx,DWORD PTR [rbp-0x8]
 add    edx,ecx
 imul   eax,edx
 mov    DWORD PTR [rbp-0x10],eax

On the other hand, the first sequence contains 4 operations performed purely between registers, while the second sequence has only 2 such register-to-register operations; the remaining instructions involve both memory and registers.

So the question is also whether it is possible to estimate which of these instruction sequences is faster.
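One way to approach this empirically is a rough microbenchmark. The sketch below is an editorial addition rather than part of the original question: it times both formulas over the same input arrays and accumulates the results into a volatile sink so the optimizer cannot discard the work. At this granularity the difference is usually close to the measurement noise, so a dedicated benchmarking framework or hardware performance counters are needed for a trustworthy answer; treat this only as a starting point.

#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    const std::size_t n = 1 << 20;
    std::vector<int> xs(n), ys(n);
    for (std::size_t i = 0; i < n; ++i) {
        // arbitrary test data in [0, 1000)
        xs[i] = static_cast<int>((i * 7919) % 1000);
        ys[i] = static_cast<int>((i * 104729) % 1000);
    }

    volatile long long sink = 0;   // keeps the loops from being optimized away

    auto t0 = std::chrono::steady_clock::now();
    for (int rep = 0; rep < 100; ++rep)
        for (std::size_t i = 0; i < n; ++i)
            sink = sink + (xs[i] * xs[i] - ys[i] * ys[i]);
    auto t1 = std::chrono::steady_clock::now();
    for (int rep = 0; rep < 100; ++rep)
        for (std::size_t i = 0; i < n; ++i)
            sink = sink + (xs[i] + ys[i]) * (xs[i] - ys[i]);
    auto t2 = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::milli> d1 = t1 - t0, d2 = t2 - t1;
    std::cout << "a*a - b*b   : " << d1.count() << " ms\n";
    std::cout << "(a+b)*(a-b) : " << d2.count() << " ms\n";
    return 0;
}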


Added after answers.

Thank you for responding, I found the answer. Look at the following code:

#include <iostream>

int dsq1(int a, int b) 
{
    return a*a-b*b;
};


int dsq2(int a, int b) 
{
    return (a+b)*(a-b);
};

int main()
{
    int a,b;
    // just to be sure that the compiler does not know
    // precise values of a and b and will not optimize them
    std::cin >> a; 
    std::cin >> b; 
    volatile int c1 = dsq1(a,b);
    volatile int c2 = dsq2(a,b);
    return 0;
}

Now the first function, for a*a-b*b, takes the following 5 assembler instructions with two multiplications:

 mov    esi,eax
 mov    ecx,edx
 imul   esi,eax
 imul   ecx,edx
 sub    ecx,esi

while (a-b)*(a+b) takes only 4 instructions and only one multiplication:

 mov    ecx,edx
 sub    ecx,eax
 add    eax,edx
 imul   eax,ecx

It seems that (a-b)*(a+b) should be faster than a*a-b*b.

Now this really depends on the compiler and architecture. Let's look at these two functions:

int f1(int a, int b) {
    return a*a-b*b;
}

int f2(int a, int b) {
    return (a-b)*(a+b);
}

Let's look at what that produces on x86_64:

MSVC

a$ = 8
b$ = 16
int f1(int,int) PROC                                 ; f1, COMDAT
        imul    ecx, ecx
        imul    edx, edx
        sub     ecx, edx
        mov     eax, ecx
        ret     0
int f1(int,int) ENDP                                 ; f1

a$ = 8
b$ = 16
int f2(int,int) PROC                                 ; f2, COMDAT
        mov     eax, ecx
        add     ecx, edx
        sub     eax, edx
        imul    eax, ecx
        ret     0
int f2(int,int) ENDP                                 ; f2

gcc 12.1

f1(int, int):
        imul    edi, edi
        imul    esi, esi
        mov     eax, edi
        sub     eax, esi
        ret
f2(int, int):
        mov     eax, edi
        add     edi, esi
        sub     eax, esi
        imul    eax, edi
        ret

clang 14.0

f1(int, int):                                # @f1(int, int)
        mov     eax, edi
        imul    eax, edi
        imul    esi, esi
        sub     eax, esi
        ret
f2(int, int):                                # @f2(int, int)
        lea     eax, [rsi + rdi]
        mov     ecx, edi
        sub     ecx, esi
        imul    eax, ecx
        ret

Each is just a permutation of the same 4 opcodes: you are trading an imul for an add, which might be faster, or rather might keep more execution units busy in parallel.

The clang f2 is the one I find most interesting, because it uses the address-calculation unit (the lea) instead of the arithmetic adder, so all 4 opcodes use different execution units.

Now contrast that with ARM/ARM64:

ARM MSVC

|int f1(int,int)| PROC                           ; f1
        mul         r2,r0,r0
        mul         r3,r1,r1
        subs        r0,r2,r3
|$M4|
        bx          lr

        ENDP  ; |int f1(int,int)|, f1

|int f2(int,int)| PROC                           ; f2
        subs        r2,r0,r1
        adds        r3,r0,r1
        mul         r0,r2,r3
|$M4|
        bx          lr

        ENDP  ; |int f2(int,int)|, f2

ARM64 MSVC

|int f1(int,int)| PROC                           ; f1
        mul         w8,w0,w0
        msub        w0,w1,w1,w8
        ret

        ENDP  ; |int f1(int,int)|, f1

|int f2(int,int)| PROC                           ; f2
        sub         w9,w0,w1
        add         w8,w0,w1
        mul         w0,w9,w8
        ret

        ENDP  ; |int f2(int,int)|, f2

ARM gcc 12.1

f1(int, int):
        mul     r0, r0, r0
        mls     r0, r1, r1, r0
        bx      lr
f2(int, int):
        subs    r3, r0, r1
        add     r0, r0, r1
        mul     r0, r3, r0
        bx      lr

ARM64 gcc 12.1

f1(int, int):
        mul     w0, w0, w0
        msub    w0, w1, w1, w0
        ret
f2(int, int):
        sub     w2, w0, w1
        add     w0, w0, w1
        mul     w0, w2, w0
        ret

ARM clang 11.0.1

f1(int, int):
        mul     r2, r1, r1
        mul     r1, r0, r0
        sub     r0, r1, r2
        bx      lr
f2(int, int):
        add     r2, r1, r0
        sub     r1, r0, r1
        mul     r0, r1, r2
        bx      lr

ARM64 clang 11.0.1

f1(int, int):                                // @f1(int, int)
        mul     w8, w1, w1
        neg     w8, w8
        madd    w0, w0, w0, w8
        ret
f2(int, int):                                // @f2(int, int)
        sub     w8, w0, w1
        add     w9, w1, w0
        mul     w0, w8, w9
        ret

All compilers have eliminated the mov instructions, since there is more choice of which input and output registers to use. But there is a big difference in the generated code. Not all compilers seem to know that ARM/ARM64 has a multiply-and-subtract opcode (mls/msub); clang does seem to know about multiply-and-add (madd), though.

Now the question becomes: is an mls faster or slower than an add plus a sub? With gcc, f1 seems to be better; with MSVC it is better only for ARM64; with clang I think it is undecided.

And now for something completely different:

AVR gcc 11.1.0

f1(int, int):
        mov r19,r22
        mov r18,r23
        mov r22,r24
        mov r23,r25
        rcall __mulhi3
        mov r31,r25
        mov r30,r24
        mov r24,r19
        mov r25,r18
        mov r22,r19
        mov r23,r18
        rcall __mulhi3
        mov r19,r31
        mov r18,r30
        sub r18,r24
        sbc r19,r25
        mov r25,r19
        mov r24,r18
        ret
f2(int, int):
        mov r18,r22
        mov r19,r23
        mov r23,r25
        mov r22,r24
        add r22,r18
        adc r23,r19
        sub r24,r18
        sbc r25,r19
        rcall __mulhi3
        ret

I think there is no argument that f2 is worlds better.

PS: Beware that the two functions are not equivalent: their behavior differs under overflow, or rather in when they overflow.
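To make the overflow caveat concrete, here is a small editorial sketch (not part of the original answer). With a = 46341 and b = 46340 the true difference of squares is only 92681, yet the intermediate a*a is 2147488281, which already exceeds INT_MAX (2147483647): for 32-bit int, a*a-b*b hits signed-overflow undefined behavior while (a+b)*(a-b) stays comfortably in range. The intermediates are evaluated in 64-bit below so that the demonstration itself is well defined.

#include <cstdint>
#include <iostream>
#include <limits>

int main()
{
    const int a = 46341, b = 46340;

    // Do the arithmetic in 64 bits to inspect the intermediate values safely.
    const std::int64_t aa      = static_cast<std::int64_t>(a) * a;           // 2147488281
    const std::int64_t bb      = static_cast<std::int64_t>(b) * b;           // 2147395600
    const std::int64_t product = static_cast<std::int64_t>(a + b) * (a - b); // 92681

    std::cout << "a*a         = " << aa
              << (aa > std::numeric_limits<int>::max() ? "  (would overflow int)\n" : "\n");
    std::cout << "b*b         = " << bb << "\n";
    std::cout << "(a+b)*(a-b) = " << product << "\n";
    return 0;
}

The second form can of course overflow as well for large enough inputs; the point is simply that the two expressions do not overflow under the same conditions.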
