add vs mul (IA32-Assembly)

I know that the add instruction is faster than mul.

I want to know how to use add instead of mul in the following code to make it more efficient.

Sample code:

            mov eax, [ebp + 8]              #eax = x1
            mov ecx, [ebp + 12]             #ecx = x2
            mov edx, [ebp + 16]             #edx = y1
            mov ebx, [ebp + 20]             #ebx = y2

            sub eax,ecx                     #eax = x1-x2
            sub edx,ebx                     #edx = y1-y2

            mul edx                         #eax = (x1-x2)*(y1-y2)

add is faster than mul, but if you want to multiply two general values, mul is far faster than any loop iterating add operations.

You can't seriously use add to make that code go faster than it will with mul. If you needed to multiply by some small constant value (such as 2), then maybe you could use add to speed things up. But for the general case, no.

If you are multiplying two values that you don't know in advance, it is effectively impossible to beat the multiply instruction in x86 assembler.

If you know the value of one of the operands in advance, you may be able to beat the multiply instruction by using a small number of adds. This works particularly well when the known operand is small and has only a few bits set in its binary representation. To multiply an unknown value x by a known value of the form 2^p+2^q+...+2^r, you simply add x*2^p+x*2^q+...+x*2^r, where p, q, ... and r are the bits set in the constant. This is easily accomplished in assembler by left shifting and adding:

;  x in EDX
;  product to EAX
xor  eax,eax
shl  edx,r ; x*2^r
add  eax,edx
shl  edx,q-r ; x*2^q
add  eax,edx
shl  edx,p-q ; x*2^p
add  eax,edx
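
As a concrete instance of the pattern above, here is a minimal C sketch for the constant 42 = 2^5 + 2^3 + 2^1 (so p=5, q=3, r=1). The constant and the function name `mul42` are chosen only for illustration; each statement mirrors one instruction of the listing.

```c
#include <stdint.h>

/* Multiply by 42 = 2^5 + 2^3 + 2^1 using only shifts and adds.
   Arithmetic wraps modulo 2^32, like the 32-bit registers it models. */
uint32_t mul42(uint32_t x) {
    uint32_t product = 0;      /* xor eax,eax                 */
    x <<= 1; product += x;     /* shl edx,r   ; add: x*2^1    */
    x <<= 2; product += x;     /* shl edx,q-r ; add: x*2^3    */
    x <<= 2; product += x;     /* shl edx,p-q ; add: x*2^5    */
    return product;
}
```

The shift amounts after the first are differences (q-r, p-q) because x keeps its previous shift between steps.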

The key problem with this is that it takes at least 4 clocks, assuming a superscalar CPU constrained by register dependencies. Multiply typically takes 10 or fewer clocks on modern CPUs, and if this sequence takes longer than that, you might as well do a multiply.

To multiply by 9:

mov  eax,edx ; same effect as xor eax,eax / shl edx,0 / add eax,edx
shl  edx,3 ; x*2^3
add  eax,edx

This beats multiply; it should only take 2 clocks.

What is less well known is the use of the LEA (load effective address) instruction to accomplish a fast multiply by a small constant. LEA takes only a single clock in the worst case, and its execution time can often be overlapped with other instructions by superscalar CPUs.

LEA is essentially "add two values with small constant multipliers". It computes t=2^k*x+y for k=1,2,3 (k=0, a scale of 1, is also allowed; see the Intel reference manual), with t, x and y being any registers. If x==y, or if y is omitted, you can get 1,2,3,4,5,8,9 times x, but using x and y as separate registers allows intermediate results to be combined and moved to other registers (e.g., to t), and this turns out to be remarkably handy. Using it, you can accomplish a multiply by 9 using a single instruction:

lea  eax,[edx*8+edx]  ; takes 1 clock

Using LEA carefully, you can multiply by a variety of peculiar constants in a small number of cycles:

lea  eax,[edx*4+edx] ; 5 * edx
lea  eax,[eax*2+edx] ; 11 * edx
lea  eax,[eax*4] ; 44 * edx
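
To check a LEA chain like this by hand, it can help to model the instruction in C. The sketch below is only a model under stated assumptions: `lea` is a hypothetical helper computing base + index*scale on 32-bit values, and `mul9` and `mul44` are illustrative names for the two sequences shown in this answer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of "lea dst,[index*scale+base]" on 32-bit registers.
   The hardware only encodes scale factors of 1, 2, 4 and 8. */
static uint32_t lea(uint32_t base, uint32_t index, unsigned scale) {
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return base + index * scale;
}

uint32_t mul9(uint32_t edx) {
    return lea(edx, edx, 8);          /* lea eax,[edx*8+edx] :  9 * edx */
}

uint32_t mul44(uint32_t edx) {
    uint32_t eax = lea(edx, edx, 4);  /* lea eax,[edx*4+edx] :  5 * edx */
    eax = lea(edx, eax, 2);           /* lea eax,[eax*2+edx] : 11 * edx */
    eax = lea(0, eax, 4);             /* lea eax,[eax*4]     : 44 * edx */
    return eax;
}
```

Passing 0 as the base stands in for the form of LEA that has no base register.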

To do this, you have to decompose your constant multiplier into various factors/sums involving 1,2,3,4,5,8 and 9. It is remarkable how many small constants you can do this for, and still only use 3-4 instructions.

If you allow the use of other typically single-clock instructions (e.g., SHL/SUB/NEG/MOV), you can multiply by some constant values that pure LEA can't do as efficiently by itself. To multiply by 31:

lea  eax,[4*edx]
lea  eax,[8*eax]  ; 32*edx
sub  eax,edx; 31*edx ; 3 clocks

The corresponding LEA sequence is longer:

lea  eax,[edx*4+edx] ; 5 * edx
lea  eax,[edx*2+eax] ; 7 * edx
lea  eax,[eax*2+edx] ; 15 * edx
lea  eax,[eax*2+edx] ; 31 * edx ; 4 clocks
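
The two multiply-by-31 sequences can be compared with a small C sketch, assuming 32-bit wraparound arithmetic. The function names `mul31_shift_sub` and `mul31_lea_only` are made up for illustration; each statement corresponds to one instruction.

```c
#include <stdint.h>

/* Three instructions: scale up to 32*x, then subtract x once. */
uint32_t mul31_shift_sub(uint32_t x) {
    uint32_t eax = 4 * x;      /* lea eax,[4*edx]      :  4x */
    eax = 8 * eax;             /* lea eax,[8*eax]      : 32x */
    eax -= x;                  /* sub eax,edx          : 31x */
    return eax;
}

/* Four instructions when restricted to LEA. */
uint32_t mul31_lea_only(uint32_t x) {
    uint32_t eax = x * 4 + x;  /* lea eax,[edx*4+edx]  :  5x */
    eax = x * 2 + eax;         /* lea eax,[edx*2+eax]  :  7x */
    eax = eax * 2 + x;         /* lea eax,[eax*2+edx]  : 15x */
    eax = eax * 2 + x;         /* lea eax,[eax*2+edx]  : 31x */
    return eax;
}
```

Both produce the same product; the SUB variant is shorter because 31 is one less than a power of two.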

Figuring out these sequences is a bit tricky, but you can set up an organized attack.

Since LEA, SHL, SUB, NEG and MOV are all single-clock instructions in the worst case, and zero clocks if they have no dependences on other instructions, you can compute the execution cost of any such sequence. This means you can implement a dynamic programming algorithm to generate the best possible sequence of such instructions. This is only useful if the clock count is smaller than that of the integer multiply on your particular CPU (I use 5 clocks as a rule of thumb), and if the sequence doesn't use up all the registers, or at least doesn't use up registers that are already busy (avoiding any spills).
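
A rough idea of such a search can be sketched in C. This is not the PARLANSE implementation, and it uses a plain breadth-first search as a simpler stand-in for dynamic programming. The model is an assumption made for illustration: one accumulator (eax) holding m*x, the original x always available in edx, every instruction costing one step, and intermediate multipliers capped at a made-up `LIMIT`. It ignores superscalar overlap and register pressure.

```c
#include <assert.h>
#include <string.h>

#define LIMIT 4096             /* assumed cap on intermediate multipliers */

static int cost[LIMIT + 1];    /* cost[n]: fewest steps found for n*x, -1 = not reached */
static int queue[LIMIT + 1];   /* BFS work list; each multiplier enters at most once */
static int tail;
static int built;

static void visit(int n, int steps) {
    if (n >= 1 && n <= LIMIT && cost[n] < 0) {
        cost[n] = steps;
        queue[tail++] = n;
    }
}

static void build_costs(void) {
    static const int scale[4] = {1, 2, 4, 8};
    int head = 0;
    memset(cost, -1, sizeof cost);   /* all bytes 0xFF, so every entry is -1 */
    tail = 0;
    visit(1, 0);                     /* x itself is available at no cost */
    while (head < tail) {
        int m = queue[head++];       /* accumulator currently holds m*x */
        int steps = cost[m] + 1;
        for (int i = 0; i < 4; i++) {
            int s = scale[i];
            visit(m * s, steps);     /* lea eax,[eax*s]     */
            visit(m * s + m, steps); /* lea eax,[eax*s+eax] */
            visit(m * s + 1, steps); /* lea eax,[eax*s+edx] */
            visit(m + s, steps);     /* lea eax,[edx*s+eax] */
        }
        visit(m - 1, steps);         /* sub eax,edx         */
    }
    built = 1;
}

/* Fewest single-clock steps, in this simplified model, to compute n*x. */
int min_ops(int n) {
    assert(n >= 1 && n <= LIMIT);
    if (!built) build_costs();
    return cost[n];
}
```

In this model `min_ops(9)` is 1 and both `min_ops(31)` and `min_ops(44)` are 3, matching the hand-written sequences. A real generator would also record which instruction produced each multiplier so the sequence can be emitted, and would compare the step count against the multiply latency of the target CPU.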

I've actually built this into our PARLANSE compiler, and it is very effective for computing offsets into arrays of structures A[i], where the size of the structure element in A is the known constant. A clever person would possibly cache the answer so it doesn't have to be recomputed each time a multiply by the same constant occurs; I didn't actually do that because the time to generate such sequences is less than you'd expect.

It is mildly interesting to print out the sequences of instructions needed to multiply by all constants from 1 to 10000. Most of them can be done in 5-6 instructions in the worst case. As a consequence, the PARLANSE compiler hardly ever uses an actual multiply when indexing even the nastiest arrays of nested structures.

Unless your multiplications are fairly simplistic, the add most likely won't outperform a mul. Having said that, you can use add to do multiplications:

Multiply by 2:
    add eax,eax          ; x2
Multiply by 4:
    add eax,eax          ; x2
    add eax,eax          ; x4
Multiply by 8:
    add eax,eax          ; x2
    add eax,eax          ; x4
    add eax,eax          ; x8

They work nicely for powers of two. I'm not saying they're faster. They were certainly necessary in the days before fancy multiplication instructions. That's from someone whose soul was forged in the hell-fires that were the MOS Technology 6502, Zilog Z80 and RCA 1802 :-)

You can even multiply by non-powers of two by simply storing interim results:

Multiply by 9:
    push ebx              ; preserve
    push eax              ; save for later
    add  eax,eax          ; x2
    add  eax,eax          ; x4
    add  eax,eax          ; x8
    pop  ebx              ; get original eax into ebx
    add  eax,ebx          ; x9
    pop  ebx              ; recover original ebx

I generally suggest that you write your code primarily for readability and only worry about performance when you need it. However, if you're working in assembler, you may well already be at that point. But I'm not sure my "solution" is really applicable to your situation since you have an arbitrary multiplicand.

You should, however, always profile your code in the target environment to ensure that what you're doing is actually faster. Assembler doesn't change that aspect of optimisation at all.


If you really want to see some more general-purpose assembler for using add to do multiplication, here's a routine that will take two unsigned values in ax and bx and return the product in ax. It will not handle overflow elegantly.

START:  MOV    AX, 0007    ; Load up registers
        MOV    BX, 0005
        CALL   MULT        ; Call multiply function.
        HLT                ; Stop.

MULT:   PUSH   BX          ; Preserve BX, CX, DX.
        PUSH   CX
        PUSH   DX

        XOR    CX,CX       ; CX is the accumulator.

        CMP    BX, 0       ; If multiplying by zero, just stop.
        JZ     FIN

MORE:   PUSH   BX          ; Xfer BX to DX for bit check.
        POP    DX

        AND    DX, 0001    ; Is lowest bit 1?
        JZ     NOADD       ; No, do not add.
        ADD    CX,AX

NOADD:  SHL    AX,1        ; Shift AX left (double).
        SHR    BX,1        ; Shift BX right (integer halve, next bit).
        JNZ    MORE        ; Keep going until no more bits in BX.

FIN:    PUSH   CX          ; Xfer product from CX to AX.
        POP    AX

        POP    DX          ; Restore registers and return.
        POP    CX
        POP    BX
        RET

It relies on the fact that 123 multiplied by 456 is identical to:

    123 x 6
+  1230 x 5
+ 12300 x 4

which is the same way you were taught multiplication back in grade/primary school. It's easier with binary since you're only ever multiplying by zero or one (in other words, either adding or not adding).
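
For readers more comfortable with C, the same shift-and-add loop might look like the sketch below. It assumes 16-bit unsigned operands to match the ax/bx registers of the routine; the name `mult16` is illustrative, and overflow simply wraps modulo 2^16.

```c
#include <stdint.h>

/* Binary long multiplication: for each set bit of b, add the
   correspondingly shifted copy of a to the running product. */
uint16_t mult16(uint16_t a, uint16_t b) {
    uint16_t product = 0;                       /* XOR CX,CX          */
    while (b != 0) {                            /* loop while BX != 0 */
        if (b & 1u) {                           /* AND DX,0001        */
            product = (uint16_t)(product + a);  /* ADD CX,AX          */
        }
        a = (uint16_t)(a << 1);                 /* SHL AX,1           */
        b >>= 1;                                /* SHR BX,1           */
    }
    return product;
}
```

The loop runs once per bit of b up to its highest set bit, so at most 16 iterations, rather than b iterations as a naive repeated-add loop would.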

It's pretty old-school x86 (8086, from a DEBUG session - I can't believe they still actually include that thing in XP) since that was about the last time I coded directly in assembler. There's something to be said for high level languages :-)

When it comes to assembly instructions, the speed of executing any instruction is measured in clock cycles. The mul instruction always takes more clock cycles than an add, but if you execute the add instruction in a loop, the overall clock count to do the multiplication using add will be far higher than for a single mul instruction. You can have a look at the following URLs, which list the clock cycles of the single add and mul instructions. That way you can do the math and see which one will be faster.

http://home.comcast.net/~fbui/intel_a.html#add

http://home.comcast.net/~fbui/intel_m.html#mul

My recommendation is to use the mul instruction rather than putting add in a loop; the latter is a very inefficient solution.

I'd have to echo the responses you have already - for a general multiply you're best off using MUL - after all, that's what it's there for!

In some specific cases, where you know you'll want to multiply by a specific fixed value each time (for example, when working out a pixel index in a bitmap), you can consider breaking the multiply down into a (small) handful of SHLs and ADDs - e.g.:

1280 x 1024 display - each line on the display is 1280 pixels.

1280 = 1024 + 256 = 2^10 + 2^8

y * 1280 = y * (2^10) + y * (2^8) = ADD (SHL y, 10), (SHL y, 8)

...given that graphics processing is likely to need to be speedy, such an approach may save you precious clock cycles.
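
A minimal C sketch of that decomposition, assuming a 1280-pixel-wide linear framebuffer with one index per pixel; the function names are invented for the example.

```c
#include <stdint.h>

/* y * 1280 computed as y*2^10 + y*2^8: two shifts and one add. */
uint32_t row_offset_1280(uint32_t y) {
    return (y << 10) + (y << 8);
}

/* Index of pixel (x, y), counting pixels from the top-left corner. */
uint32_t pixel_index_1280(uint32_t x, uint32_t y) {
    return row_offset_1280(y) + x;
}
```

An optimising compiler will typically make this kind of substitution itself when the width is a compile-time constant, so writing it by hand mainly matters in hand-coded assembler.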
