Assembly 8086 - Implementing any multiplication and division without MUL and DIV instruction

I would like to know if there is a way to perform any multiplication or division without using the MUL or DIV instructions, because they require a lot of CPU cycles. Can I exploit the SHL or SHR instructions for this purpose? How can I implement this in assembly code?

Just like everything else in assembly, there are many ways to do multiplication and division.

  1. Do division by multiplying by the reciprocal value.
  2. Use shifts and adds/subs instead of multiplication.
  3. Use the address calculation options of lea (multiplication only).

Myth busting

"because they require a lot of CPU cycles"

MUL and IMUL are blazingly fast on modern CPUs, see: http://www.agner.org/optimize/instruction_tables.pdf
DIV and IDIV are, and always have been, exceedingly slow.

An example for Intel Skylake (page 217):

MUL, IMUL r64: Latency 3 cycles, reciprocal throughput 1 cycle.

Note that this is the maximum latency to multiply two 64-bit (!) values.
The CPU can complete one of these multiplications every CPU cycle if all it's doing is multiplications.
Consider that the example using shifts and adds to multiply by 7 has a latency of 4 cycles (3 using lea). There is no real way to beat a plain multiply on a modern CPU.

Multiplication by the reciprocal

According to Agner Fog's asmlib instructions, page 12:

Division is slow on most microprocessors. In floating point calculations, we can do multiple divisions with the same divisor faster by multiplying with the reciprocal, for example:

 float a, b, d; a /= d; b /= d; 

can be changed to:

 float a, b, d, r; r = 1.0f / d; a *= r; b *= r; 

If we want to do something similar with integers then we have to scale the reciprocal divisor by 2^n and then shift n places to the right after the multiplication.

Multiplying by the reciprocal works well when you need to divide by a constant or if you divide by the same variable many times in a row.
You can find really cool assembly code demonstrating the concept in Agner Fog's assembly library.
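
To make the integer version concrete, here is a minimal sketch in C rather than assembly. It divides an unsigned 32-bit value by the constant 10. The names div10 and RECIP10 are made up for this illustration, and the choice n = 35 is an assumption that happens to be exact for every 32-bit input with this particular divisor; other divisors need their own constant and shift.

```c
#include <stdint.h>

/* Hypothetical helper, not taken from asmlib.
   RECIP10 is the reciprocal of 10 scaled by 2^35 and rounded up:
   ceil(2^35 / 10) = 0xCCCCCCCD. */
#define RECIP10 0xCCCCCCCDu

uint32_t div10(uint32_t x)
{
    /* Multiply by the scaled reciprocal in 64 bits, then shift right
       by n = 35 places to undo the scaling. */
    return (uint32_t)(((uint64_t)x * RECIP10) >> 35);
}
```

In assembly this becomes one widening multiply followed by a right shift of the high half, which is why it only pays off when the divisor is a constant or is reused many times.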

Shifts and adds/subs
A shift right is a divide by two: shr (Reduce).
A shift left is a multiply by two: shl (Larger).
You can add and subtract to correct for non-powers of two along the way.

//Multiply by 7
mov ecx,eax
shl eax,3    //*8
sub eax,ecx  //*7

Division other than by powers of 2 using this method gets complex quickly.
You may wonder why I'm doing the operations in a weird order, but I'm trying to make the dependency chain as short as possible to maximize the number of instructions that can be executed in parallel.
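
If the multiplier isn't known in advance, the same idea can be applied bit by bit. Below is a minimal sketch in C rather than assembly; the function name mul_shift_add is made up for this illustration. Each `<<= 1` corresponds to a shl and each `>>= 1` to a shr.

```c
#include <stdint.h>

/* Hypothetical helper: multiply two 16-bit values using only shifts
   and adds. For every set bit i of b, a * 2^i is added to the result. */
uint32_t mul_shift_add(uint16_t a, uint16_t b)
{
    uint32_t result = 0;
    uint32_t addend = a;       /* holds a * 2^i for the current bit i */

    while (b != 0) {
        if (b & 1u)
            result += addend;  /* this bit of b contributes a * 2^i */
        addend <<= 1;          /* shl: move to the next power of two */
        b >>= 1;               /* shr: look at the next bit of b */
    }
    return result;
}
```

This loop runs up to 16 times, so on a modern CPU it will not beat a plain MUL. It is mainly useful for seeing how the constant-multiplier tricks generalize.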

Using Lea
Lea is an instruction to calculate address offsets.
It can calculate multiples of 2, 3, 4, 5, 8, and 9 in a single instruction.
Like so:

                      //Latency on AMD CPUs (K10 and later, including Jaguar and Zen)
                      //On Intel all take 1 cycle.
lea eax,[eax+eax]     //*2     1 cycle      
lea eax,[eax*2+eax]   //*3     2 cycles
lea eax,[eax*4]       //*4     2 cycles   more efficient: shl eax,2 (1 cycle)
lea eax,[eax*4+eax]   //*5     2 cycles 
lea eax,[eax*8]       //*8     2 cycles   more efficient: shl eax,3 (1 cycle)
lea eax,[eax*8+eax]   //*9     2 cycles

Note however that lea with a multiplier (scale factor) is considered a 'complex' instruction on AMD CPUs from K10 to Zen and has a latency of 2 CPU cycles. On earlier AMD CPUs (K8), lea always has 2-cycle latency even with a simple [reg+reg] or [reg+disp8] addressing mode.

AMD
Agner Fog's instruction tables are wrong for AMD Zen: 3-component or scaled-index LEA is still 2 cycles on Zen (with only 2 per clock throughput instead of 4) according to InstLatx64 ( http://instlatx64.atw.hu/ ). Also, like earlier CPUs, in 64-bit mode lea r32, [r64 + whatever] has 2 cycle latency. So it's actually faster to use lea rdx, [rax+rax] instead of lea edx, [rax+rax] on AMD CPUs, unlike Intel where truncating the result to 32 bits is free.

The *4 and *8 can be done faster using shl because a simple shift takes only a single cycle.

On the plus side, lea does not alter the flags and it allows a free move to another destination register. Because lea can only shift left by 0, 1, 2, or 3 bits (aka multiply by 1, 2, 4, or 8) these are the only breaks you get.

Intel
On Intel CPUs (Sandybridge-family), any 2-component LEA (only one + ) has single-cycle latency. So lea edx, [rax + rax*4] has single-cycle latency, but lea edx, [rax + rax + 12] has 3 cycle latency (and worse throughput). An example of this tradeoff is discussed in detail in "C++ code for testing the Collatz conjecture faster than hand-written assembly - why?".
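
Two lea steps can also be chained to reach constants that a single lea can't. The following is only a sketch in C that models the addressing arithmetic; the helper lea_model and the function mul45 are made-up names, and the factor 45 = 9 * 5 is just an example. Whether the chain is worth it depends on the latencies discussed above.

```c
#include <stdint.h>

/* Hypothetical model of lea dst,[base + index*scale], where the scale
   is expressed as a left shift of 0, 1, 2 or 3 bits (x1, x2, x4, x8). */
static uint32_t lea_model(uint32_t base, uint32_t index, unsigned shift)
{
    return base + (index << shift);
}

/* x * 45 as two chained steps: (x * 9) * 5. */
uint32_t mul45(uint32_t x)
{
    uint32_t t = lea_model(x, x, 3);   /* like lea eax,[eax*8+eax] -> x * 9  */
    return lea_model(t, t, 2);         /* like lea eax,[eax*4+eax] -> x * 45 */
}
```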

Things like SHL/SHR, SAL/SAR, ADD/SUB are faster than MUL and DIV, but MUL and DIV work better for dynamic numbers. For example, if you know that you just need to divide by two, then it's a single-bit shift right. But if you don't know the number in advance, then you might be tempted to repeatedly SUB the values. For example, to determine AX divided by BX, you could just keep subtracting BX from AX until BX is > AX, keeping track of the count. But if you were dividing 200 by 1, that would mean 200 loops and SUB operations.

MUL and DIV will work better in most cases when the numbers involved aren't hard-coded and known in advance. The only exception I can think of is when you know it's something like a multiply or divide by 2, 4, 8, etc., where the shift instructions will work fine.
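
If you do want to avoid DIV for an unknown divisor, shifting and subtracting one bit at a time (binary long division) needs a fixed 16 steps for 16-bit values instead of one subtraction per unit of the quotient. Here is a minimal sketch in C rather than assembly; the name div_shift_sub is made up for this illustration, and the caller is assumed to guarantee a non-zero divisor.

```c
#include <stdint.h>

/* Hypothetical helper: unsigned 16-bit division using only shifts,
   compares and subtracts. Returns the quotient and stores the
   remainder in *rem. The divisor is assumed to be non-zero. */
uint16_t div_shift_sub(uint16_t dividend, uint16_t divisor, uint16_t *rem)
{
    uint16_t quotient = 0;
    uint32_t r = 0;                              /* running remainder */

    for (int i = 15; i >= 0; i--) {
        r = (r << 1) | ((dividend >> i) & 1u);   /* bring down the next bit */
        quotient <<= 1;
        if (r >= divisor) {                      /* divisor fits: subtract it */
            r -= divisor;
            quotient |= 1u;
        }
    }
    *rem = (uint16_t)r;
    return quotient;
}
```

Dividing 200 by 1 this way still takes 16 iterations, not 200, and the worst case never grows beyond that.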

Implementing multiplication is easier if you remember that an shl operation performs the same operation as multiplying the specified operand by two. Shifting to the left two bit positions multiplies the operand by four. Shifting to the left three bit positions multiplies the operand by eight. In general, shifting an operand to the left n bits multiplies it by 2^n. Any value can be multiplied by some constant using a series of shifts and adds or shifts and subtractions. For example, to multiply the ax register by ten, you need only multiply it by eight and then add in two times the original value. That is, 10*ax = 8*ax + 2*ax. The code to accomplish this is:

            shl     ax, 1           ;Multiply AX by two
            mov     bx, ax          ;Save 2*AX for later
            shl     ax, 1           ;Multiply AX by four
            shl     ax, 1           ;Multiply AX by eight
            add     ax, bx          ;Add in 2*AX to get 10*AX

The ax register (or just about any register, for that matter) can be multiplied by most constant values much faster using shl than by using the mul instruction. This may seem hard to believe since it only takes two instructions to compute this product:

            mov     bx, 10
            mul     bx

However, if you look at the timings, the shift and add example above requires fewer clock cycles on most processors in the 80x86 family than the mul instruction. The code is somewhat larger (by a few bytes), but the performance improvement is usually worth it. On the later 80x86 processors, the mul instruction is quite a bit faster than on the earlier processors, but the shift and add scheme is generally faster on these processors as well.

You can also use subtraction with shifts to perform a multiplication operation. Consider the following multiplication by seven:

            mov     bx, ax          ;Save AX*1
            shl     ax, 1           ;AX := AX*2
            shl     ax, 1           ;AX := AX*4
            shl     ax, 1           ;AX := AX*8
            sub     ax, bx          ;AX := AX*7

This follows directly from the fact that ax*7 = (ax*8)-ax.

A common error made by beginning assembly language students is subtracting or adding one or two rather than ax*1 or ax*2. The following does not compute ax*7:

            shl     ax, 1
            shl     ax, 1
            shl     ax, 1
            sub     ax, 1

It computes (8*ax)-1, something entirely different (unless, of course, ax = 1). Beware of this pitfall when using shifts, additions, and subtractions to perform multiplication operations.

Division is a bit harder, need to think...

Here is an example:

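mov bx, 1000b   ; bx = 8
shl bx, 5       ; bx = 8 * 32 = 256
mov cx, bx      ; cx = 256
shr cx, 2       ; cx = 256 / 4 = 64
add bx, cx      ; bx = 256 + 64 = 320
add bx, 1000b   ; bx = 320 + 8 = 328 (= 8 * 41)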


 