
assembly 8086 multiply 41 without using MUL

I would like to know if there is a way to perform a multiplication or division without using the MUL or DIV instructions, because they require a lot of CPU cycles. Can I exploit the SHL or SHR instructions for this purpose? How can I implement this in assembly?

I need help with a specific number: how can I multiply bx by 41 using only 5 instructions?

Whenever I try to solve the problem, I need at least 6 instructions.

My code:

    mov ax,bx
    mov cx,bx
    shl bx,5    ;  *32
    shl ax,3    ;  *8
    add bx,ax   ; *40 
    add bx,cx   ; *41

    ; ax = x
    mov bx, ax     ; bx = x
    shl bx, 3      ; bx = 8 * x
    add ax, bx     ; ax = 9 * x
    shl bx, 2      ; bx = 32 * x
    add ax, bx     ; ax = 41 * x
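
As a sanity check, the five-instruction sequence above can be modeled in C, one statement per instruction. This is only a sketch: `times41_shift_add` and `check_times41_shift_add` are hypothetical names, and `uint16_t` is assumed to be an adequate stand-in for a 16-bit register with wraparound.

```c
#include <assert.h>
#include <stdint.h>

/* Models the shift/add sequence; the casts reproduce 16-bit wraparound. */
uint16_t times41_shift_add(uint16_t x)
{
    uint16_t ax = x;              /* input: ax = x          */
    uint16_t bx = ax;             /* mov bx, ax             */
    bx = (uint16_t)(bx << 3);     /* shl bx, 3  : bx = 8x   */
    ax = (uint16_t)(ax + bx);     /* add ax, bx : ax = 9x   */
    bx = (uint16_t)(bx << 2);     /* shl bx, 2  : bx = 32x  */
    ax = (uint16_t)(ax + bx);     /* add ax, bx : ax = 41x  */
    return ax;
}

/* Compares every 16-bit input against a plain multiply truncated to 16 bits. */
void check_times41_shift_add(void)
{
    for (uint32_t x = 0; x <= 0xFFFFu; ++x)
        assert(times41_shift_add((uint16_t)x) == (uint16_t)(x * 41u));
}
```

Calling `check_times41_shift_add()` from a `main` runs the exhaustive comparison. The trick is that 41 = 9 + 32, and the 8x already sitting in bx is reused to get 32x with one more shift.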

What CPUs are you tuning for? Do you really mean actual 8086? They still exist as microcontrollers, but the vast majority of x86 code these days runs on modern x86.

Modern x86 CPUs have very fast multipliers, so it is usually only worth using shift/add or LEA when you can get the job done in 2 uops or fewer. div / idiv are still slow, but multiply isn't on modern CPUs, which throw enough transistors at the problem. (Multiplying by adding partial products parallelizes nicely in hardware; division is inherently serial.)

imul eax, ebx, 41 has 3-cycle latency and 1-per-clock throughput on modern Intel CPUs and Ryzen (https://agner.org/optimize/). The immediate form of imul is supported on 186 and later (the 32-bit operand size needs a 386). (The 16-bit form imul ax, bx, 41 is 2 uops instead of 1, with 4-cycle latency on Sandybridge-family CPUs, and it has a false dependency on the full EAX for merging into the low half.)


If you can use 32-bit addressing modes (386 and later), you can do it in 2 LEA instructions (so a total of 2 uops, 2 cycle latency on modern CPUs).

Look at how gcc/clang compile this function (on the Godbolt compiler explorer):

int times41(int x) { return x*41; }

# compiled for 32-bit with gcc -O3 -m32 -mregparm=1
times41(int):  # first arg in EAX
    lea     edx, [eax+eax*4]      # edx = eax*5
    lea     eax, [eax+edx*8]      # eax = eax + edx*8 =  x + x*40
    ret

This is your best bet for older CPUs where imul or mul take more uops, and if latency is more important than uop count on modern CPUs.
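
The two-LEA sequence above works because an x86 addressing mode can compute base + index*scale with a scale of 1, 2, 4 or 8, so 41 factors as 1 + 8*(1 + 4). A minimal C sketch of the same arithmetic follows; `times41_two_lea` is a hypothetical name, and unsigned 32-bit wraparound is assumed to match what LEA computes.

```c
#include <assert.h>
#include <stdint.h>

/* Each statement corresponds to one LEA: base + index*scale. */
uint32_t times41_two_lea(uint32_t x)
{
    uint32_t edx = x + x * 4;     /* lea edx, [eax+eax*4] : edx = 5x       */
    uint32_t eax = x + edx * 8;   /* lea eax, [eax+edx*8] : eax = x + 40x  */
    return eax;
}
```

The second step depends on the first, which is where the 2-cycle latency comes from.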

In your 16-bit code (on a 386-compatible), you could use

    lea     eax, [ebx+ebx*4]     # ax = bx*5
    lea     ax, [ebx+eax*8]      # ax = bx + ax*8 =  x + x*40

Using 32-bit operand-size for the first LEA avoids a false dependency on the old value of EAX, and avoids a partial-register stall on Nehalem and earlier (from the 2nd LEA reading EAX after writing AX).

It only costs 1 extra byte of code-size for the operand-size prefix (as well as the address-size prefix), and makes no difference for correctness. (The low 16 bits of left-shift and add results don't depend on the high bits of the input.)
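
The point about the high bits can be checked with a small C model of the lea eax / lea ax pair. This is a sketch under assumptions: `times41_low16` is a hypothetical name, and the `upper_garbage` parameter stands in for whatever happens to be in the upper half of EBX.

```c
#include <assert.h>
#include <stdint.h>

/* Computes in 32 bits with arbitrary high bits, then keeps only the low 16. */
uint16_t times41_low16(uint16_t bx, uint16_t upper_garbage)
{
    uint32_t ebx = ((uint32_t)upper_garbage << 16) | bx;
    uint32_t eax = ebx + ebx * 4;        /* lea eax, [ebx+ebx*4] */
    return (uint16_t)(ebx + eax * 8);    /* lea ax,  [ebx+eax*8] */
}
```

For any fixed `bx`, the result should be the same no matter what `upper_garbage` is, because carries in additions and left shifts only propagate upward.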

Or you might want to xor eax,eax before writing AX, letting Intel CPUs avoid partial-register merging for future use of AX. (See "Why doesn't GCC use partial registers?")
