
Swapping 2 registers in 8086 assembly language (16 bits)

Does someone know how to swap the values of 2 registers without using another variable, register, stack, or any other storage location? Thanks!

For example, swapping AX and BX.

8086 has an instruction for this:

xchg   ax, bx

If you really need to swap two registers, xchg ax, bx is in most cases the most efficient way on all x86 CPUs, modern and ancient, including 8086. (You could construct a case where multiple single-uop instructions are more efficient because of some front-end effect caused by the surrounding code. Or, for 32-bit operand size, zero-latency mov can make a 3-mov sequence with a temporary register better on Intel CPUs.)

For code size, xchg-with-ax takes only a single byte. This is where the 0x90 NOP encoding comes from: it's xchg ax, ax, or xchg eax, eax in 32-bit mode (see footnote 1). Exchanging any other pair of registers takes 2 bytes for the xchg r, r/m encoding (plus a REX prefix if required in 64-bit mode).
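To make the byte counts concrete, here is a small C sketch of the two register-register encodings in 16-bit mode. The helper name encode_xchg16 and the enum are made up for illustration. The function produces one valid encoding, not necessarily the bytes a particular assembler would choose, and it ignores prefixes needed in other modes.

```c
#include <stddef.h>
#include <stdint.h>

/* 16-bit register numbers as used in x86 instruction encodings. */
enum reg16 { AX = 0, CX = 1, DX = 2, BX = 3, SP = 4, BP = 5, SI = 6, DI = 7 };

/*
 * Hypothetical helper: writes one valid 16-bit-mode encoding of
 * "xchg a, b" into out[] and returns its length in bytes.
 */
size_t encode_xchg16(enum reg16 a, enum reg16 b, uint8_t out[2])
{
    if (a == AX || b == AX) {
        /* Short form: opcode 0x90 + number of the non-AX register. */
        enum reg16 other = (a == AX) ? b : a;
        out[0] = (uint8_t)(0x90 + other);
        return 1;
    }
    /* General form: opcode 0x87, then ModRM with mod=11 (register-direct). */
    out[0] = 0x87;
    out[1] = (uint8_t)(0xC0 | (b << 3) | a);
    return 2;
}
```

With this sketch, encode_xchg16(AX, BX, buf) yields the single byte 0x93, and encode_xchg16(AX, AX, buf) yields 0x90, the NOP byte.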

On an actual 8086, code-fetch was usually the performance bottleneck, so xchg is by far the best way, especially using the single-byte xchg-with-ax short form.

Footnote 1: In 64-bit mode, a real xchg eax, eax would zero-extend EAX into RAX, clearing the upper 32 bits, so 0x90 is explicitly defined as a nop instruction rather than also being an xchg.
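The truncation in footnote 1 follows from the x86-64 rule that writing a 32-bit register zeroes the upper half of the full 64-bit register. A minimal C model of that rule (both function names are hypothetical, purely for illustration):

```c
#include <stdint.h>

/*
 * Model of a 64-bit-mode write to a 32-bit register: the result is
 * zero-extended, so the old upper 32 bits of the full register are lost.
 */
uint64_t write_low32(uint64_t full_reg, uint32_t value)
{
    (void)full_reg;            /* old contents do not survive the write */
    return (uint64_t)value;
}

/* What 0x90 would do to RAX if it were decoded as a real xchg eax, eax. */
uint64_t hypothetical_xchg_eax_eax(uint64_t rax)
{
    uint32_t eax = (uint32_t)rax;      /* read the low half */
    return write_low32(rax, eax);      /* write it back: upper half cleared */
}
```

In this model, a RAX value of 0x1122334455667788 would come back as 0x0000000055667788, which is not what anyone expects from a NOP.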


For 32-bit / 64-bit registers, 3 mov instructions with a temporary could benefit from mov-elimination, which xchg can't on current Intel CPUs. xchg is 3 uops on Intel, each with 1c latency and needing an execution unit, so one direction has 2c latency but the other has 1c latency. See "Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures?" for more microarchitectural details about how current CPUs implement it.

On AMD Ryzen, xchg on 32/64-bit regs is 2 uops and is handled in the rename stage, so it's like two mov instructions that run in parallel. On earlier AMD CPUs, it's still a 2-uop instruction, but with 1c latency each way.


xor-swaps, add/sub swaps, and any other multi-instruction sequence other than mov are pointless compared to xchg for registers. They all have 2 and 3 cycle latency, and larger code-size. The only alternative worth considering is mov instructions.
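For reference, this is what an xor-swap computes, modeled in C on 16-bit values. The function name xor_swap16 is only illustrative. Each step needs the result of the previous one, which is where the 2-cycle and 3-cycle latencies for the two outputs come from.

```c
#include <stdint.h>

/*
 * Models:  xor ax, bx / xor bx, ax / xor ax, bx
 * Assumes ax and bx point to two distinct objects, as two different
 * registers always are.
 */
void xor_swap16(uint16_t *ax, uint16_t *bx)
{
    *ax ^= *bx;   /* ax = a ^ b                                   */
    *bx ^= *ax;   /* bx = b ^ (a ^ b) = a   (ready after 2 steps) */
    *ax ^= *bx;   /* ax = (a ^ b) ^ a = b   (ready after 3 steps) */
}
```

It gives the same result as xchg, but as a serial chain of three dependent operations.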

Or better, unroll a loop or rearrange your code to not need a swap, or to only need a mov.


Swapping a register with memory

Note that xchg with memory has an implied lock prefix. Do not use xchg with memory unless performance doesn't matter at all but code-size does (e.g. in a bootloader), or unless you need it to be atomic and/or a full memory barrier, because it's both.

(Fun fact: the implicit lock behaviour was new in 386. On 8086 through 286, xchg with mem isn't special unless you do lock xchg, so you can use it efficiently. But modern CPUs, even in 16-bit mode, do treat xchg mem, reg the same as lock xchg.)

So normally the most efficient thing to do is use another register:

     ; emulate  xchg [mem], cx  efficiently for modern x86
   movzx  eax, word [mem]
   mov    [mem], cx
   mov    cx, ax
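The data flow of that three-instruction sequence can be modeled in C as follows. The function name swap_with_mem_scratch is made up for this sketch. Unlike xchg with a memory operand, the sequence is not atomic.

```c
#include <stdint.h>

/* Models:  movzx eax, word [mem] / mov [mem], cx / mov cx, ax */
void swap_with_mem_scratch(uint16_t *mem, uint16_t *cx)
{
    uint32_t eax = *mem;     /* movzx: zero-extending load into the scratch reg */
    *mem = *cx;              /* the store does not have to wait for the load    */
    *cx = (uint16_t)eax;     /* copy the low 16 bits of the scratch into cx     */
}
```

The load and the store are independent of each other, so the only dependency is the final copy waiting on the load.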

If you need to exchange a register with memory and don't have a free scratch register, xor-swap could in some cases be the best option. Using temp memory would require copying the memory value (e.g. to the stack with push [mem], or first spilling the register to a second scratch memory location before loading and storing the memory operand).

The lowest-latency way by far is still with a scratch register; often you can pick one that isn't on the critical path, or that only needs to be reloaded (not saved in the first place, because the value is already in memory or can be recalculated from other registers with an ALU instruction).

; spill/reload another register
push  edx            ; save/restore on the stack or anywhere else

movzx edx, word [mem]    ; or just mov dx, [mem]
mov   [mem], ax
mov   eax, edx

pop   edx            ; or better, just clobber a scratch reg

Two other reasonable (but much worse) options for swapping memory with a register are:

  • not touching any other registers (except SP):

     ; using scratch space on the stack
     push  word [mem]     ; [mem] can be any addressing mode, e.g. [bx]
     mov   [mem], ax
     pop   ax             ; dep chain = load, store, reload.

  • or not touching anything else:

     ; using no extra space anywhere
     xor   ax, [mem]
     xor   [mem], ax      ; read-modify-write has store-forwarding + ALU latency
     xor   ax, [mem]      ; dep chain = load+xor, (parallel load)+xor+store, reload+xor

Using two memory-destination xor instructions and one memory-source xor would have worse throughput (more stores, and a longer dependency chain).

The push / pop version only works for operand-sizes that can be pushed/popped, but xor-swap works for any operand-size. If you can use a temporary on the stack, the save/restore version is probably preferable, unless you need a balance of code-size and speed.
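As a rough C model of the two fallback sequences (stack temporary and xor with memory), with hypothetical function names, the sketch below spells out which memory accesses each one performs:

```c
#include <stdint.h>

/* Models:  push [mem] / mov [mem], ax / pop ax */
void swap_via_stack_temp(uint16_t *mem, uint16_t *ax)
{
    uint16_t tmp = *mem;   /* push [mem]   : load, then store to the stack */
    *mem = *ax;            /* mov [mem], ax: store                         */
    *ax = tmp;             /* pop ax       : reload from the stack         */
}

/* Models:  xor ax, [mem] / xor [mem], ax / xor ax, [mem] */
void xor_swap_with_mem(uint16_t *mem, uint16_t *ax)
{
    *ax  = (uint16_t)(*ax ^ *mem);    /* load + xor                      */
    *mem = (uint16_t)(*mem ^ *ax);    /* load + xor + store; mem = old ax */
    *ax  = (uint16_t)(*ax ^ *mem);    /* reload + xor; ax = old mem       */
}
```

Both produce the same final state. The xor version touches [mem] three times and needs no other storage, while the stack version needs one temporary slot.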

You can do it with some arithmetic. Here is the idea; hope it helps!

I followed this C code:

int i = 10, j = 20;
i=i+j;
j=i-j;
i=i-j;

; the same idea in 8086 assembly; each result lands in the destination
; register, so no extra mov from an "accumulator" is needed
mov ax, 10
mov bx, 20
add ax, bx    ; ax = 10 + 20 = 30
sub bx, ax    ; bx = 20 - 30 = -10
neg bx        ; bx = 10 (the original ax)
sub ax, bx    ; ax = 30 - 10 = 20 (the original bx)
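The add/subtract idea can be checked in C with 16-bit unsigned values, which wrap modulo 2^16 the way AX and BX do. With plain int, as in the snippet above, an overflowing sum would be undefined behaviour. The function name addsub_swap16 is only illustrative.

```c
#include <stdint.h>

/* Add/subtract swap on 16-bit values; arithmetic wraps modulo 2^16. */
void addsub_swap16(uint16_t *a, uint16_t *b)
{
    *a = (uint16_t)(*a + *b);   /* a = a0 + b0 (mod 2^16)     */
    *b = (uint16_t)(*a - *b);   /* b = (a0 + b0) - b0 = a0    */
    *a = (uint16_t)(*a - *b);   /* a = (a0 + b0) - a0 = b0    */
}
```

The swap still works when the sum does not fit in 16 bits, because the later subtractions wrap back by the same amount.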


 