Swapping 2 registers in 8086 assembly language (16 bits)
Does someone know how to swap the values of 2 registers without using another variable, register, stack, or any other storage location? For example, swapping AX and BX.
Thanks!
8086 has an instruction for this:
xchg ax, bx
If you really need to swap two registers, xchg ax, bx is the most efficient way on all x86 CPUs in most cases, modern and ancient, including 8086. (You could construct a case where multiple single-uop instructions are more efficient because of some front-end effect from the surrounding code. And for 32-bit operand-size, zero-latency mov can make a 3-mov sequence with a temporary register better on Intel CPUs.)
For code-size, xchg-with-ax only takes a single byte. This is where the 0x90 NOP encoding comes from: it's xchg ax, ax, or xchg eax, eax in 32-bit mode (see footnote 1). Exchanging any other pair of registers takes 2 bytes for the xchg r, r/m encoding (plus a REX prefix if required in 64-bit mode).
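To make the two encodings concrete, here is a minimal C sketch that emits the bytes for a 16-bit register-to-register xchg in 16-bit mode. The helper name encode_xchg16 and the enum are made up for illustration; the opcodes (0x90 + register number for the AX short form, 0x87 followed by a ModRM byte for the general form) are the standard ones. Because the operation is symmetric, either operand order in the ModRM byte is a valid encoding; this sketch just picks one, and an assembler may pick the other.

```c
#include <stddef.h>
#include <stdint.h>

/* 16-bit register numbers as they appear in x86 instruction encodings. */
enum reg16 { AX = 0, CX = 1, DX = 2, BX = 3, SP = 4, BP = 5, SI = 6, DI = 7 };

/*
 * Hypothetical helper (not from any library): writes the machine code for
 * `xchg a, b` with 16-bit registers in 16-bit mode into out[] and returns
 * the number of bytes used (1 or 2).
 */
size_t encode_xchg16(enum reg16 a, enum reg16 b, uint8_t out[2])
{
    if (a == AX || b == AX) {
        /* Short form: 0x90 + number of the non-AX register.
         * xchg ax, ax therefore encodes as 0x90, the classic NOP. */
        enum reg16 other = (a == AX) ? b : a;
        out[0] = (uint8_t)(0x90 + other);
        return 1;
    }
    /* General form: opcode 0x87, then ModRM with mod = 11 (register-direct),
     * reg field = b, r/m field = a. */
    out[0] = 0x87;
    out[1] = (uint8_t)(0xC0 | (b << 3) | a);
    return 2;
}
```

Calling encode_xchg16(AX, BX, buf) yields the single byte 0x93, while encode_xchg16(CX, DX, buf) needs two bytes.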
On an actual 8086, code-fetch was usually the performance bottleneck, so xchg is by far the best way, especially using the single-byte xchg-with-ax short form.
Footnote 1: In 64-bit mode, xchg eax, eax would truncate RAX to 32 bits, so 0x90 is explicitly a nop instruction there, not also an xchg.
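The truncation in the footnote follows from the rule that writing a 32-bit register in 64-bit mode zero-extends into the full 64-bit register. The following C sketch only models that rule with a plain struct; the type regs64 and both function names are invented for illustration and do not correspond to any real API.

```c
#include <stdint.h>

/* Toy model of RAX; EAX is its low 32 bits. */
typedef struct { uint64_t rax; } regs64;

/* Model of any 32-bit write to EAX in 64-bit mode: the upper half is zeroed. */
void write_eax(regs64 *r, uint32_t value)
{
    r->rax = (uint64_t)value;
}

/* What 0x90 would do in 64-bit mode if it were decoded as a real
 * `xchg eax, eax`: read EAX, then write it back as a 32-bit register. */
void literal_xchg_eax_eax(regs64 *r)
{
    write_eax(r, (uint32_t)r->rax);
}
```

Under this model, a register holding 0x1122334455667788 would end up as 0x0000000055667788, which is clearly not a no-op.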
For 32-bit / 64-bit registers, 3 mov instructions with a temporary could benefit from mov-elimination where xchg can't on current Intel CPUs. xchg is 3 uops on Intel, all of them having 1c latency and needing an execution unit, so one direction has 2c latency but the other has 1c latency. See "Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures?" for more microarchitectural details about how current CPUs implement it.
On AMD Ryzen, xchg on 32/64-bit regs is 2 uops and is handled in the rename stage, so it's like two mov instructions that run in parallel. On earlier AMD CPUs, it's still a 2-uop instruction, but with 1c latency each way.
xor-swaps, add/sub swaps, or any other multi-instruction sequence other than mov are pointless compared to xchg for registers. They all have 2 and 3 cycle latency for the two results, and larger code-size. The only alternative worth considering is mov instructions.
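For reference, this is what an xor-swap computes, modelled in C on 16-bit values. It is only a sketch to show why the latency is worse: each of the three steps reads the result of the previous one, so they cannot run in parallel. The function name xor_swap16 is made up for illustration.

```c
#include <stdint.h>

/* xor-swap of two distinct 16-bit locations, one line per xor instruction.
 * a0 and b0 in the comments denote the original values. */
void xor_swap16(uint16_t *a, uint16_t *b)
{
    *a ^= *b;   /* xor ax, bx : a = a0 ^ b0                        */
    *b ^= *a;   /* xor bx, ax : b = b0 ^ (a0 ^ b0) = a0  (2 steps)  */
    *a ^= *b;   /* xor ax, bx : a = (a0 ^ b0) ^ a0 = b0  (3 steps)  */
}
```

Note that if both pointers refer to the same location, the first step zeroes it and the value is lost, whereas xchg of a register with itself is harmless.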
Or better, unroll a loop or rearrange your code to not need a swap, or to only need a mov.
Note that xchg with memory has an implied lock prefix. Do not use xchg with memory unless performance doesn't matter at all but code-size does (e.g. in a bootloader), or unless you need it to be atomic and/or a full memory barrier, because it's both.
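When the atomicity is what you actually want, higher-level code usually reaches it through an atomic-exchange primitive rather than hand-written assembly. A minimal sketch using C11's atomic_exchange from <stdatomic.h> is shown here; the wrapper name swap_shared is hypothetical. Compilers targeting x86 typically implement this with xchg and a memory operand, but that is a property of the compiler and target, not something the C standard promises.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical wrapper: atomically stores new_value into *mem and returns
 * the value that was there before. atomic_exchange uses sequentially
 * consistent ordering by default, matching xchg's full-barrier behaviour. */
uint16_t swap_shared(_Atomic uint16_t *mem, uint16_t new_value)
{
    return atomic_exchange(mem, new_value);
}
```

If you don't need the atomicity or the barrier, the plain load/store sequence with a scratch register avoids that cost.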
(Fun fact: the implicit lock behaviour was new in 386. On 8086 through 286, xchg with mem isn't special unless you write lock xchg, so you can use it efficiently. But modern CPUs, even in 16-bit mode, do treat xchg mem, reg the same as lock xchg.)
So normally the most efficient thing to do is use another register:
; emulate xchg [mem], cx efficiently for modern x86
movzx eax, word [mem]
mov [mem], cx
mov cx, ax
If you need to exchange a register with memory and don't have a free scratch register, xor-swap could in some cases be the best option. Using temp memory would require copying the memory value (e.g. to the stack with push [mem]), or first spilling the register to a 2nd scratch memory location before loading+storing the memory operand.
The lowest-latency way by far is still with a scratch register; often you can pick one that isn't on the critical path, or one that only needs to be reloaded (not saved in the first place, because the value's already in memory or can be recalculated from other registers with an ALU instruction).
; spill/reload another register
push edx ; save/restore on the stack or anywhere else
movzx edx, word [mem] ; or just mov dx, [mem]
mov [mem], ax
mov eax, edx
pop edx ; or better, just clobber a scratch reg
Two other reasonable (but much worse) options for swapping memory with a register are:
Not touching any other registers (except SP):
; using scratch space on the stack
push [mem]         ; [mem] can be any addressing mode, e.g. [bx]
mov  [mem], ax
pop  ax            ; dep chain = load, store, reload.
Or not touching anything else:
; using no extra space anywhere
xor  ax, [mem]
xor  [mem], ax     ; read-modify-write has store-forwarding + ALU latency
xor  ax, [mem]     ; dep chain = load+xor, (parallel load)+xor+store, reload+xor
Using two memory-destination xor instructions and one memory-source xor would give worse throughput (more stores, and a longer dependency chain).
The push / pop version only works for operand-sizes that can be pushed/popped, but xor-swap works for any operand-size. If you can use a temporary on the stack, the save/restore version is probably preferable, unless you need a balance of code-size and speed.
You can do it using some mathematical operation. I can give you an idea. Hope it helps!
I have followed this C code:
int i=10, j=20;
i=i+j;
j=i-j;
i=i-j;
mov ax,10
mov bx,20
add ax,bx   ; ax = 30 (the result goes straight to the destination, no extra mov needed)
sub bx,ax   ; bx = 20 - 30 = -10
neg bx      ; bx = 10, the original ax
sub ax,bx   ; ax = 30 - 10 = 20, the original bx
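One thing worth checking with the add/subtract approach is overflow: the sum of two 16-bit values may not fit in 16 bits. The sketch below models the three steps of the C snippet on uint16_t, where arithmetic wraps modulo 2^16 just as it does in AX and BX, and suggests the swap still works in that case. The function name addsub_swap16 is made up for illustration. (With plain signed int, as in the snippet above, overflow would be undefined behaviour in C, so unsigned types are used here.)

```c
#include <stdint.h>

/* add/sub swap on 16-bit values; a0 and b0 denote the original values.
 * The casts make the wraparound to 16 bits explicit after integer promotion. */
void addsub_swap16(uint16_t *a, uint16_t *b)
{
    *a = (uint16_t)(*a + *b);   /* a = a0 + b0        (mod 2^16) */
    *b = (uint16_t)(*a - *b);   /* b = (a0 + b0) - b0 = a0       */
    *a = (uint16_t)(*a - *b);   /* a = (a0 + b0) - a0 = b0       */
}
```

As with xor-swap, this is a chain of three dependent operations, so it is slower than xchg or a mov through a scratch register.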