X86 opcodes to move xmm register to general registers
What is a short x86 instruction sequence to move the xmm0 register into eax and edx?
Which parts of xmm0 do you want?
movd eax, xmm0
pextrd edx, xmm0, 1 ; SSE4.1
gets the low 64 bits of xmm0 into edx:eax.
If you need all 4 parts, consider storing to memory and reloading: store-forwarding to loads has more latency but better throughput than shuffles (fewer total uops), especially if you can use the elements as memory source operands instead of just loading them with mov.
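As a rough illustration of the store-and-reload idea, here is a minimal C sketch using SSE2 intrinsics. The helper name store_lanes is made up for this example, and the compiler is free to pick different instructions than the ones named in the comments.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: store all four 32-bit lanes to memory once,
   so the caller can reload (or use as memory operands) whichever
   elements it needs. */
static void store_lanes(__m128i v, uint32_t out[4])
{
    /* Typically compiles to a single movdqu/movups store. */
    _mm_storeu_si128((__m128i *)out, v);
}
```

After the call, out[0] holds the lowest element (what movd eax, xmm0 would give) and out[3] the highest.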
(But if you want a horizontal sum or something similar, normally do that with SIMD shuffles: pshufd / paddd twice, to reduce 4 elements to 2 and then to 1. That said, movd eax, xmm0 plus a movdqa [esp], xmm0 store and 3 scalar add eax, [esp + 4/8/12] instructions is actually not bad for total uops or latency in this case, unlike scalar FP, where latency is higher and you want the result in an XMM register anyway.)
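The two horizontal-sum strategies could be sketched in C with SSE2 intrinsics roughly like this. The function names are invented for illustration, and the instructions in the comments are what a compiler would typically emit, not a guarantee.

```c
#include <emmintrin.h>  /* SSE2 intrinsics, also provides _MM_SHUFFLE */
#include <stdint.h>

/* Shuffle version: pshufd / paddd twice, reducing 4 -> 2 -> 1 elements. */
static uint32_t hsum_shuffle(__m128i v)
{
    /* Swap the two 64-bit halves, then add: lanes 0 and 1 hold partial sums. */
    __m128i swapped64 = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));
    __m128i sum2 = _mm_add_epi32(v, swapped64);
    /* Swap adjacent 32-bit lanes, then add: lane 0 holds the full sum. */
    __m128i swapped32 = _mm_shuffle_epi32(sum2, _MM_SHUFFLE(2, 3, 0, 1));
    __m128i sum1 = _mm_add_epi32(sum2, swapped32);
    return (uint32_t)_mm_cvtsi128_si32(sum1);  /* movd */
}

/* Store version: movd for element 0, one vector store, 3 scalar adds. */
static uint32_t hsum_store(__m128i v)
{
    uint32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, v);
    uint32_t sum = (uint32_t)_mm_cvtsi128_si32(v);
    return sum + lanes[1] + lanes[2] + lanes[3];
}
```

Both use unsigned arithmetic so the sum wraps the same way paddd does.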
In 64-bit code, movq rax, xmm0 / shld rdx, rax, 32 might be better than pextrd, and doesn't require SSE4.1.
A more normal mov rdx, rax / shr rdx, 32 might be more efficient than SHLD, even though it costs more uops on Intel CPUs. shld is slow on AMD CPUs, 8 uops on Zen. ( https://uops.info/ )
BMI2 rorx rdx, rax, 32 is a good way to copy-and-shift, and is efficient on all CPUs that support it. It of course leaves the high half of RDX holding the old low half of RAX (so probably non-zero), but that's fine if you only read EDX.
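To make the difference between the shift and the rotate concrete, here is a small portable C model of what each sequence leaves in RDX. The function names are made up; this only models the arithmetic, not the instruction selection.

```c
#include <stdint.h>

/* Models: mov rdx, rax / shr rdx, 32
   The high half of the result is zero. */
static uint64_t high_by_shift(uint64_t rax)
{
    return rax >> 32;
}

/* Models: rorx rdx, rax, 32
   A rotate by 32 swaps the two halves, so the low 32 bits (EDX) hold
   the old high half and the upper 32 bits hold the old low half. */
static uint64_t high_by_rotate(uint64_t rax)
{
    return (rax >> 32) | (rax << 32);
}
```

Truncating either result to 32 bits gives the same EDX value; only the upper half of RDX differs.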
Another option would be to use both movd and movq on xmm0, if you're not close to bottlenecked on throughput for the single port they compete for. On most CPUs they can't actually run in parallel, so movd/movq competing for a port does still cost latency for the 2nd one. On a modern CPU with mov-elimination (Zen or IvyBridge), mov rdx, rax with zero latency is better. But this does get your values in EAX and EDX zero-extended into RAX and RDX:
movq rdx, xmm0
movd eax, xmm0 ; or schedule this first if you can use EAX right away
shr rdx, 32
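For anyone working in C rather than asm, the same movq / movd / shr sequence might be written with SSE2 intrinsics along these lines. This assumes an x86-64 target (where _mm_cvtsi128_si64 is available); low64_to_pair is a hypothetical name, and the compiler may schedule or select instructions differently from the comments.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: split the low 64 bits of v into two 32-bit values. */
static void low64_to_pair(__m128i v, uint32_t *eax, uint32_t *edx)
{
    uint64_t rdx = (uint64_t)_mm_cvtsi128_si64(v);  /* movq rdx, xmm0 */
    *eax = (uint32_t)_mm_cvtsi128_si32(v);          /* movd eax, xmm0 */
    *edx = (uint32_t)(rdx >> 32);                   /* shr  rdx, 32   */
}
```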
See the x86 tag wiki for instruction-set references and other stuff.
See Agner Fog's excellent Optimizing Assembly guide for tips on which instructions to use.