X86 opcodes for moving an xmm register to general-purpose registers

Question

What is a short x86 instruction sequence to move the xmm0 register into eax and edx?

Answer 1

Which parts of xmm0 do you want?

    movd     eax, xmm0
    pextrd   edx, xmm0, 1    ; SSE4.1

gets the low 64 bits of xmm0 into edx:eax. If you need all 4 parts, consider storing to memory and reloading: store-forwarding to loads has more latency but better throughput than shuffles (fewer total uops), especially if you can use the elements as memory source operands instead of just loading them with mov.

(But if you want a horizontal sum or something similar, normally do that with SIMD shuffles: pshufd / paddd twice, to reduce 4 elements to 2 and then to 1. That said, movd eax, xmm0 plus a movdqa [esp], xmm0 store and 3 scalar add eax, [esp + 4/8/12] instructions is actually not bad for total uops or latency in this case, unlike scalar FP, where latency is higher and you want the result in an XMM register anyway.)
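As a rough illustration of the two horizontal-sum strategies described above, here is a minimal C sketch using SSE2 intrinsics. The function names `hsum_shuffle` and `hsum_store_reload` are made up for this example, and the comments only indicate which instruction each intrinsic usually maps to; a compiler is free to pick different instructions.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also provides _MM_SHUFFLE) */
#include <stdint.h>

/* Shuffle-based sum: reduce 4 lanes to 2, then to 1 (pshufd / paddd twice). */
static uint32_t hsum_shuffle(__m128i v)
{
    /* Swap the two 64-bit halves so lane 0 lines up with lane 2, lane 1 with lane 3. */
    __m128i hi64  = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));
    __m128i sum64 = _mm_add_epi32(v, hi64);            /* lane 0 = v0+v2, lane 1 = v1+v3 */
    /* Swap adjacent 32-bit lanes so lane 0 lines up with lane 1. */
    __m128i hi32  = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1));
    __m128i sum32 = _mm_add_epi32(sum64, hi32);        /* lane 0 = v0+v1+v2+v3 */
    return (uint32_t)_mm_cvtsi128_si32(sum32);         /* usually movd eax, xmm */
}

/* Store/reload sum: one vector store, then scalar adds from memory. */
static uint32_t hsum_store_reload(__m128i v)
{
    uint32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, v);             /* vector store to the stack */
    uint32_t sum = (uint32_t)_mm_cvtsi128_si32(v);     /* lane 0 straight from the register */
    sum += lanes[1];                                   /* add eax, [esp + 4] */
    sum += lanes[2];                                   /* add eax, [esp + 8] */
    sum += lanes[3];                                   /* add eax, [esp + 12] */
    return sum;
}
```

Both functions wrap around on overflow, as paddd and add do. Which one is faster depends on the surrounding code, so treat this as a starting point for measuring rather than a recommendation.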

In 64-bit code, movq rax, xmm0 / shld rdx, rax, 32 might be better than pextrd, and doesn't require SSE4.1.

A more ordinary mov rdx, rax / shr rdx, 32 might be more efficient than SHLD, even though it costs more uops on Intel CPUs. shld is slow on AMD CPUs: 8 uops on Zen. ( https://uops.info/ )

BMI2 rorx rdx, rax, 32 is a good way to copy-and-shift, and is efficient on all CPUs that support it. It of course leaves the high half of RDX probably non-zero, but that's fine if you only read EDX.
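To make the difference between the three scalar options concrete, the following C sketch models what each one leaves in RDX, given the value that movq rax, xmm0 produced. The helper names are hypothetical and this is only a model of the register contents, not of performance.

```c
#include <stdint.h>

/* rax models the low 64 bits of xmm0, as produced by movq rax, xmm0. */

/* mov rdx, rax / shr rdx, 32: high half, zero-extended into RDX. */
static uint64_t rdx_after_mov_shr(uint64_t rax)
{
    return rax >> 32;
}

/* rorx rdx, rax, 32: rotate by 32. EDX holds the old high half,
   and the upper 32 bits of RDX hold the old low half (not zero). */
static uint64_t rdx_after_rorx(uint64_t rax)
{
    return (rax >> 32) | (rax << 32);
}

/* shld rdx, rax, 32: shift RDX left by 32 and fill the vacated low bits
   from the top of RAX. The upper 32 bits of RDX are whatever was in
   the old EDX. */
static uint64_t rdx_after_shld(uint64_t old_rdx, uint64_t rax)
{
    return (old_rdx << 32) | (rax >> 32);
}
```

In all three cases the low 32 bits of the result (EDX) are the same; only the upper half of RDX differs, which matters only if later code reads the full 64-bit register.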

Another option is two separate moves out of the XMM register (movd and movq), if you're not close to a throughput bottleneck on the single port they compete for. On most CPUs they can't actually run in parallel, so the port conflict still costs latency for the second one. On a modern CPU with mov-elimination (Zen or IvyBridge), mov rdx, rax with zero latency is better. But this version does leave your values in EAX and EDX zero-extended into RAX and RDX.

    movq  rdx, xmm0
    movd  eax, xmm0       ; or schedule this first if you can use EAX right away
    shr   rdx, 32
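For readers working from C rather than assembly, the same sequence can be sketched with SSE2 intrinsics. This assumes an x86-64 target (`_mm_cvtsi128_si64` is not available in 32-bit mode); the `u32_pair` struct and the function name are invented for this example, with `lo` standing in for EAX and `hi` for EDX.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical result type: lo plays the role of EAX, hi the role of EDX. */
struct u32_pair {
    uint32_t lo;
    uint32_t hi;
};

static struct u32_pair low64_to_pair(__m128i v)
{
    struct u32_pair r;
    uint64_t q = (uint64_t)_mm_cvtsi128_si64(v);   /* movq rdx, xmm0 */
    r.lo = (uint32_t)_mm_cvtsi128_si32(v);         /* movd eax, xmm0 */
    r.hi = (uint32_t)(q >> 32);                    /* shr  rdx, 32   */
    return r;
}
```

A compiler may well choose a different instruction sequence for this function, so check the generated assembly if the exact instructions matter.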

See the x86 tag wiki for instruction-set references and other stuff.

See Agner Fog's excellent Optimizing Assembly guide for tips on which instructions to use.

Answer 1
7 · Accepted · 2016-06-10 04:53:08
