
X86 opcodes to move xmm register to general registers

What is a short x86 instruction sequence to move the xmm0 register into eax and edx?

Which parts of xmm0 do you want?

    movd     eax, xmm0       ; low 32 bits of xmm0 (SSE2)
    pextrd   edx, xmm0, 1    ; bits 63:32 of xmm0 (SSE4.1)

This gets the low 64 bits of xmm0 into edx:eax.

If you need all 4 parts, consider storing to memory and reloading: store-forwarding to loads has more latency but better throughput than shuffles (fewer total uops), especially if you can use the values as memory source operands instead of just mov.
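
The store-and-reload idea can be sketched in C with SSE2 intrinsics. This is only an illustration, not part of the original answer: the helper name xmm_to_u32x4 is hypothetical, and the compiler decides which instructions it actually emits (typically a single 16-byte store followed by ordinary loads).

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2: __m128i, _mm_storeu_si128 */
#include <stdint.h>

/* Hypothetical helper (not from the answer): spill all four 32-bit lanes
   of v with one 16-byte store. out[0] receives the lowest lane. */
static void xmm_to_u32x4(__m128i v, uint32_t out[4]) {
    assert(out != 0);                      /* caller must supply a buffer */
    _mm_storeu_si128((__m128i *)out, v);   /* unaligned store, so any buffer works */
}
```

After the call, out[1] through out[3] can be used directly as memory operands of later scalar instructions, which is where the uop saving over shuffles comes from.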

(But if you want a horizontal sum or something, normally do that with SIMD shuffles like pshufd / paddd twice to reduce 4 elements to 2 then to 1. Although movd eax, xmm0 / a movdqa [esp], xmm0 store / 3 scalar add eax, [esp + 4/8/12] is actually not bad for total uops or latency in this case, unlike scalar FP where latency is higher and you want the result in an XMM reg anyway.)
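
A minimal sketch of the pshufd / paddd reduction in C with SSE2 intrinsics, assuming 32-bit integer lanes. The function name hsum_epi32 is hypothetical, and the shuffle constants are one reasonable choice rather than the only one.

```c
#include <emmintrin.h>   /* SSE2: _mm_shuffle_epi32, _mm_add_epi32 */
#include <stdint.h>

/* Hypothetical helper: sum the four 32-bit lanes of v (wraps modulo 2^32). */
static uint32_t hsum_epi32(__m128i v) {
    /* pshufd: swap the two 64-bit halves, so lanes become [2,3,0,1] */
    __m128i hi64 = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));
    __m128i sum2 = _mm_add_epi32(v, hi64);        /* paddd: 4 elements -> 2 */
    /* pshufd: swap adjacent lanes, so lanes become [1,0,3,2] */
    __m128i hi32 = _mm_shuffle_epi32(sum2, _MM_SHUFFLE(2, 3, 0, 1));
    __m128i sum1 = _mm_add_epi32(sum2, hi32);     /* paddd: 2 elements -> 1 */
    return (uint32_t)_mm_cvtsi128_si32(sum1);     /* movd: lane 0 to a GP register */
}
```

Only the final movd crosses from the vector domain to the integer domain, so a single transfer is needed regardless of how many lanes were summed.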


In 64-bit code, movq rax, xmm0 / shld rdx, rax, 32 might be better than pextrd, and doesn't require SSE4.1.

A more normal mov rdx, rax / shr rdx, 32 might be more efficient than SHLD, even though it costs more uops on Intel CPUs. shld is slow on AMD CPUs, 8 uops on Zen. (https://uops.info/)

BMI2 rorx rdx, rax, 32 is a good way to copy-and-shift, and is efficient on all CPUs that support it. It of course leaves the high half of RDX probably non-zero, but that's fine.

Another option would be to use both movd and movq, if you're not close to bottlenecked on throughput for the single port they compete for. On most CPUs they can't actually run in parallel, so movd/movq competing for a port does still cost latency for the 2nd one. On a modern CPU with mov-elimination (Zen or IvyBridge), mov rdx, rax with zero latency is better. But this does get your values in EAX and EDX zero-extended into RAX and RDX.

    movq  rdx, xmm0       ; low 64 bits of xmm0
    movd  eax, xmm0       ; or schedule this first if you can use EAX right away
    shr   rdx, 32         ; EDX = bits 63:32, zero-extended into RDX

See the tag wiki for instruction-set references and other stuff.

See Agner Fog's excellent Optimizing Assembly guide for tips on which instructions to use.

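Notice: technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.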
