Is it possible to move 8 bits from an XMM register to memory without using general purpose registers?

I need to move 1 byte from an xmm register to memory without using general purpose registers. And also I can't use SSE4.1. Is it possible? =(
Normally you'd want to avoid this in the first place. For example, instead of doing separate byte stores, can you do one wider load and merge (pand/pandn/por if you don't have pblendvb), then store back the merge result?
That's not thread-safe (non-atomic RMW of the unmodified bytes), but as long as you know the bytes you're RMWing don't extend past the end of the array or struct, and no other threads are doing the same thing to other elements in the same array/struct, it's the normal way to do stuff like upper-case every lower-case letter in a string while leaving other bytes unmodified.
Single-uop stores are only possible from vector registers in 4, 8, 16, 32, or 64-byte sizes, except with AVX-512BW masked stores with only 1 element unmasked. Narrower stores like pextrb involve a shuffle uop to extract the 1 or 2 bytes to be stored.
The only good way to truly store exactly 1 byte without GP integer regs is with SSE4.1 pextrb [mem], xmm0, 0..15. That's still a shuffle + store even with an immediate 0 on current CPUs. If you can safely write 2 bytes at the destination location, SSE2 pextrw is usable.
You could use an SSE2 maskmovdqu byte-masked store (with a 0xff,0,0,... mask), but you don't want to because it's much slower than movd eax, xmm0 / mov [mem], al. e.g. on Skylake, 10 uops, 1 per 6 cycle throughput.
And it's worse than that if you want to reload the byte after, because (unlike AVX / AVX-512 masked stores) maskmovdqu has NT semantics like movntps (bypass cache, or evict the cache line if previously hot).
If your requirement is fully artificial and you just want to play silly computer tricks (avoiding ever having your data in registers), you could also set up scratch space, e.g. on the stack, and use movsb to copy it:
;; with destination address already in RDI
lea rsi, [rsp-4] ; scratch space in the red zone below RSP on non-Windows
movd [rsi], xmm0
movsb ; copy a byte, [rdi] <- [rsi], incrementing RSI and RDI
This is obviously slower than the normal way and needs an extra register (RSI) for the tmp buffer address. And you need the exact destination address in RDI, not [rel foo] static storage or any other flexible addressing mode.
pop can also copy mem-to-mem, but it's only available with 16-bit and 64-bit operand-size, so it can't save you from needing RSI and RDI.
Since the above way needs an extra register, it's worse in pretty much every way than the normal way:
movd esi, xmm0 ; pick any register.
mov [rdi], sil ; al..dl would avoid needing a REX prefix for low-8
;; or even use a register where you can read the low and high bytes separately
movd eax, xmm0
mov [rdi], al ; no REX prefix needed, more compact than SIL
mov [rsi], ah ; scatter two bytes reasonably efficiently
shr eax, 16 ; bring down the next 2 bytes
(Reading AH has an extra cycle of latency on current Intel CPUs, but it's fine for throughput, and we're storing here anyway so latency isn't much of a factor.)
xmm -> GP integer transfers are not slow on most CPUs. (Bulldozer-family is the outlier, but it's still comparable latency to store/reload; Agner Fog said in his microarch guide ( https://agner.org/optimize/ ) that he found AMD's optimization-manual suggestion to store/reload was not faster.)
It's hard to imagine a case where movsb could be better, since you already need a free register for that way, and movsb is multiple uops. Possibly if bottlenecked on port-0 uops for movd r32, xmm on current Intel CPUs? ( https://uops.info/ )