Is it possible to move 8 bits from an XMM register to memory without using general-purpose registers?

I need to move 1 byte from an xmm register to memory without using general-purpose registers. I also can't use SSE4.1. Is it possible?

=(

Normally you'd want to avoid this in the first place. For example, instead of doing separate byte stores, can you do one wider load and merge ( pand/pandn/por if you don't have pblendvb ), then store back the merge result?

That's not thread-safe (non-atomic RMW of the unmodified bytes), but as long as you know the bytes you're RMWing don't extend past the end of the array or struct, and no other threads are doing the same thing to other elements in the same array/struct, it's the normal way to do stuff like upper-casing every lower-case letter in a string while leaving other bytes unmodified.
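As a rough illustration of the load / merge / store idea, here is a minimal C sketch using the SSE2 intrinsics that correspond to pand / pandn / por. The helper name merge_byte_sse2 and its parameters are made up for this example, and it assumes all 16 bytes at dst are safe to read and write. A compiler is free to use integer registers while building the mask, so this only shows the merge logic, not a guarantee about register use.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: replace byte `index` of the 16-byte block at `dst`
   with byte `index` of `src`, leaving the other 15 bytes as they were.
   Non-atomic read-modify-write of all 16 bytes. */
static void merge_byte_sse2(uint8_t *dst, __m128i src, unsigned index)
{
    assert(index < 16);

    uint8_t mask_bytes[16] = {0};
    mask_bytes[index] = 0xFF;                         /* select exactly one byte lane */

    __m128i mask = _mm_loadu_si128((const __m128i *)mask_bytes);
    __m128i old  = _mm_loadu_si128((const __m128i *)dst);   /* wide load */
    __m128i keep = _mm_andnot_si128(mask, old);       /* pandn: (~mask) & old */
    __m128i take = _mm_and_si128(mask, src);          /* pand:  mask & src    */

    _mm_storeu_si128((__m128i *)dst, _mm_or_si128(keep, take)); /* por, wide store */
}
```

Calling merge_byte_sse2(buf, v, 3) on a 16-byte buffer would change only buf[3], although all 16 bytes are rewritten in memory.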


Single-uop stores are only possible from vector registers in 4, 8, 16, 32, or 64-byte sizes, except with AVX-512BW masked stores with only 1 element unmasked. Narrower stores like pextrb involve a shuffle uop to extract the 1 or 2 bytes to be stored.

The only good way to truly store exactly 1 byte without GP integer regs is with SSE4.1 pextrb [mem], xmm0, 0..15 . That's still a shuffle + store even with an immediate 0 on current CPUs. If you can safely write 2 bytes at the destination location, pextrw [mem], xmm0, 0..7 is usable, but note that the memory-destination form of pextrw is also SSE4.1; the SSE2 encoding of pextrw can only write to a general-purpose register.

You could use an SSE2 maskmovdqu byte-masked store (with a 0xff,0,0,... mask), but you don't want to because it's much slower than movd eax, xmm0 / mov [mem], al . For example, on Skylake it's 10 uops, with a throughput of one per 6 cycles.

And it's worse than that if you want to reload the byte afterwards, because (unlike AVX / AVX-512 masked stores), maskmovdqu has NT semantics like movntps (bypass cache, or evict the cache line if previously hot).
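For completeness, the maskmovdqu approach might look like the following C sketch, using the SSE2 intrinsic _mm_maskmoveu_si128. The function name store_lane_maskmov and its parameters are hypothetical. It writes byte index of src to base[index], since maskmovdqu stores each selected byte at its own offset from the base address. Again, the compiler may use integer registers to set up the mask; the point is only to show the masked store itself.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics, including _mm_maskmoveu_si128 */
#include <stdint.h>

/* Hypothetical helper: store only byte `index` of `src` to base[index]
   using a byte-masked store. Slow, and has non-temporal semantics. */
static void store_lane_maskmov(uint8_t *base, __m128i src, unsigned index)
{
    assert(index < 16);

    uint8_t mask_bytes[16] = {0};
    mask_bytes[index] = 0xFF;   /* only the high bit of each mask byte is checked */
    __m128i mask = _mm_loadu_si128((const __m128i *)mask_bytes);

    /* maskmovdqu: for each byte lane whose mask high bit is set,
       write that lane of src to the same offset from base. */
    _mm_maskmoveu_si128(src, mask, (char *)base);
}
```

With index 0 this corresponds to the 0xff,0,0,... mask mentioned above. It is shown for illustration, not as a recommendation.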


If your requirement is fully artificial and you just want to play silly computer tricks (avoiding ever having your data in registers), you could also set up scratch space, e.g. on the stack, and use movsb to copy it:

;; with destination address already in RDI
    lea  rsi, [rsp-4]          ; scratch space in the red zone below RSP on non-Windows
    movd  [rsi], xmm0
    movsb                   ; copy a byte, [rdi] <- [rsi], incrementing RSI and RDI

This is obviously slower than the normal way and needs an extra register (RSI) for the tmp buffer address. And you need the exact destination address in RDI, not [rel foo] static storage or any other flexible addressing mode.

pop can also copy mem-to-mem, but is only available with 16-bit and 64-bit operand-size, so it can't save you from needing RSI and RDI.

Since the above way needs an extra register, it's worse in pretty much every way than the normal way:

   movd  esi, xmm0            ; pick any register.
   mov   [rdi], sil           ; al..dl would avoid needing a REX prefix for low-8


;; or even use a register where you can read the low and high bytes separately
   movd  eax, xmm0
   mov   [rdi], al            ; no REX prefix needed, more compact than SIL
   mov   [rsi], ah            ; scatter two bytes reasonably efficiently
   shr   eax, 16              ; bring down the next 2 bytes

(Reading AH has an extra cycle of latency on current Intel CPUs, but it's fine for throughput, and we're storing here anyway so latency isn't much of a factor.)
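In C, the same "normal way" could be sketched roughly as below: one movd-style transfer to an integer, then ordinary byte stores. The function name scatter_low_two_bytes and its pointer parameters are invented for this example, and the exact instructions chosen (AL/AH vs. a shift) are up to the compiler.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: store the lowest byte of `v` to *d0 and the
   second-lowest byte to *d1, going through a general-purpose register. */
static void scatter_low_two_bytes(uint8_t *d0, uint8_t *d1, __m128i v)
{
    assert(d0 && d1);

    uint32_t low = (uint32_t)_mm_cvtsi128_si32(v);  /* movd r32, xmm */
    *d0 = (uint8_t)low;                             /* byte 0 of the vector */
    *d1 = (uint8_t)(low >> 8);                      /* byte 1 of the vector */
}
```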

xmm -> GP integer transfers are not slow on most CPUs. (Bulldozer-family is the outlier, but it's still comparable latency to store/reload; Agner Fog said in his microarch guide ( https://agner.org/optimize/ ) that he found AMD's optimization-manual suggestion to store/reload was not faster.)

It's hard to imagine a case where movsb could be better, since you already need a free register for that way, and movsb is multiple uops. Possibly if you're bottlenecked on port 0 uops for movd r32, xmm on current Intel CPUs? ( https://uops.info/ )
