
Set an XMM register to a repeating byte pattern (broadcast a constant byte)

I know that we can do something like this to load a character into an xmm register:

movaps xmm1, xword [.__0x20]

align 16
.__0x20 db 0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20

But since this goes through memory, I want to know if there is a better way. (Also, I'm talking about SSE2, not other SIMD extensions.)

I want every byte of the xmm1 register to be 0x20, not only one byte.

(Editor's note: this can be called a broadcast or splat. It's what the _mm_set1_epi8(0x20) intrinsic does.)

With only SSE2, loading the full pattern from memory is generally your best bet.

In your NASM source you can use times 16 db 0x20 for easy maintainability.
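For readers working from C rather than NASM, the same load-from-memory approach can be sketched with SSE2 intrinsics. This is only an illustration: the names space_pattern and load_space_pattern are hypothetical, and the alignment attribute assumes GCC or Clang.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical 16-byte constant, the C analogue of `times 16 db 0x20`.
   The GCC/Clang attribute syntax is an assumption; it provides the
   16-byte alignment that an aligned load requires. */
static const uint8_t space_pattern[16] __attribute__((aligned(16))) = {
    0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
    0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20
};

/* Aligned 16-byte load of the whole pattern; compilers typically emit
   movdqa or movaps for this. */
static __m128i load_space_pattern(void)
{
    return _mm_load_si128((const __m128i *)space_pattern);
}
```

In practice _mm_set1_epi8(0x20) is the shorter way to write this, and with only SSE2 enabled a compiler will usually turn it into the same kind of 16-byte constant load.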


With SSE3 you can do 8-byte broadcast loads with movddup. With AVX you can do a 4-byte broadcast load with vbroadcastss. These broadcast loads are very good on modern CPUs: they run on just the load port and don't need a shuffle uop, i.e. they're exactly as cheap as movaps on CPUs that support them, except for a byte or two more code size. The same goes for vbroadcastf128 to YMM registers.

Most compilers don't seem to realize this and will do constant propagation through _mm_set1 even when that results in a 32-byte constant instead of 4 bytes, and even when the constant is just loaded with a mov... instruction ahead of a loop rather than folded into a memory operand for an ALU instruction. (And folding is still possible with a broadcast load when AVX512 is available.) Clang does sometimes take advantage of broadcast loads for simple constants.

AVX2 adds vpbroadcastb/w/d/q, but only dword and qword are pure load uops. Byte and word broadcast loads need an ALU shuffle uop, so for constant byte patterns you probably want to just broadcast-load a dword that repeats the byte 4 times. (Unless it's an element from a big lookup table; then compress the table by using a byte or word broadcast load, or a pmovsx sign-extending load, or whatever.)

AVX512 adds vpbroadcastb/w/d/q from an integer register, so you could mov eax, 0x20202020 / vpbroadcastd xmm0, eax if you have AVX512VL.


With SSE2 it would take at least 2 instructions, including an ALU shuffle, like this, and may not be worth it.

    movd    xmm0, [const_4B]
    pshufd  xmm0, xmm0, 0
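The two-instruction sequence above can be sketched in C with SSE2 intrinsics. The function name broadcast_dword_sse2 is hypothetical, and the value here comes from an integer argument rather than a const_4B memory location; whether the compiler uses a register or a memory source for the movd is its own choice.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: copy a 32-bit pattern into all four dword lanes. */
static __m128i broadcast_dword_sse2(uint32_t pattern)
{
    /* movd: the low dword becomes `pattern`, the upper 96 bits are zeroed. */
    __m128i v = _mm_cvtsi32_si128((int)pattern);
    /* pshufd with imm8 = 0: every lane selects source lane 0. */
    return _mm_shuffle_epi32(v, 0);
}
```

Calling broadcast_dword_sse2(0x20202020u) gives a register in which every byte is 0x20, at the cost of the extra shuffle uop discussed above.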

Some repeating constants can be generated on the fly in a couple of instructions, starting with all-ones from pcmpeqd xmm0,xmm0. See "What are the best instruction sequences to generate vector constants on the fly?" and Agner Fog's guide.

This pattern does not appear to be easy to generate. It's a byte pattern (not word, dword, or qword), and SSE shifts are only available with word granularity at best. However, if we know the bits shifted across byte boundaries are 0, it's fine, e.g.

   pcmpeqd  xmm0, xmm0     ; set1( -1 )
   pabsb    xmm0, xmm0     ; set1_epi8(1)    SSSE3
   pslld    xmm0, 5        ; set1_epi8(1<<5)

; or with only SSE2, something even less efficient like shift / packsswb / shift
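One possible reading of the SSE2-only "shift / packsswb / shift" idea, written with intrinsics as a rough sketch. The function name make_0x20_bytes_sse2 is hypothetical, and an optimizing compiler may well fold the whole thing into a constant load, so treat it as an illustration of the instruction sequence rather than a guarantee about generated code.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: build sixteen 0x20 bytes without a memory constant. */
static __m128i make_0x20_bytes_sse2(void)
{
    __m128i v = _mm_setzero_si128();
    v = _mm_cmpeq_epi32(v, v);   /* pcmpeqd:  all bits set, i.e. set1(-1)     */
    v = _mm_srli_epi16(v, 15);   /* psrlw 15: each word becomes 0x0001        */
    v = _mm_packs_epi16(v, v);   /* packsswb: each byte becomes 0x01          */
    v = _mm_slli_epi32(v, 5);    /* pslld 5:  each byte becomes 0x20; the bits
                                    shifted across byte boundaries are all 0  */
    return v;
}
```

That is four dependent ALU instructions in place of one load, which is why it is even less attractive than the SSSE3 version.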

This is unlikely to be worth it unless you really want to avoid the possibility of a cache miss for the constant. On average a load will usually come out ahead.


 