
Broadcast a byte value to all 16 XMM slots in Delphi ASM

This is easy in AVX with the VBROADCASTSS / VBROADCASTSD instructions, or in SSE if the values were doubles or floats.

How do I broadcast a single 8-bit value to every slot in an XMM register in Delphi ASM?

Michael's answer will work. As an alternative, if you can assume the SSSE3 instruction set, then using Packed Shuffle Bytes pshufb would also work.

Assuming (1) an 8-bit value in AL (for example), (2) that the desired broadcast destination is XMM1, and (3) that another register, say XMM0, is available, this will do the trick:

movd   xmm1, eax  ;// move value in AL (part of EAX) into XMM1
pxor   xmm0, xmm0 ;// clear xmm0 to create the appropriate mask for pshufb
pshufb xmm1, xmm0 ;// broadcast lowest value into all slots of xmm1

And yes, Delphi's BASM understands SSSE3.

You mean you have a byte in the LSB of an XMM register and want to duplicate it across all lanes of that register? I don't know Delphi's inline assembly syntax, but in Intel/MASM syntax it could be done something like this:

punpcklbw xmm0,xmm0    ; xxxxxxxxABCDEFGH -> xxxxxxxxEEFFGGHH
punpcklwd xmm0,xmm0    ; xxxxxxxxEEFFGGHH -> xxxxxxxxGGGGHHHH
punpckldq xmm0,xmm0    ; xxxxxxxxGGGGHHHH -> xxxxxxxxHHHHHHHH
punpcklqdq xmm0,xmm0   ; xxxxxxxxHHHHHHHH -> HHHHHHHHHHHHHHHH

The fastest option is pshufb, if SSSE3 is available.

; SSSE3
pshufb      xmm0,  xmm1       ; where xmm1 is zeroed, e.g. with pxor xmm1,xmm1

Otherwise you should usually use this:

; SSE2 only
punpcklbw   xmm0, xmm0        ; xxxxxxxxABCDEFGH -> xxxxxxxxEEFFGGHH
pshuflw     xmm0, xmm0, 0     ; xxxxxxxxEEFFGGHH -> xxxxxxxxHHHHHHHH
punpcklqdq  xmm0, xmm0        ; xxxxxxxxHHHHHHHH -> HHHHHHHHHHHHHHHH
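
For readers who want to check the SSE2-only sequence outside Delphi, here is a minimal C sketch using the intrinsics that correspond to each instruction. The function names broadcast_byte_sse2 and broadcast_byte_to_array are hypothetical helpers made up for this illustration, and the sketch assumes an x86 target with SSE2 enabled (the default on x86-64).

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: broadcast b to all 16 bytes using only SSE2. */
static inline __m128i broadcast_byte_sse2(uint8_t b)
{
    __m128i v = _mm_cvtsi32_si128(b);   /* movd: b in byte 0, all other bytes zero */
    v = _mm_unpacklo_epi8(v, v);        /* punpcklbw: bytes 0 and 1 now both hold b */
    v = _mm_shufflelo_epi16(v, 0);      /* pshuflw ..., 0: word 0 copied to the low 4 words */
    v = _mm_unpacklo_epi64(v, v);       /* punpcklqdq: low qword copied to the high qword */
    return v;
}

/* Hypothetical helper: store the result to memory so it can be inspected. */
static inline void broadcast_byte_to_array(uint8_t b, uint8_t out[16])
{
    _mm_storeu_si128((__m128i *)out, broadcast_byte_sse2(b));
}
```

Calling broadcast_byte_to_array(0xAB, out) should leave every element of out equal to 0xAB. A compiler is free to pick different instructions for these intrinsics, so this checks the data movement, not the timing.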

This is better than punpckl bw / wd -> pshufd xmm0, xmm0, 0 because some CPUs (including Merom and K8) have only 64-bit shuffle units. On such CPUs, pshuflw is fast, and so is punpcklqdq, but pshufd and punpck with granularity finer than 64 bits are slow. So this sequence uses only one "slow shuffle" instruction, vs. 3 for bw / wd / pshufd.

On all later CPUs, there's no difference between those two 3-instruction sequences, so it doesn't cost us anything to tune for old CPUs in this case. See also http://agner.org/optimize/ for instruction tables.

This is the sequence from Michael's answer with the middle two instructions replaced by pshuflw.


If your byte is in an integer register to start with, you can use a multiply by 0x01010101 to broadcast it to 4 bytes, e.g.:

; movzx   eax, whatever

imul   edx, eax, 0x01010101    ; edx = al repeated 4 times

movd   xmm0, edx               ; the imul result is in edx, not eax
pshufd xmm0, xmm0, 0

Note that imul's non-immediate source operand can be memory, but it has to be a 32-bit memory location with your byte zero-extended to 32 bits.
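
The multiply works because 0x01010101 = 1 + 2^8 + 2^16 + 2^24, so the product places one copy of the byte in each of the four byte positions, and no carries occur as long as the input is below 256. A small portable C sketch of that arithmetic follows; broadcast_byte_x4 is a hypothetical name used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper modelling `imul edx, eax, 0x01010101`. */
static inline uint32_t broadcast_byte_x4(uint32_t zero_extended_byte)
{
    /* The upper 24 bits must be clear, otherwise they leak into the product. */
    assert(zero_extended_byte <= 0xFFu);
    return zero_extended_byte * UINT32_C(0x01010101);
}
```

For example, broadcast_byte_x4(0xAB) returns 0xABABABAB. The assert stands in for the zero-extension requirement (the commented-out movzx in the listing above).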


If your data starts in memory, loading into an integer register first is probably not worth it. Just movd to an xmm register. (Or possibly pinsrb if you need to avoid a wider load to avoid crossing a page or maybe a cache line. But that has a false dependency on the old value of the register where movd doesn't.)

If instruction throughput is more of an issue than latency, it can be worth considering pmuludq if you can't use pshufb, even though it has 5-cycle latency on most CPUs.

; low 32 bits of xmm0 = your byte, **zero extended**
pmuludq xmm0, xmm7        ; xmm7 = 0x01010101 in the low 32 bits
pshufd  xmm0, xmm0, 0
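
The pmuludq variant can be sketched the same way with C intrinsics. Here broadcast_byte_pmuludq is a hypothetical name, the local constant stands in for the xmm7 register used above, and the sketch again assumes an x86 target with SSE2.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: broadcast b via a 32x32-bit SIMD multiply. */
static inline __m128i broadcast_byte_pmuludq(uint8_t b)
{
    const __m128i k = _mm_cvtsi32_si128(0x01010101); /* stand-in for xmm7 */
    __m128i v = _mm_cvtsi32_si128(b);   /* movd: zero-extended byte in the low dword */
    v = _mm_mul_epu32(v, k);            /* pmuludq: low dword = b repeated 4 times */
    return _mm_shuffle_epi32(v, 0);     /* pshufd ..., 0: dword 0 copied to all 4 dwords */
}
```

The product of a byte and 0x01010101 never exceeds 32 bits, so only the low dword of the 64-bit pmuludq result is non-zero, and pshufd then replicates it.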
