I need to copy a 16-bit word into all eight word lanes of an xmm register for SSE operations.
E.g., I'd like to broadcast the 16-bit word ABCD into the xmm0 register, so that the final result looks like
ABCD | ABCD | ABCD | ABCD | ABCD | ABCD | ABCD | ABCD
I want to do this in order to use the paddw operation later on. So far I've found the pshufd operation, which does what I want, but only for double words (32-bit). pshufw only works, if I'm not mistaken, on 64-bit registers. Is there an operation that does what I'm looking for, or do I have to emulate it in some way with multiple pshufw?
You can achieve the desired goal by performing a shuffle and then an unpack. In NASM syntax:
; load 16 bits from memory into all words of xmm0
; assuming 16-byte alignment
pshuflw xmm0, [mem], 0 ; gives you [ M, M, M, M, ?, ?, ?, ? ]
punpcklwd xmm0, xmm0 ; gives you [ M, M, M, M, M, M, M, M ]
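To see why a shuffle followed by an unpack yields eight copies, here is a minimal scalar sketch in C that models the lane movement of the two instructions on plain arrays. It is only an illustration: the function names (model_pshuflw, model_punpcklwd, model_broadcast_word0) are made up for this example, and nothing here runs actual SSE code.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of PSHUFLW dst, src, imm8: each of the four low words of dst
   is picked from the four low words of src by a 2-bit field of imm8; the
   four high words are copied from src unchanged. */
void model_pshuflw(uint16_t dst[8], const uint16_t src[8], uint8_t imm8)
{
    uint16_t tmp[8];                       /* temp so dst may alias src */
    for (int i = 0; i < 4; i++)
        tmp[i] = src[(imm8 >> (2 * i)) & 3];
    for (int i = 4; i < 8; i++)
        tmp[i] = src[i];
    memcpy(dst, tmp, sizeof tmp);
}

/* Scalar model of PUNPCKLWD dst, src: interleave the four low words of dst
   and src as dst0, src0, dst1, src1, ... */
void model_punpcklwd(uint16_t dst[8], const uint16_t src[8])
{
    uint16_t tmp[8];                       /* temp so dst may alias src */
    for (int i = 0; i < 4; i++) {
        tmp[2 * i]     = dst[i];
        tmp[2 * i + 1] = src[i];
    }
    memcpy(dst, tmp, sizeof tmp);
}

/* The two-step sequence: copy word 0 of x into all eight words of x. */
void model_broadcast_word0(uint16_t x[8])
{
    model_pshuflw(x, x, 0);   /* imm8 = 0: every low word selects word 0 */
    model_punpcklwd(x, x);    /* interleaving the low half with itself fills all 8 */
}
```

With imm8 = 0 the shuffle leaves the low four words equal to word 0 and the high four untouched (the "?" lanes); the unpack then reads only the low four words, so the leftover high lanes never reach the result.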
Note that pshuflw with a memory operand reads 16 bytes from mem and thus requires 16-byte alignment, even though only the first 2 bytes are actually used. If the number is not in memory, or you can't guarantee that reading past the end of the value is safe, use something like this:
; load ax into all words of xmm0
movd xmm0, eax ; or movd xmm0, [mem] 4-byte load
pshuflw xmm0, xmm0, 0
punpcklwd xmm0, xmm0
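If you are writing C rather than assembly, the same register-source sequence can be sketched with SSE2 intrinsics. This is a rough equivalent under a few assumptions: broadcast_word is a hypothetical helper name, the comments show the instruction each intrinsic corresponds to (the compiler is free to emit something different), and the plain-loop fallback exists only so the sketch still builds on targets without SSE2.

```c
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Broadcast w to eight 16-bit lanes and store them to out[0..7]. */
void broadcast_word(uint16_t w, uint16_t out[8])
{
    __m128i v = _mm_cvtsi32_si128(w);     /* movd      xmm0, eax     */
    v = _mm_shufflelo_epi16(v, 0);        /* pshuflw   xmm0, xmm0, 0 */
    v = _mm_unpacklo_epi16(v, v);         /* punpcklwd xmm0, xmm0    */
    _mm_storeu_si128((__m128i *)out, v);  /* unaligned 16-byte store */
}
#else
/* Fallback for non-SSE2 targets: same result, no SIMD. */
void broadcast_word(uint16_t w, uint16_t out[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = w;
}
#endif
```

In practice _mm_set1_epi16(w) expresses the same broadcast and leaves the choice of instructions to the compiler; the explicit version above is only meant to mirror the assembly step by step.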
With AVX2, you can use a vpbroadcast* instruction (vpbroadcastw for 16-bit words), either as a broadcast load or as a broadcast from an XMM register source. The destination can be YMM if you like.
vpbroadcastw xmm0, [mem] ; 16-bit load + broadcast
Or
vmovd xmm0, eax
vpbroadcastw xmm0, xmm0
Memory-source broadcasts of 1 or 2-byte elements still decode to a load+shuffle uop on Intel CPUs, but broadcast-loads of 4-byte or 8-byte chunks are even cheaper: handled in the load port with no shuffle uop needed.
Either way this is still cheaper than the 2 separate shuffles you need without AVX2 or SSSE3 pshufb.