Broadcast a word into an xmm register

Question

I need to move a 16-bit word eight times into an xmm register for SSE operations.

E.g., I'd like to broadcast the 16-bit word ABCD into the xmm0 register, so that the final result looks like:

ABCD | ABCD | ABCD | ABCD | ABCD | ABCD | ABCD | ABCD

I want to do this in order to use the paddw operation later on. So far I've found the pshufd operation, which does what I want, but only for double words (32-bit). pshufw only works, if I'm not mistaken, on 64-bit registers. Is there an operation that does what I'm looking for, or do I have to emulate it in some way with multiple pshufw?

Answer 1

You can achieve the desired goal by performing a shuffle and then an unpack. In NASM syntax:

    ; load 16 bits from memory into all words of xmm0
    ; assuming 16-byte alignment
    pshuflw   xmm0, [mem], 0   ; gives you [ M, M, M, M, ?, ?, ?, ? ]
    punpcklwd xmm0, xmm0       ; gives you [ M, M, M, M, M, M, M, M ]

Note that this reads 16 bytes from mem and thus requires 16-byte alignment.

Only the first 2 bytes are actually used. If the number is not in memory, or you can't guarantee that reading past the end is possible, use something like this:

    ; load ax into all words of xmm0
    movd      xmm0, eax                  ; or movd xmm0, [mem]  4-byte load
    pshuflw   xmm0, xmm0, 0
    punpcklwd xmm0, xmm0
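
For readers working from C rather than assembly, the same three-instruction sequence can be sketched with SSE2 intrinsics. This is only an illustration under assumptions: the function names `broadcast_word_sse2` and `add_word_to_all` are hypothetical and not part of the original answer, and the code assumes an x86 target with SSE2 enabled (the default on x86-64).

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: broadcast a 16-bit value to all eight words of an
   XMM register, mirroring the movd / pshuflw / punpcklwd sequence. */
static inline __m128i broadcast_word_sse2(uint16_t w)
{
    __m128i v = _mm_cvtsi32_si128(w);   /* movd: w in word 0, all other bits zero */
    v = _mm_shufflelo_epi16(v, 0);      /* pshuflw imm 0: word 0 copied into words 0..3 */
    return _mm_unpacklo_epi16(v, v);    /* punpcklwd: low four words interleaved with themselves */
}

/* Hypothetical helper: add w to each of eight 16-bit words, which is the
   paddw use case from the question. Unaligned loads and stores are used so
   the caller does not need 16-byte alignment. */
static inline void add_word_to_all(uint16_t *dst, const uint16_t *src, uint16_t w)
{
    __m128i a = _mm_loadu_si128((const __m128i *)src);      /* movdqu load */
    __m128i b = broadcast_word_sse2(w);
    _mm_storeu_si128((__m128i *)dst, _mm_add_epi16(a, b));  /* paddw, then movdqu store */
}
```

In practice `_mm_set1_epi16(w)` expresses the same broadcast and leaves the choice of instructions to the compiler; the explicit version above is shown only to map one-to-one onto the assembly.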

With AVX2, you can use a vpbroadcast* broadcast load or a broadcast from a register source. The destination can be YMM if you like.

    vpbroadcastw  xmm0, [mem]            ; 16-bit load + broadcast

Or

    vmovd         xmm0, eax
    vpbroadcastw  xmm0, xmm0

Memory-source broadcasts of 1-byte or 2-byte elements still decode to a load + shuffle uop on Intel CPUs, but broadcast loads of 4-byte or 8-byte chunks are even cheaper: they are handled in the load port with no shuffle uop needed.

Either way, this is still cheaper than the 2 separate shuffles you need without AVX2 or SSSE3 pshufb.

Answer 1: score 4, accepted, 2019-07-11 14:52:30
