How to pack 16 16-bit registers/variables into an AVX register

Question

I use inline assembly; my code looks like this:

__m128i inl = _mm256_castsi256_si128(in);
__m128i inh = _mm256_extractf128_si256(in, 1); 
__m128i outl, outh;
__asm__(
    "vmovq %2, %%rax                        \n\t"
    "movzwl %%ax, %%ecx                     \n\t"
    "shr $16, %%rax                         \n\t"
    "movzwl %%ax, %%edx                     \n\t"
    "movzwl s16(%%ecx, %%ecx), %%ecx        \n\t"
    "movzwl s16(%%edx, %%edx), %%edx        \n\t"
    "xorw %4, %%cx                          \n\t"
    "xorw %4, %%dx                          \n\t"
    "rolw $7, %%cx                          \n\t"
    "rolw $7, %%dx                          \n\t"
    "movzwl s16(%%ecx, %%ecx), %%ecx        \n\t"
    "movzwl s16(%%edx, %%edx), %%edx        \n\t"
    "pxor %0, %0                            \n\t"
    "vpinsrw $0, %%ecx, %0, %0              \n\t"
    "vpinsrw $1, %%edx, %0, %0              \n\t"

: "=x" (outl), "=x" (outh)
: "x" (inl), "x" (inh), "r" (subkey)
: "%rax", "%rcx", "%rdx"
);

I omitted some vpinsrw instructions here; this is more or less to show the principle. The real code uses 16 vpinsrw operations. But the output doesn't match the expected result.

b0f0 849f 446b 4e4e e553 b53b 44f7 552b 67d  1476 a3c7 ede8 3a1f f26c 6327 bbde
e553 b53b 44f7 552b    0    0    0    0 b4b3 d03e 6d4b c5ba 6680 1440 c688 ea36

The first line is the correct answer, and the second line is my result. The C code is here:

for(i = 0; i < 16; i++)
{  
    arr[i] = (u16)(s16[arr[i]] ^ subkey);
    arr[i] = (arr[i] << 7) | (arr[i] >> 9);
    arr[i] = s16[arr[i]];

}
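
A self-contained version of this loop is handy as a reference when checking a faster implementation against it. The following is a minimal sketch, assuming the table has 65536 entries and is passed in as a pointer; the names `rotl16` and `round_ref` are made up for illustration and are not from the original code.

```c
#include <stddef.h>
#include <stdint.h>

/* Rotate a 16-bit value left by n bits (0 < n < 16). */
static inline uint16_t rotl16(uint16_t x, unsigned n)
{
    return (uint16_t)((x << n) | (x >> (16 - n)));
}

/* Reference round: lookup, xor with subkey, rotate left by 7, lookup again.
 * `table` is assumed to have 65536 entries (hypothetical parameter; the
 * original code uses a global s16 table instead). */
static void round_ref(uint16_t *arr, size_t n, const uint16_t *table, uint16_t subkey)
{
    for (size_t i = 0; i < n; i++) {
        uint16_t v = (uint16_t)(table[arr[i]] ^ subkey);
        v = rotl16(v, 7);
        arr[i] = table[v];
    }
}
```

Running the vectorized code and `round_ref` on the same input and comparing element by element shows which output positions are wrong.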

My task is to make this code faster.

In the older code, the data is stored from ymm to the stack and then loaded from the stack into 16-bit registers, as shown in the code below. I want to move the data directly from ymm into 16-bit registers instead.

__asm__(     

    "vmovdqa %0, -0xb0(%%rbp)               \n\t"

    "movzwl -0xb0(%%rbp), %%ecx             \n\t"
    "movzwl -0xae(%%rbp), %%eax             \n\t"
    "movzwl s16(%%ecx, %%ecx), %%ecx        \n\t"
    "movzwl s16(%%eax, %%eax), %%eax        \n\t"
    "xorw %1, %%cx                          \n\t"
    "xorw %1, %%ax                          \n\t"
    "rolw $7, %%cx                          \n\t"
    "rolw $7, %%ax                          \n\t"
    "movzwl s16(%%ecx, %%ecx), %%ecx        \n\t"
    "movzwl s16(%%eax, %%eax), %%eax        \n\t"
    "movw %%cx, -0xb0(%%rbp)                \n\t"
    "movw %%ax, -0xae(%%rbp)                \n\t"

Answer 1

On Skylake (where gather is fast), it might well be a win to chain two gathers together using Aki's answer. That lets you do the rotate very efficiently using vector-integer instructions.

On Haswell, it might be faster to keep using your scalar code, depending on what the surrounding code looks like. (Or maybe doing the rotate+xor with vector code is still a win. Try it and see.)

You have one really bad performance mistake, and a couple of other problems:

"pxor %0, %0                            \n\t"
"vpinsrw $0, %%ecx, %0, %0              \n\t"

Using a legacy-SSE pxor to zero the low 128b of %0 while leaving the upper 128b unmodified will cause an SSE-AVX transition penalty on Haswell; about 70 cycles each on the pxor and the first vpinsrw, I think. On Skylake, it will only be slightly slower, and have a false dependency.

Instead, use vmovd %%ecx, %0, which zeros the upper bytes of the vector reg (thus breaking the dependency on the old value).

Actually, use

"vmovd        s16(%%rcx, %%rcx), %0       \n\t"   // leaves garbage in element 1, which you over-write right away
"vpinsrw  $1, s16(%%rdx, %%rdx), %0, %0   \n\t"
...

It's a huge waste of instructions (and uops) to load into integer registers and then go from there into vectors, when you could insert directly into vectors.

Your indices are already zero-extended, so I used 64-bit addressing modes to avoid wasting an address-size prefix on each instruction. (Since your table is static, it's in the low 2G of virtual address space (in the default code-model), so 32-bit addressing did actually work, but it gained you nothing.)

I experimented a while ago with getting scalar LUT results (for GF16 multiply) into vectors, tuning for Intel Sandybridge. I wasn't chaining the LUT lookups like you are, though. See https://github.com/pcordes/par2-asm-experiments . I kind of abandoned it after finding out that GF16 is more efficient with pshufb as a 4-bit LUT, but anyway I found that pinsrw from memory into a vector was good if you don't have gather instructions.

You might want to give more ILP by interleaving operations on two vectors at once. Or maybe even into the low 64b of 4 vectors, and combine with vpunpcklqdq. (vmovd is faster than vpinsrw, so it's pretty much break-even on uop throughput.)

"xorw %4, %%cx                          \n\t"
"xorw %4, %%dx                          \n\t"

These can and should be xor %[subkey], %%ecx. 32-bit operand-size is more efficient here, and works fine as long as your input doesn't have any bits set in the upper 16. Use a [subkey] "ri" (subkey) constraint to allow an immediate value when it's known at compile-time. (That's probably better, and reduces register pressure slightly, but at the expense of code-size since you use it many times.)
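
The equivalence of the 16-bit and 32-bit xor can be checked in plain C. This is only a small sketch with made-up helper names, assuming both the table value and the subkey are zero-extended 16-bit values:

```c
#include <stdint.h>

/* What xorw does: xor only the low 16 bits. */
static inline uint16_t xor_16bit(uint16_t v, uint16_t key)
{
    return (uint16_t)(v ^ key);
}

/* What a 32-bit xor does on zero-extended operands. If neither operand has
 * bits set above bit 15, the result has none either, so it can be used as
 * a table index without re-extending it. */
static inline uint32_t xor_32bit(uint32_t v, uint32_t key)
{
    return v ^ key;
}
```

If the subkey could ever carry bits above bit 15, the 32-bit result would no longer be a valid 16-bit index, which is the caveat stated above.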

The rolw instructions have to stay 16-bit, though.

You could consider packing two or four values into an integer register (with movzwl s16(...), %%ecx / shl $16, %%ecx / mov s16(...), %%cx / shl $16, %%rcx / ...), but then you'd have to emulate the rotates with shifting / or and masking. And unpack again to reuse them as indices.
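
The rotate emulation on packed values could look like this in C. It is a minimal sketch for four 16-bit values held in one 64-bit integer; the function names are hypothetical.

```c
#include <stdint.h>

/* Rotate each of the four 16-bit lanes packed in x left by 7 bits.
 * The masks remove bits that the 64-bit shifts carried across lane
 * boundaries. */
static inline uint64_t rotl7_4x16(uint64_t x)
{
    const uint64_t keep_high9 = 0xFF80FF80FF80FF80ULL; /* bits 7..15 of each lane */
    const uint64_t keep_low7  = 0x007F007F007F007FULL; /* bits 0..6 of each lane  */
    return ((x << 7) & keep_high9) | ((x >> 9) & keep_low7);
}

/* Unpack lane i (0..3) again so it can be used as a table index. */
static inline unsigned lane16(uint64_t x, unsigned i)
{
    return (unsigned)(x >> (16 * i)) & 0xffffu;
}
```

The unpacking step is still needed for every element before the second lookup, so the packed rotate only saves work on the rotate itself.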

It's too bad the integer stuff comes between two LUT lookups, otherwise you could do it in a vector before unpacking.

Your strategy for extracting 16b chunks of a vector looks pretty good. movd/movq from xmm to a GP register runs on port 0 on Haswell/Skylake, and shr / ror runs on port0 / port6. So you do compete for ports some, but storing the whole vector and reloading it would take more load ports.

It might be worth trying a 256b store, but still getting the low 64b from a vmovq so the first 4 elements can get started without as much latency.

As for getting the wrong answer: use a debugger. Debuggers work very well for asm; see the end of the x86 tag wiki for some tips on using GDB.

Look at the compiler-generated code that interfaces between your asm and what the compiler is doing: maybe you got a constraint wrong.

Maybe you got mixed up with %0 or %1 or something. I'd definitely recommend using %[name] instead of operand numbers. See also the inline-assembly tag wiki for links to guides.

C version that avoids inline asm (but gcc wastes instructions on it).

You don't need inline asm for this at all, unless your compiler is doing a bad job unpacking the vector to 16-bit elements, and not generating the code you want. https://gcc.gnu.org/wiki/DontUseInlineAsm

I put this up on Matt Godbolt's compiler explorer where you can see the asm output.

// This probably compiles to code like your inline asm
#include <x86intrin.h>
#include <stdint.h>

extern const uint16_t s16[];

__m256i LUT_elements(__m256i in)
{
    __m128i inl = _mm256_castsi256_si128(in);
    __m128i inh = _mm256_extractf128_si256(in, 1);

    unsigned subkey = 8;
    uint64_t low4 = _mm_cvtsi128_si64(inl);  // movq extract the first elements
    unsigned idx = (uint16_t)low4;
    low4 >>= 16;

    idx = s16[idx] ^ subkey;
    idx = __rolw(idx, 7);
    // cast to a 32-bit pointer to convince gcc to movd directly from memory
    // the strict-aliasing violation won't hurt since the table is const.

    __m128i outl = _mm_cvtsi32_si128(*(const uint32_t*)&s16[idx]);

    unsigned idx2 = (uint16_t)low4;
    idx2 = s16[idx2] ^ subkey;
    idx2 = __rolw(idx2, 7);
    outl = _mm_insert_epi16(outl, s16[idx2], 1);

    // ... do the rest of the elements

    __m128i outh = _mm_setzero_si128();  // dummy upper half
    return _mm256_inserti128_si256(_mm256_castsi128_si256(outl), outh, 1);
}

I had to pointer-cast to get a vmovd directly from the LUT into a vector for the first s16[idx]. Without that, gcc uses a movzx load into an integer reg and then a vmovd from there. That avoids any risk of a cache-line split or page-split from doing a 32-bit load, but that risk may be worth it for average throughput since this probably bottlenecks on front-end uop throughput.
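
If the strict-aliasing violation is a concern, a memcpy-based load is the usual well-defined alternative. This is a sketch only: the helper name is made up, and whether gcc folds it into a single vmovd from memory the way the pointer cast does would need checking in the compiler output.

```c
#include <stdint.h>
#include <string.h>

/* Load 32 bits starting at a 16-bit table entry without a pointer cast.
 * On little-endian x86 the low 16 bits are p[0] and the high 16 bits are
 * p[1]. Like the pointer cast, this reads 2 bytes past the entry itself,
 * so the last table entry needs padding after it. */
static inline uint32_t load32_from_u16(const uint16_t *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

It would be used as `_mm_cvtsi32_si128((int)load32_from_u16(&s16[idx]))` in place of the cast expression.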

Note the use of __rolw from x86intrin.h. gcc supports it, but clang doesn't. It compiles to a 16-bit rotate with no extra instructions.

Unfortunately gcc doesn't realize that the 16-bit rotate keeps the upper bits of the register zeroed, so it does a pointless movzwl %dx, %edx before using %rdx as an index. This is a problem even with gcc7.1 and 8-snapshot.

And BTW, gcc loads the s16 table address into a register, so it can use addressing modes like vmovd (%rcx,%rdx,2), %xmm0 instead of embedding the 4-byte address into every instruction.

Since the extra movzx is the only thing gcc is doing worse than you could do by hand, you might consider just making a rotate-by-7 function in inline asm that gcc thinks takes 32 or 64-bit input registers. Use something like this to get a "half"-sized rotate, i.e. 16 bits:

// pointer-width integers don't need to be re-extended
// but since gcc doesn't understand the asm, it thinks the whole 64-bit result may be non-zero
static inline
uintptr_t my_rolw(uintptr_t a, int count) {
    asm("rolw %b[count], %w[val]" : [val]"+r"(a) : [count]"ic"(count));
    return a;
}

However, even with that, gcc still wants to emit useless movzx or movl instructions. I got rid of some zero-extension by using wider types for idx, but there are still problems (source on the compiler explorer). Having subkey be a function arg instead of a compile-time constant helps, for some reason.

You might be able to get gcc to assume something is a zero-extended 16-bit value with:

if (x > 65535)
    __builtin_unreachable();

Then you could completely drop any inline asm, and just use __rolw.

But beware that icc will compile that to an actual check and then a jump beyond the end of the function. It should work for gcc, but I didn't test.
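
Put together, the hint plus a rotate could look like the sketch below. It uses the portable shift/or rotate idiom rather than __rolw, since clang lacks __rolw; that substitution and the function name are assumptions of this sketch, and whether gcc actually drops the extra movzx with it is untested here as well.

```c
#include <stdint.h>

/* Rotate a zero-extended 16-bit value left by 7.
 * __builtin_unreachable() (a gcc/clang builtin) tells the optimizer that
 * x never has bits set above bit 15; calling this with x > 65535 is
 * undefined behaviour. */
static inline unsigned rotl16_by7_hinted(unsigned x)
{
    if (x > 65535)
        __builtin_unreachable();
    return ((x << 7) | (x >> 9)) & 0xffffu;
}
```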

It's pretty reasonable to just write the whole thing in inline asm if it takes this much tweaking to get the compiler not to shoot itself in the foot, though.

Answer 2

The inline assembly slightly resembles the C code, so I would be tempted to assume that the two are meant to do the same thing.

This is primarily an opinion, but I would suggest using intrinsics instead of the extended assembler. Intrinsics leave register allocation and variable optimization to the compiler, and they are portable: each vector operation can be emulated by a function in the absence of the target instruction set.

The next issue is that the inlined source code appears to handle the substitution block arr[i] = s16[arr[i]] for only two indices i. Using AVX2, this should be done either by two gather operations, since a Y-register can hold only 8 uint32_t values or offsets into the lookup table, or, when one is available, by an analytical function for the substitution stage that can be run in parallel.

Using intrinsics, the operation could look something like this.

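#include <immintrin.h>
#include <stdint.h>

// Lookup table. Each gather lane loads 32 bits, so pad it by 2 bytes at the end.
extern const uint16_t s16[];

__m256i function(const uint16_t *input_array, uint16_t subkey) {
  const __m256i mask16 = _mm256_set1_epi32(0xffff);
  const __m256i key = _mm256_set1_epi16((short)subkey);
  const int *lut = (const int *)s16;  // gather takes int*; scale 2 steps by 16-bit entries
  __m256i array = _mm256_loadu_si256((const __m256i *)input_array);
  __m256i even_sequence = _mm256_and_si256(array, mask16);
  __m256i odd_sequence = _mm256_srli_epi32(array, 16);
  // first substitution: s16[arr[i]]; the upper 16 bits of each lane hold s16[arr[i] + 1]
  even_sequence = _mm256_i32gather_epi32(lut, even_sequence, 2);
  odd_sequence = _mm256_i32gather_epi32(lut, odd_sequence, 2);
  // xor with the subkey
  even_sequence = _mm256_xor_si256(even_sequence, key);
  odd_sequence = _mm256_xor_si256(odd_sequence, key);
  // rotate left by 7 within 16 bits, then clear the unwanted upper half of each lane
  __m256i hi = _mm256_slli_epi16(even_sequence, 7);
  __m256i lo = _mm256_srli_epi16(even_sequence, 9);
  even_sequence = _mm256_and_si256(_mm256_or_si256(hi, lo), mask16);
  // same for odd
  hi = _mm256_slli_epi16(odd_sequence, 7);
  lo = _mm256_srli_epi16(odd_sequence, 9);
  odd_sequence = _mm256_and_si256(_mm256_or_si256(hi, lo), mask16);
  // Another substitution
  even_sequence = _mm256_i32gather_epi32(lut, even_sequence, 2);
  odd_sequence = _mm256_i32gather_epi32(lut, odd_sequence, 2);
  // recombine -- keep the low half of even, shift odd by 16 and OR
  even_sequence = _mm256_and_si256(even_sequence, mask16);
  odd_sequence = _mm256_slli_epi32(odd_sequence, 16);
  return _mm256_or_si256(even_sequence, odd_sequence);
}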

With optimizations, a decent compiler will generate about one assembler instruction per statement; without optimizations, all the intermediate variables are spilled to the stack so they can be easily debugged.
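
To see why two gathers are needed, it can help to model one 32-bit lane in scalar C: each lane carries two 16-bit elements, which are split into an even and an odd index, processed separately, and recombined with a shift and an OR. This is a minimal sketch, assuming the same lookup / xor / rotate / lookup order as the C loop in the question; `elem_round` and `lane_round` are hypothetical names, and the table is passed in as a parameter so the sketch is self-contained.

```c
#include <stdint.h>

/* One element: lookup, xor with subkey, rotate left by 7, lookup again. */
static inline uint32_t elem_round(uint32_t idx, const uint16_t *table, uint16_t subkey)
{
    uint32_t v = (uint32_t)(table[idx] ^ subkey);
    v = ((v << 7) | (v >> 9)) & 0xffffu;
    return table[v];
}

/* One 32-bit lane: even element in bits 0..15, odd element in bits 16..31. */
static inline uint32_t lane_round(uint32_t lane, const uint16_t *table, uint16_t subkey)
{
    uint32_t even = lane & 0xffffu;  /* matches the AND with 0xffff    */
    uint32_t odd  = lane >> 16;      /* matches the 32-bit right shift */
    even = elem_round(even, table, subkey);
    odd  = elem_round(odd, table, subkey);
    return even | (odd << 16);       /* recombine */
}
```

Applying `lane_round` to each of the 8 lanes of a 256-bit vector gives the value the intrinsic version should produce for that lane.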

2 answers

Answer 1: 4, accepted, 2017-08-13 04:04:32

Answer 2: 2, 2017-08-12 06:32:46
