How to pack 16 16-bit registers/variables on AVX registers

I use inline assembly; my code looks like this:

__m128i inl = _mm256_castsi256_si128(in);
__m128i inh = _mm256_extractf128_si256(in, 1); 
__m128i outl, outh;
__asm__(
    "vmovq %2, %%rax                        \n\t"
    "movzwl %%ax, %%ecx                     \n\t"
    "shr $16, %%rax                         \n\t"
    "movzwl %%ax, %%edx                     \n\t"
    "movzwl s16(%%ecx, %%ecx), %%ecx        \n\t"
    "movzwl s16(%%edx, %%edx), %%edx        \n\t"
    "xorw %4, %%cx                          \n\t"
    "xorw %4, %%dx                          \n\t"
    "rolw $7, %%cx                          \n\t"
    "rolw $7, %%dx                          \n\t"
    "movzwl s16(%%ecx, %%ecx), %%ecx        \n\t"
    "movzwl s16(%%edx, %%edx), %%edx        \n\t"
    "pxor %0, %0                            \n\t"
    "vpinsrw $0, %%ecx, %0, %0              \n\t"
    "vpinsrw $1, %%edx, %0, %0              \n\t"

: "=x" (outl), "=x" (outh)
: "x" (inl), "x" (inh), "r" (subkey)
: "%rax", "%rcx", "%rdx"
);

I omit some of the vpinsrw instructions here; this is just to show the principle. The real code uses 16 vpinsrw operations. But the output doesn't match what's expected.

b0f0 849f 446b 4e4e e553 b53b 44f7 552b 67d  1476 a3c7 ede8 3a1f f26c 6327 bbde
e553 b53b 44f7 552b    0    0    0    0 b4b3 d03e 6d4b c5ba 6680 1440 c688 ea36

The first line is the correct answer, and the second line is my result. The C code is here:

for (i = 0; i < 16; i++)
{
    arr[i] = (u16)(s16[arr[i]] ^ subkey);
    arr[i] = (arr[i] << 7) | (arr[i] >> 9);
    arr[i] = s16[arr[i]];
}

My task is to make this code faster.

In the older code, the data is moved from the ymm register to the stack, and then loaded from the stack into integer registers 16 bits at a time, like this. So I want to move the data directly from the ymm register into integer registers.

__asm__(     

    "vmovdqa %0, -0xb0(%%rbp)               \n\t"

    "movzwl -0xb0(%%rbp), %%ecx             \n\t"
    "movzwl -0xae(%%rbp), %%eax             \n\t"
    "movzwl s16(%%ecx, %%ecx), %%ecx        \n\t"
    "movzwl s16(%%eax, %%eax), %%eax        \n\t"
    "xorw %1, %%cx                          \n\t"
    "xorw %1, %%ax                          \n\t"
    "rolw $7, %%cx                          \n\t"
    "rolw $7, %%ax                          \n\t"
    "movzwl s16(%%ecx, %%ecx), %%ecx        \n\t"
    "movzwl s16(%%eax, %%eax), %%eax        \n\t"
    "movw %%cx, -0xb0(%%rbp)                \n\t"
    "movw %%ax, -0xae(%%rbp)                \n\t"

On Skylake (where gather is fast), it might well be a win to chain two gathers together using Aki's answer. That lets you do the rotate very efficiently using vector-integer stuff.

On Haswell, it might be faster to keep using your scalar code, depending on what the surrounding code looks like. (Or maybe doing the vector rotate+xor with vector code is still a win. Try it and see.)

You have one really bad performance mistake, and a couple other problems:

"pxor %0, %0                            \n\t"
"vpinsrw $0, %%ecx, %0, %0              \n\t"

Using a legacy-SSE pxor to zero the low 128b of %0 while leaving the upper 128b unmodified will cause an SSE-AVX transition penalty on Haswell; about 70 cycles each on the pxor and the first vpinsrw, I think. On Skylake, it will only be slightly slower, and have a false dependency.

Instead, use vmovd %%ecx, %0, which zeros the upper bytes of the vector reg (thus breaking the dependency on the old value).

Actually, use

"vmovd        s16(%%rcx, %%rcx), %0       \n\t"   // leaves garbage in element 1, which you over-write right away
"vpinsrw  $1, s16(%%rdx, %%rdx), %0, %0   \n\t"
...

It's a huge waste of instructions (and uops) to load into integer registers and then go from there into vectors, when you could insert directly into vectors.

Your indices are already zero-extended, so I used 64-bit addressing modes to avoid wasting an address-size prefix on each instruction. (Since your table is static, it's in the low 2G of virtual address space (in the default code-model), so 32-bit addressing did actually work, but it gained you nothing.)

I experimented a while ago with getting scalar LUT results (for GF16 multiply) into vectors, tuning for Intel Sandybridge. I wasn't chaining the LUT lookups like you are, though. See https://github.com/pcordes/par2-asm-experiments . I kind of abandoned it after finding out that GF16 is more efficient with pshufb as a 4-bit LUT, but anyway I found that pinsrw from memory into a vector was good if you don't have gather instructions.
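
For reference, the pshufb-as-4-bit-LUT idea looks roughly like this; a minimal sketch, not code from that repo, and the contents of table16 are up to the caller:

#include <immintrin.h>

// Sketch of the pshufb-as-4-bit-LUT trick: one 16-entry byte lookup per byte,
// 16 lookups in parallel.  table16 holds the 16 LUT bytes.
static inline __m128i lut4_lookup(__m128i nibbles, __m128i table16)
{
    __m128i idx = _mm_and_si128(nibbles, _mm_set1_epi8(0x0f)); // keep bit 7 clear so pshufb doesn't zero the lane
    return _mm_shuffle_epi8(table16, idx);
}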

You might want to get more ILP by interleaving operations on two vectors at once. Or maybe even insert into the low 64b of 4 vectors, and combine with vpunpcklqdq. (vmovd is faster than vpinsrw, so it's pretty much break-even on uop throughput.)
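
A minimal sketch of that combining step (the helper, and the assumption that each input already holds 4 finished 16-bit elements in its low 64 bits, are mine):

#include <immintrin.h>

// Hypothetical combining step: four vectors, each with 4 finished 16-bit results
// in its low 64 bits, merged into one 256-bit vector with vpunpcklqdq + vinserti128.
static inline __m256i combine_low64x4(__m128i a, __m128i b, __m128i c, __m128i d)
{
    __m128i lo = _mm_unpacklo_epi64(a, b);   // elements 0..3 | 4..7
    __m128i hi = _mm_unpacklo_epi64(c, d);   // elements 8..11 | 12..15
    return _mm256_inserti128_si256(_mm256_castsi128_si256(lo), hi, 1);
}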


"xorw %4, %%cx                          \n\t"
"xorw %4, %%dx                          \n\t"

These can and should be xor %[subkey], %%ecx. 32-bit operand-size is more efficient here, and works fine as long as your input doesn't have any bits set in the upper 16. Use a [subkey] "ri" (subkey) constraint to allow an immediate value when it's known at compile-time. (That's probably better, and reduces register pressure slightly, but at the expense of code-size since you use it many times.)
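
For example, a trimmed sketch of just that xor with a named operand (the wrapper function is hypothetical, only there to show the constraint):

// "ri" lets the compiler substitute an immediate when subkey is known at
// compile time; otherwise it picks a register.  idx must already be zero-extended.
static inline unsigned xor_subkey(unsigned idx, unsigned subkey)
{
    __asm__("xor %[subkey], %[idx]"
            : [idx] "+r" (idx)
            : [subkey] "ri" (subkey));
    return idx;
}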

The rolw instructions have to stay 16-bit, though.

You could consider packing two or four values into an integer register (with movzwl s16(...), %%ecx / shl $16, %%ecx / mov s16(...), %cx / shl $16, %%rcx / ...), but then you'd have to emulate the rotates with shifting / or and masking. And unpack again to reuse them as indices.
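
A sketch of what the emulated rotate would cost for two 16-bit values packed in one 32-bit register (plain C; the function name and masks are mine):

#include <stdint.h>

// Rotate two 16-bit values, packed in one 32-bit register, left by 7 using only
// shifts, OR and masks.  The masks stop bits leaking between the two halves.
static inline uint32_t rol7_packed2(uint32_t packed)
{
    uint32_t wrapped = (packed >> 9) & 0x007F007Fu;  // the 7 bits that wrap around, per half
    uint32_t shifted = (packed << 7) & 0xFF80FF80u;  // the 9 bits that move up, per half
    return wrapped | shifted;
}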

It's too bad the integer stuff comes between two LUT lookups, otherwise you could do it in a vector before unpacking.
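
(For comparison, the xor + rotate themselves are cheap in vectors; a sketch, assuming the values were already packed as 16-bit elements:)

#include <stdint.h>
#include <immintrin.h>

// xor with the subkey and rotate each 16-bit element left by 7, entirely in
// vector registers.  Only usable if the values are in a vector at that point.
static inline __m256i xor_rol7_epi16(__m256i v, uint16_t subkey)
{
    v = _mm256_xor_si256(v, _mm256_set1_epi16((short)subkey));
    return _mm256_or_si256(_mm256_slli_epi16(v, 7), _mm256_srli_epi16(v, 9));
}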


Your strategy for extracting 16b chunks of a vector looks pretty good. movq from xmm to a GP register runs on port 0 on Haswell/Skylake, and shr / ror run on port 0 / port 6. So you do compete for ports a bit, but storing the whole vector and reloading it would take more load-port uops.

It might be worth trying a 256b store, but still getting the low 64b with a vmovq so the first 4 elements can get started without as much latency.
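
Something like this on the extraction side (a sketch; the helper and the buffer are mine):

#include <stdint.h>
#include <immintrin.h>

// Spill the whole 256-bit vector once, but also grab the first four 16-bit
// elements with a vmovq so their LUT chain can start before the store/reload
// latency of the other twelve.
static inline uint64_t store_and_peek_low4(__m256i in, uint16_t buf[16])
{
    uint64_t low4 = _mm_cvtsi128_si64(_mm256_castsi256_si128(in)); // vmovq: elements 0..3
    _mm256_storeu_si256((__m256i *)buf, in);                       // elements 4..15 reloaded from buf
    return low4;
}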


As for getting the wrong answer: use a debugger. Debuggers work very well for asm; see the end of the x86 tag wiki for some tips on using GDB.

Look at the compiler-generated code that interfaces between your asm and what the compiler is doing: maybe you got a constraint wrong.

Maybe you got mixed up with %0 or %1 or something. I'd definitely recommend using %[name] instead of operand numbers. See also the inline-assembly tag wiki for links to guides.


C version that avoids inline asm (but gcc wastes instructions on it).

You don't need inline-asm for this at all, unless your compiler is doing a bad job unpacking the vector to 16-bit elements, and not generating the code you want. https://gcc.gnu.org/wiki/DontUseInlineAsm

I put this up on Matt Godbolt's compiler explorer where you can see the asm output.

// This probably compiles to code like your inline asm
#include <x86intrin.h>
#include <stdint.h>

extern const uint16_t s16[];

__m256i LUT_elements(__m256i in)
{
    __m128i inl = _mm256_castsi256_si128(in);
    __m128i inh = _mm256_extractf128_si256(in, 1);

    unsigned subkey = 8;
    uint64_t low4 = _mm_cvtsi128_si64(inl);  // movq extracts the first four 16-bit elements
    unsigned idx = (uint16_t)low4;
    low4 >>= 16;

    idx = s16[idx] ^ subkey;
    idx = __rolw(idx, 7);
    // cast to a 32-bit pointer to convince gcc to movd directly from memory
    // the strict-aliasing violation won't hurt since the table is const.

    __m128i outl = _mm_cvtsi32_si128(*(const uint32_t*)&s16[idx]);

    unsigned idx2 = (uint16_t)low4;
    idx2 = s16[idx2] ^ subkey;
    idx2 = __rolw(idx2, 7);
    outl = _mm_insert_epi16(outl, s16[idx2], 1);

    // ... do the rest of the elements

    __m128i outh = _mm_setzero_si128();  // dummy upper half
    return _mm256_inserti128_si256(_mm256_castsi128_si256(outl), outh, 1);
}

I had to pointer-cast to get a vmovd directly from the LUT into a vector for the first s16[idx]. Without that, gcc uses a movzx load into an integer reg and then a vmovd from there. That avoids any risk of a cache-line split or page-split from doing a 32-bit load, but that risk may be worth it for average throughput since this probably bottlenecks on front-end uop throughput.

Note the use of __rolw from x86intrin.h. gcc supports it, but clang doesn't. It compiles to a 16-bit rotate with no extra instructions.

Unfortunately gcc doesn't realize that the 16-bit rotate keeps the upper bits of the register zeroed, so it does a pointless movzwl %dx, %edx before using %rdx as an index. This is a problem even with gcc7.1 and 8-snapshot.

And BTW, gcc loads the s16 table address into a register, so it can use addressing modes like vmovd (%rcx,%rdx,2), %xmm0 instead of embedding the 4-byte address into every instruction.

Since the extra movzx is the only thing gcc is doing worse than you could do by hand, you might consider just making a rotate-by-7 function in inline asm that gcc thinks takes 32 or 64-bit input registers. Use something like this to get a "half"-sized rotate, i.e. 16 bits:

// pointer-width integers don't need to be re-extended
// but since gcc doesn't understand the asm, it thinks the whole 64-bit result may be non-zero
static inline
uintptr_t my_rolw(uintptr_t a, int count) {
    asm("rolw %b[count], %w[val]" : [val]"+r"(a) : [count]"ic"(count));
    return a;
}

However, even with that, gcc still wants to emit useless movzx or movl instructions. I got rid of some zero-extension by using wider types for idx, but there are still problems. (Source on the compiler explorer.) Having subkey as a function arg instead of a compile-time constant helps, for some reason.

You might be able to get gcc to assume something is a zero-extended 16-bit value with:

if (x > 65535)
    __builtin_unreachable();

Then you could completely drop any inline asm, and just use __rolw .

But beware that icc will compile that to an actual check and then a jump beyond the end of the function. It should work for gcc, but I didn't test.
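
A minimal, untested sketch of that combination (the wrapper name is mine):

#include <x86intrin.h>   // __rolw (gcc)

// Rotate, then promise gcc the result is a zero-extended 16-bit value so it
// can be used directly as a table index without another movzx.
static inline unsigned rolw7_idx(unsigned x)
{
    unsigned r = __rolw((unsigned short)x, 7);
    if (r > 65535)
        __builtin_unreachable();
    return r;
}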

It's pretty reasonable to just write the whole thing in inline asm if it takes this much tweaking to get the compiler not to shoot itself in the foot, though.

The inline assembler slightly resembles the C code, so I would be tempted to assume that the two are meant to be the same.

This is primarily an opinion, but I would suggest using intrinsics instead of extended assembler. Intrinsics let the compiler do register allocation and variable optimization, and they give portability -- each vector operation can be emulated by a function in the absence of the target instruction set.

The next issue is that the inlined source code appears to handle the substitution block arr[i] = s16[arr[i]] for only two indices i. Using AVX2, this should be done with two gather operations (since a Y-register can hold only 8 uint32_t offsets into the lookup table), or, when one is available, the substitution stage should be performed by an analytical function that can be run in parallel.

Using intrinsics, the operation could look something like this.

#include <stdint.h>
#include <immintrin.h>

// LUT is assumed to be the s16 table widened to 32-bit entries, so the
// gathers can use 32-bit indices with a scale of 4.
extern const int LUT[65536];

__m256i function(uint16_t *input_array, uint16_t subkey) {
  __m256i array = _mm256_loadu_si256((__m256i*)input_array);
  // split into even/odd 16-bit elements, zero-extended to 32 bits
  __m256i even_sequence = _mm256_and_si256(array, _mm256_set1_epi32(0xffff));
  __m256i odd_sequence  = _mm256_srli_epi32(array, 16);
  // first substitution: s16[arr[i]]
  even_sequence = _mm256_i32gather_epi32(LUT, even_sequence, 4);
  odd_sequence  = _mm256_i32gather_epi32(LUT, odd_sequence, 4);
  // xor with the subkey, matching s16[arr[i]] ^ subkey in the scalar code
  even_sequence = _mm256_xor_si256(even_sequence, _mm256_set1_epi32(subkey));
  odd_sequence  = _mm256_xor_si256(odd_sequence, _mm256_set1_epi32(subkey));
  // rotate each 16-bit value left by 7; the high half of each 32-bit lane stays zero
  __m256i hi = _mm256_slli_epi16(even_sequence, 7);
  __m256i lo = _mm256_srli_epi16(even_sequence, 9);
  even_sequence = _mm256_or_si256(hi, lo);
  // same for odd
  hi = _mm256_slli_epi16(odd_sequence, 7);
  lo = _mm256_srli_epi16(odd_sequence, 9);
  odd_sequence = _mm256_or_si256(hi, lo);
  // another substitution
  even_sequence = _mm256_i32gather_epi32(LUT, even_sequence, 4);
  odd_sequence  = _mm256_i32gather_epi32(LUT, odd_sequence, 4);
  // recombine -- shift odd back up by 16 and OR
  odd_sequence = _mm256_slli_epi32(odd_sequence, 16);
  return _mm256_or_si256(even_sequence, odd_sequence);
}

With optimizations, a decent compiler will generate about one assembly instruction per statement; without optimizations, all the intermediate variables are spilled to the stack, which makes them easy to debug.
