使用SIMD指令执行任意128/256/512位置换的最快方法是什么？

Question

I want to perform an arbitrary permutation of single bits, pairs of bits, and nibbles (4 bits) on a CPU register (xmm, ymm or zmm) of width 128, 256 or 512 bits; 我想在宽度为128,256或512位的CPU寄存器（xmm，ymm或zmm）上执行单个位，位对和半字节（4位）的任意置换; this should be as fast as possible. 这应该尽可能快。 For this I was looking into SIMD instructions. 为此，我正在研究SIMD指令。 Does anyone know of a way to do this/a library that implements it? 有没有人知道这样做的方法/实现它的库？ I'm using MSVC on Windows and GCC on Linux, and the host language is C or C++. 我在Windows上使用MSVC，在Linux上使用GCC，主机语言是C或C ++。 Thanks! 谢谢！

I'm given an arbitrary permutation and need to shuffle a large number of bit vectors/pairs of bit vectors/nibbles. 我给了一个任意的排列，需要改组大量的位向量/对位向量/半字节。 I know how to do this for the bits within a 64 bit value, eg using a Benes network . 我知道如何为64位值内的位执行此操作，例如使用Benes网络。

Or shuffling blocks of 8-bit and larger around on the wider SIMD registers, eg using Agner Fog's GPLed VectorClass library ( https://www.agner.org/optimize/vectorclass.pdf ) for a template metaprogramming function that builds shuffles out of AVX2 in-lane byte shuffles and/or larger-element lane-crossing shuffles, given the shuffle as template parameter. 或者在更广泛的SIMD寄存器上改组8位及更大的块，例如使用Agner Fog的GPLed VectorClass库（ https://www.agner.org/optimize/vectorclass.pdf ）获取模板元编程函数，该函数可以构建shuffle在给予shuffle作为模板参数的情况下，AVX2通道内字节混洗和/或大元素通道混洗。

A more granular subdivision for permutations - into 1, 2 or 4 bit blocks - seems to be hard to achieve across wide vectors, though. 然而，对于置换的更细粒度细分 - 分为1,2或4位块 - 似乎很难在宽向量上实现。

I'm able to do pre-processing on the permutation, eg to extract bit masks, calculate indices as necessary eg for a Benes network, or whatever else - happy to do that in another high level language as well, so assume that the permutation is given in whatever format is most convenient to solve the problem; 我能够对排列进行预处理，例如提取位掩码，根据需要计算索引，例如Benes网络，或者其他任何东西 - 很高兴在另一种高级语言中这样做，所以假设排列以最方便解决问题的格式给出; small-ish lookup tables included. 包括小型查找表。

I would expect the code to be significantly faster than doing something like 我希望代码比做类似的事情要快得多

// actually 1 bit per element, not byte.  I want a 256-bit bit-shuffle
const uint8_t in[256] = get_some_vector(); // not a compile-time constant
const uint8_t perm[256] = ...;             // compile-time constant
uint8_t out[256];
for (size_t i = 0; i < 256; i ++)
    out[i] = in[perm[i]];

As I said, I have a solution for <= 64 bits (which would be 64 bits, 32 bit-pairs, and 16 nibbles). 正如我所说的，我有一个<= 64位（64位，32位对和16位半字节）的解决方案。 The problem is also solved for blocks of size 8, 16, 32 etc. on wider SIMD registers. 在较宽的SIMD寄存器上，对于大小为8,16,32等的块也解决了该问题。

EDIT: to clarify, the permutation is a compile-time constant (but not just one particular one, I'll compile the program once per permutation given). 编辑：澄清一下，置换是一个编译时常量（但不仅仅是一个特定的，我将按照给定的排列编译程序一次）。

Answer 1

The AVX2 256 bit permutation case AVX2 256位置换情况

I do not think it is possible to write an efficient generic SSE4/AVX2/AVX-512 algorithm that works for all vector sizes (128, 256, 512 bits), and element granularities (bits, bit pairs, nibbles, bytes). 我认为不可能编写一个有效的通用SSE4 / AVX2 / AVX-512算法，该算法适用于所有矢量大小（128,256,512位）和元素粒度（位，位对，半字节，字节）。 One problem is that many AVX2 instructions that exist for, for example, byte size elements, do not exist for double word elements, and vice versa. 一个问题是，对于双字元素，存在用于例如字节大小元素的许多AVX2指令，反之亦然。

Below the AVX2 256 bit permutation case is discussed. 下面讨论AVX2 256位置换情况。 It might be possible to recycle the ideas of this case for other cases. 对于其他案例，有可能回收本案例的想法。

The idea is to extract 32 (permuted) bits per step from input vector x . 想法是从输入向量x每步提取32（置换）比特。 In each step 32 bytes from permutation vector pos are read. 在每个步骤中，读取来自置换矢量pos 32个字节。 Bits 7..3 of these pos bytes determine which byte from x is needed. 这些pos字节的位7..3确定需要x哪个字节。 The right byte is selected by an emulated 256 bits wide AVX2 lane crossing byte shuffle coded here by Ermlg . 通过Ermlg在此处编码的模拟256位宽AVX2通道交叉字节混洗选择右字节。 Bits 2..0 of the pos bytes determine which bit is sought. 位的2..0 pos字节确定寻求哪些位。 With _mm256_movemask_epi8 the 32 bits are collected in one _uint32_t This step is repeated 8 times, to get all the 256 permuted bits. 使用_mm256_movemask_epi8位被收集在一个_uint32_t此步骤重复8次，以获得所有256个置换位。

The code does not look very elegant. 代码看起来不是很优雅。 Nevertheless, I would be surprised if a significantly faster, say two times faster, AVX2 method would exist. 然而，如果存在明显更快，比如说快两倍的AVX2方法，我会感到惊讶。

/*     gcc -O3 -m64 -Wall -mavx2 -march=skylake bitperm_avx2.c     */
#include <immintrin.h>
#include <stdio.h>
#include <stdint.h>

inline __m256i shuf_epi8_lc(__m256i value, __m256i shuffle);
int print_epi64(__m256i  a);

uint32_t get_32_bits(__m256i x, __m256i pos){
    __m256i pshufb_mask  = _mm256_set_epi8(0,0,0,0, 0,0,0,0, 128,64,32,16, 8,4,2,1, 0,0,0,0, 0,0,0,0, 128,64,32,16, 8,4,2,1);
    __m256i byte_pos     = _mm256_srli_epi32(pos, 3);                       /* which byte within the 32 bytes    */
            byte_pos     = _mm256_and_si256(byte_pos, _mm256_set1_epi8(0x1F)); /* mask off the unwanted bits */
    __m256i bit_pos      = _mm256_and_si256(pos, _mm256_set1_epi8(0x07));   /* which bit within the byte         */
    __m256i bit_pos_mask = _mm256_shuffle_epi8(pshufb_mask, bit_pos);       /* get bit mask                      */
    __m256i bytes_wanted = shuf_epi8_lc(x, byte_pos);                       /* get the right bytes               */
    __m256i bits_wanted  = _mm256_and_si256(bit_pos_mask, bytes_wanted);    /* apply the bit mask to get rid of the unwanted bits within the byte */
    __m256i bits_x8      = _mm256_cmpeq_epi8(bits_wanted, bit_pos_mask);    /* check if the bit is set           */        
            return _mm256_movemask_epi8(bits_x8);
}

__m256i get_256_bits(__m256i x, uint8_t* pos){ /* glue the 32 bit results together */
    uint64_t t0 = get_32_bits(x, _mm256_loadu_si256((__m256i*)&pos[0]));
    uint64_t t1 = get_32_bits(x, _mm256_loadu_si256((__m256i*)&pos[32]));
    uint64_t t2 = get_32_bits(x, _mm256_loadu_si256((__m256i*)&pos[64]));
    uint64_t t3 = get_32_bits(x, _mm256_loadu_si256((__m256i*)&pos[96]));
    uint64_t t4 = get_32_bits(x, _mm256_loadu_si256((__m256i*)&pos[128]));
    uint64_t t5 = get_32_bits(x, _mm256_loadu_si256((__m256i*)&pos[160]));
    uint64_t t6 = get_32_bits(x, _mm256_loadu_si256((__m256i*)&pos[192]));
    uint64_t t7 = get_32_bits(x, _mm256_loadu_si256((__m256i*)&pos[224]));
    uint64_t t10 = (t1<<32)|t0;
    uint64_t t32 = (t3<<32)|t2;
    uint64_t t54 = (t5<<32)|t4;
    uint64_t t76 = (t7<<32)|t6;
    return(_mm256_set_epi64x(t76, t54, t32, t10));
}


inline __m256i shuf_epi8_lc(__m256i value, __m256i shuffle){
/* Ermlg's lane crossing byte shuffle https://stackoverflow.com/a/30669632/2439725 */
const __m256i K0 = _mm256_setr_epi8(
    0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70,
    0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0);
const __m256i K1 = _mm256_setr_epi8(
    0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0,
    0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70);
return _mm256_or_si256(_mm256_shuffle_epi8(value, _mm256_add_epi8(shuffle, K0)), 
    _mm256_shuffle_epi8(_mm256_permute4x64_epi64(value, 0x4E), _mm256_add_epi8(shuffle, K1)));
}


int main(){
    __m256i    input = _mm256_set_epi16(0x1234,0x9876,0x7890,0xABCD, 0x3456,0x7654,0x0123,0x4567,
                                        0x0123,0x4567,0x89AB,0xCDEF, 0xFEDC,0xBA98,0x7654,0x3210);
/* Example                                                                                         */
/*            240  224  208  192    176  160  144  128    112   96   80   64     48   32   16    0 */                        
/* input     1234 9876 7890 ABCD | 3456 7654 0123 4567 | 0123 4567 89AB CDEF | FEDC BA98 7654 3210 */
/* output    0000 0000 0012 00FF | 90AB 3210 7654 ABCD | 8712 1200 FF90 AB32 | 7654 ABCD 1087 7654 */
    uint8_t permutation[256] = {16,17,18,19,     20,21,22,23,      24,25,26,27,     28,29,30,31,
                                28,29,30,31,     32,33,34,35,      0,1,2,3,         4,5,6,7,
                                72,73,74,75,     76,77,78,79,      80,81,82,83,     84,85,86,87,      
                                160,161,162,163, 164,165,166,167,  168,169,170,171, 172,173,174,175,  
                                8,9,10,11,       12,13,14,15,      200,201,202,203, 204,205,206,207,
                                208,209,210,211, 212,213,214,215,  215,215,215,215, 215,215,215,215,
                                1,1,1,1,         1,1,1,1,          248,249,250,251, 252,253,254,255,
                                248,249,250,251, 252,253,254,255,  28,29,30,31,     32,33,34,35,
                                72,73,74,75,     76,77,78,79,      80,81,82,83,     84,85,86,87,
                                160,161,162,163, 164,165,166,167,  168,169,170,171, 172,173,174,175,
                                0,1,2,3,         4,5,6,7,          8,9,10,11,       12,13,14,15,
                                200,201,202,203, 204,205,206,207,  208,209,210,211, 212,213,214,215,
                                215,215,215,215, 215,215,215,215,  1,1,1,1,         1,1,1,1,
                                248,249,250,251, 252,253,254,255,  1,1,1,1,         1,1,1,1,
                                1,1,1,1,         1,1,1,1,          1,1,1,1,         1,1,1,1,
                                1,1,1,1,         1,1,1,1,          1,1,1,1,         1,1,1,1};
               printf("input = \n");
               print_epi64(input);
    __m256i    x = get_256_bits(input, permutation);
               printf("permuted input = \n");
               print_epi64(x);
               return 0;
}


int print_epi64(__m256i  a){
    uint64_t  v[4];
    int i;
    _mm256_storeu_si256((__m256i*)v,a);
    for (i = 3; i>=0; i--) printf("%016lX  ",v[i]);
    printf("\n");
    return 0;
}

The output with the example permutation looks correct: 带有示例排列的输出看起来是正确的：

$ ./a.out
input = 
123498767890ABCD  3456765401234567  0123456789ABCDEF  FEDCBA9876543210  
permuted input = 
00000000001200FF  90AB32107654ABCD  87121200FF90AB32  7654ABCD10877654

Efficiency 效率

If you look carefully at the algorithm, you will see that some operations only depend on the permutation vector pos , and not on x . 如果仔细查看算法，您会发现某些操作仅取决于置换向量pos ，而不取决于x 。 This means that the applying the permutation with a variable x , and a fixed pos , should be more efficient than applying the permutation with both variable x and pos . 这意味着应用具有变量x和固定pos的置换应该比使用变量x和pos应用置换更有效。

This is illustrated by the following code: 这由以下代码说明：

/* apply the same permutation several times */
int perm_array(__m256i* restrict x_in, uint8_t* restrict pos, __m256i* restrict x_out){
    for (int i = 0; i<1024; i++){
            x_out[i]=get_256_bits(x_in[i], pos);
    }
    return 0;
}

With clang and gcc this compiles to really nice code : Loop .L5 at line 237 only contains 16 vpshufb s instead of 24. Moreover the vpaddb s are hoisted out of the loop. 使用clang和gcc，这会编译成非常好的代码：第237行的循环.L5只包含16个vpshufb而不是24个。此外， vpaddb被提升出循环。 Note that there is also only one vpermq inside the loop. 请注意，循环内只有一个vpermq 。

I do not know if MSVC will hoist such many instructions outside the loop. 我不知道MSVC是否会在循环外提升这么多指令。 If not, it might be possible to improve the performance of the loop by modifying the code manually. 如果不是，则可以通过手动修改代码来提高循环的性能。 This should be done such that the operations which only depend on pos , and not on x , are hoisted outside the loop. 应该这样做，使得仅依赖于pos而不是x在循环外被提升。

With respect to the performance on Intel Skylake: The throughput of this loop is likely limited by the about 32 port 5 micro-ops per loop iteration. 关于英特尔Skylake的性能：该环路的吞吐量可能受到每个环路迭代约32个端口5个微操作的限制。 This means that the throughput in a loop context such as perm_array is about 256 permuted bits per 32 CPU cycles, or about 8 permuted bits per CPU cycle. 这意味着循环上下文（例如perm_array ）中的吞吐量大约是每32个CPU周期256个置换位，或者每个CPU周期大约8个置换位。

128 bit permutations using AVX2 instructions 使用AVX2指令的128位排列

This code is quite similar to the 256 bit permutation case. 此代码与256位置换情况非常相似。 Although only 128 bits are permuted, the full 256 bit width of the AVX2 registers is used to achieve the best performance. 虽然只有128位置换，但AVX2寄存器的完整256位宽度用于实现最佳性能。 Here the byte shuffles are not emulated. 这里不模拟字节混洗。 This is because there exists an efficient single instruction to do the byte shuffling within the 128 bit lanes: vpshufb . 这是因为存在一条有效的单指令来在128位通道内进行字节混洗： vpshufb 。

Function perm_array_128 tests the performance of the bit permutation for a fixed permutation and a variable input x . 函数perm_array_128测试固定置换和变量输入x的位置换的性能。 The assembly loop contains about 11 port 5 (p5) micro-ops, if we assume an Intel Skylake CPU. 如果我们假设Intel Skylake CPU，汇编循环包含大约11个端口5（p5）微操作。 These 11 p5 micro-ops take at least 11 CPU cycles (throughput). 这11个p5微操作至少需要11个CPU周期（吞吐量）。 So, in the best case we get a throughput of about 12 permuted bits per cycle, which is about 1.5 times as fast as the 256 bit permutation case. 因此，在最好的情况下，我们获得每周期约12个置换位的吞吐量，这大约是256位置换情况的1.5倍。

/*     gcc -O3 -m64 -Wall -mavx2 -march=skylake bitperm128_avx2.c     */
#include <immintrin.h>
#include <stdio.h>
#include <stdint.h>

int print128_epi64(__m128i  a);

uint32_t get_32_128_bits(__m256i x, __m256i pos){                           /* extract 32 permuted bits out from 2x128 bits   */
    __m256i pshufb_mask  = _mm256_set_epi8(0,0,0,0, 0,0,0,0, 128,64,32,16, 8,4,2,1, 0,0,0,0, 0,0,0,0, 128,64,32,16, 8,4,2,1);
    __m256i byte_pos     = _mm256_srli_epi32(pos, 3);                       /* which byte do we need within the 16 byte lanes. bits 6,5,4,3 select the right byte */
            byte_pos     = _mm256_and_si256(byte_pos, _mm256_set1_epi8(0xF)); /* mask off the unwanted bits (unnecessary if _mm256_srli_epi8 would have existed   */
    __m256i bit_pos      = _mm256_and_si256(pos, _mm256_set1_epi8(0x07));   /* which bit within the byte                 */
    __m256i bit_pos_mask = _mm256_shuffle_epi8(pshufb_mask, bit_pos);       /* get bit mask                              */
    __m256i bytes_wanted = _mm256_shuffle_epi8(x, byte_pos);                /* get the right bytes                       */
    __m256i bits_wanted  = _mm256_and_si256(bit_pos_mask, bytes_wanted);    /* apply the bit mask to get rid of the unwanted bits within the byte */
    __m256i bits_x8      = _mm256_cmpeq_epi8(bits_wanted, bit_pos_mask);    /* set all bits if the wanted bit is set     */        
            return _mm256_movemask_epi8(bits_x8);                           /* move most significant bit of each byte to 32 bit register */
}


__m128i permute_128_bits(__m128i x, uint8_t* pos){      /* get bit permutations in 32 bit pieces and glue them together */
    __m256i  x2 = _mm256_broadcastsi128_si256(x);   /* broadcast x to the hi and lo lane                            */
    uint64_t t0 = get_32_128_bits(x2, _mm256_loadu_si256((__m256i*)&pos[0]));
    uint64_t t1 = get_32_128_bits(x2, _mm256_loadu_si256((__m256i*)&pos[32]));
    uint64_t t2 = get_32_128_bits(x2, _mm256_loadu_si256((__m256i*)&pos[64]));
    uint64_t t3 = get_32_128_bits(x2, _mm256_loadu_si256((__m256i*)&pos[96]));
    uint64_t t10 = (t1<<32)|t0;
    uint64_t t32 = (t3<<32)|t2;
    return(_mm_set_epi64x(t32, t10));
}

/* Test loop performance with the following loop (see assembly) -> 11 port5 uops inside the critical loop */
/* Use gcc -O3 -m64 -Wall -mavx2 -march=skylake -S bitperm128_avx2.c to generate the assembly             */
int perm_array_128(__m128i* restrict x_in, uint8_t* restrict pos, __m128i* restrict x_out){
    for (int i = 0; i<1024; i++){
            x_out[i]=permute_128_bits(x_in[i], pos);
    }
    return 0;
}


int main(){
    __m128i    input = _mm_set_epi16(0x0123,0x4567,0xFEDC,0xBA98,  0x7654,0x3210,0x89AB,0xCDEF);
/* Example                                                                                         */
/*             112   96   80   64     48   32   16    0 */                        
/* input      0123 4567 FEDC BA98   7654 3210 89AB CDEF */
/* output     8FFF CDEF DCBA 08EF   CDFF DCBA EFF0 89AB */
    uint8_t permutation[128] = {16,17,18,19,     20,21,22,23,      24,25,26,27,     28,29,30,31,
                                32,32,32,32,     36,36,36,36,      0,1,2,3,         4,5,6,7,
                                72,73,74,75,     76,77,78,79,      80,81,82,83,     84,85,86,87,      
                                0,0,0,0,         0,0,0,0,          8,9,10,11,       12,13,14,15,      
                                0,1,2,3,         4,5,6,7,          28,29,30,31,     32,33,34,35,
                                72,73,74,75,     76,77,78,79,      80,81,82,83,     84,85,86,87,
                                0,1,2,3,         4,5,6,7,          8,9,10,11,       12,13,14,15,
                                1,1,1,1,         1,1,1,1,          1,1,1,1,         32,32,32,1};
               printf("input = \n");
               print128_epi64(input);
    __m128i    x = permute_128_bits(input, permutation);
               printf("permuted input = \n");
               print128_epi64(x);
               return 0;
}


int print128_epi64(__m128i  a){
  uint64_t  v[2];
  int i;
  _mm_storeu_si128((__m128i*)v,a);
  for (i = 1; i>=0; i--) printf("%016lX  ",v[i]);
  printf("\n");
  return 0;
}

Example output for some arbitrary permutation: 某些任意排列的示例输出：

$ ./a.out
input = 
01234567FEDCBA98  7654321089ABCDEF  
permuted input = 
8FFFCDEFDCBA08EF  CDFFDCBAEFF089AB

使用SIMD指令执行任意128/256/512位置换的最快方法是什么？

问题描述

1 个解决方案

解决方案1
5 已采纳 2019-01-29 12:20:28

使用SIMD指令执行任意128/256/512位置换的最快方法是什么？

问题描述

1 个解决方案

解决方案1 5 已采纳 2019-01-29 12:20:28

解决方案1
5 已采纳 2019-01-29 12:20:28