Using Intrinsics to Extract And Shift Odd/Even Bits

Is there a way to optimize the following code using intrinsics? It takes all the odd-indexed bits in a 16-bit integer and shifts them as far right as possible.

I was thinking maybe using the C++ equivalent of ISHFTC from Fortran (is there even a C++ equivalent of that?). But I feel that there is a more efficient way.

int x = some16bitInt;
x = x&0x5555;
int y = 0;
for (int i = 0; i < 8; i++)
    y = y | ((x >> i) & (0x01 << i));

Fortran's ISHFTC is just a rotate. C++ doesn't directly have this, but you can portably and safely write a function that compilers will pattern-recognize and compile to a single rotate instruction; see Best practices for circular shift (rotate) operations in C++.

I'm not sure that's a useful building block, but it is available.
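
For completeness, here's a hedged sketch of that portable rotate idiom (the function name is mine; C++20 also provides std::rotl in <bit>). It isn't needed for the solutions below, it's just the ISHFTC equivalent:

#include <cstdint>

// Portable rotate-left of a 16-bit value.  The count is masked so both shift
// amounts stay in range (no UB), which lets compilers recognize the pattern
// and emit a single rotate instruction where one exists.
uint16_t rotl16(uint16_t x, unsigned n) {
    n &= 15;                                             // count modulo the width
    return (uint16_t)((x << n) | (x >> ((16 - n) & 15)));
}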


On x86 with the BMI2 instruction-set extension, there's a pext bit-extract instruction, which you could use with a 0x5555 control input. See Intel's docs for _pext_u32 / _pext_u64.

It's very fast on Intel Haswell and later (1 uop, 3-cycle latency, 1/clock throughput), but pretty slow on AMD Ryzen (7 uops, 18-cycle latency and throughput; see https://agner.org/optimize/). I think that's worse than the shift/mask version I came up with using pure C, especially if latency matters (not just throughput).

#include <immintrin.h>

unsigned extract_even_bits_bmi2(unsigned a) {
   return _pext_u32(a, 0x5555);
}

With GCC / clang, you have to compile with -mbmi2 (or better, -march=haswell) to enable use of BMI2 intrinsics.
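
If you can't require BMI2 at build time, here is a hedged sketch of run-time dispatch. It relies on GNU extensions (__builtin_cpu_supports and the target attribute, available in recent GCC and clang), and pack_even_bits16_portable is just a placeholder name for one of the pure-C++ versions further down:

#include <immintrin.h>

unsigned pack_even_bits16_portable(unsigned x);    // placeholder: any fallback version

__attribute__((target("bmi2")))                    // lets this function use _pext_u32 without -mbmi2 globally
static unsigned extract_even_bits_pext(unsigned a) {
    return _pext_u32(a, 0x5555);
}

unsigned extract_even_bits_dispatch(unsigned a) {
    if (__builtin_cpu_supports("bmi2"))            // checked per call; hoist/cache it in hot code
        return extract_even_bits_pext(a);
    return pack_even_bits16_portable(a);
}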


Portable ISO C++

I don't think the usual multiply tricks (to get multiple input bytes shifted and added into the top byte of a result) will work here; you have too many bits and they're too close together. See How to count the number of set bits in a 32-bit integer? for a use-case: ((n & 0x0F0F0F0F) * 0x01010101) >> 24 horizontally adds all the bytes of n.
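
For reference, here's that linked trick as a function (the name is mine). It works there because the fields are a whole byte apart, so the partial sums from the multiply can't carry into each other:

// Multiplying by 0x01010101 adds every byte of the masked value into the top
// byte.  The 0x0F0F0F0F mask keeps each byte-sized field <= 15, so the total
// (at most 60) can't overflow out of the top byte.
unsigned horizontal_add_low_nibbles(unsigned n) {
    return ((n & 0x0F0F0F0F) * 0x01010101) >> 24;
}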

You could imagine using something like that on your input with * 0x08040201 to align the bits from different bytes differently. But that still leaves major unsolved problems. Perhaps SIMD multiply with 8-bit elements to get pairs of bits shifted together?

But that's not better than moving bits around by masking, shifting, and ORing or ADDing the moved bits with the not-moving bits. With about log2(n_bits) steps, we can get all the bits contiguous.
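
For illustration (my addition, not necessarily one of the versions linked below), here is the classic log2-step compress for this particular mask, combining neighbours, then nibbles, then bytes:

unsigned pack_even_bits16_logstep(unsigned x)
{
    x &= 0x5555;                    // 0a0b0c0d 0e0f0g0h
    x = (x | (x >> 1)) & 0x3333;    // 00ab00cd 00ef00gh
    x = (x | (x >> 2)) & 0x0F0F;    // 0000abcd 0000efgh
    x = (x | (x >> 4)) & 0x00FF;    // 00000000 abcdefgh
    return x;
}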

There are multiple ways to do this (see them on Godbolt). There's room for improvement, like tweaking it to compile better for one ISA vs. another, e.g. helping some ARM compilers see that 0b0000011000000110 is just the other constant right-shifted by 4, so it can do and r0, r1, r2, lsr #4 or something.

Or shifting bits to the right instead of left, for ISAs that can't do anything special for left shifts.

unsigned pack_even_bits16_v2(unsigned x)
{
    x &= 0x5555;        // 0a0b0c0d0e0f0g0h
    x += x<<1;          // aabbccddeeffgghh    // x86 LEA eax, [rdi + rdi*2]
    unsigned move = x &  0b0000011000000110;   // bits to move
    unsigned keep = x &  0b0110000001100000;   // bits to keep
    x = keep + (move << 2);  // 0abcd000 0efgh000

                       // 0abcd000 0efgh000    // with byte boundary shown
    unsigned tmp = x >> 7;  // high group into place, shifting out the low bits
    x &= 0xFF;    // grab the whole low byte ; possibly with a zero-latency movzx
    x = (x>>3) | tmp;
    return x;
}

I'm shifting low bits left instead of shifting high bits right because x86 can left-shift-and-add with one instruction, LEA. On other ISAs it would probably save one shift at the end to move bits to the right.
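
As a quick sanity check (my addition; it assumes pack_even_bits16_v2 from above is in the same file), an exhaustive comparison against the question's reference loop:

#include <cassert>
#include <cstdio>

unsigned pack_even_bits16_ref(unsigned x) {     // the question's loop, as a function
    x &= 0x5555;
    unsigned y = 0;
    for (int i = 0; i < 8; i++)
        y |= (x >> i) & (0x01u << i);
    return y;
}

int main() {
    for (unsigned x = 0; x <= 0xFFFF; x++)
        assert(pack_even_bits16_v2(x) == pack_even_bits16_ref(x));
    std::puts("all 65536 inputs match");
}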

pack_even_bits16_v2 compiles pretty nicely for AArch64 and PowerPC64 as well as x86. Clang sees through this bit manipulation for PowerPC and uses the powerful rlwinm (Rotate Left Word Immediate AND Mask) and rlwimi (... Mask Insert) instructions :)

# clang trunk -O3 for PowerPC64.
# Compiling the  x += x & 0x1111;  version, not the  x += x<<1 version where we get a multiply
        andi. 4, 3, 21845        # x & 0x5555
        andi. 3, 3, 4369         # x & 0x1111
        add 4, 4, 3              # 
        rlwinm 3, 4, 31, 30, 31  # isolate the low 2 bits.  PPC counts bits from MSB=0 LSB=31 for 32-bit registers
        rlwimi 3, 4, 29, 28, 29  # insert the next 2-bit bitfield
        rlwimi 3, 4, 27, 26, 27  # ...
        rlwimi 3, 4, 25, 24, 25
        blr

It would have been better to combine pairs (a balanced tree of inserts) instead of forming one long serial dependency chain.


Another way to move bits is to grab them with AND, then either shift them left by adding them back in (twice, for a shift by 2), or zero the selected bits with XOR and deposit them somewhere else with a shift and add:

    unsigned tmp = x & mask;
    x += tmp;          // left shift those bits by 1: the add carries each isolated bit up one place
    x += tmp<<1;       // left shift them again.  (x86 can do this with LEA eax, [rax + rdx*2])

or

    unsigned tmp = x &   0b0000011000000110;   // bits to move
    x ^= tmp;          // clear those bits
    x += tmp << 2;     // LEA eax, [eax + edx*4]  1 fast instruction on x86

When only moving by 2 positions, add + shift-and-add is basically the same length of dependency chain as xor + shift-and-add.

But clearing the old bits conditionally (with a data dependency on tmp) instead of AND-ing with the opposite mask is probably worse, at least if the opposite mask fits in an immediate, or if the ISA has an AND-NOT instruction, or (for ARM) a shifted mask. AND-ing the old x two ways can run in parallel, vs. tmp = x & mask; x ^= tmp serializing execution with a data dependency if it compiles as written. (It doesn't; gcc and clang are smart enough to know what XOR does and unconditionally clear those bits.)
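
To make that dependency-chain point concrete, here is the parallel-AND form pulled out of pack_even_bits16_v2 as its own illustrative function: both ANDs read only the original x, so they can execute in the same cycle, and the merge is one shift-and-add (one LEA on x86).

unsigned move_field_parallel(unsigned x) {
    unsigned move = x & 0b0000011000000110u;    // bits to move  (depends only on x)
    unsigned keep = x & 0b0110000001100000u;    // bits to keep  (also depends only on x)
    return keep + (move << 2);                  // merge, shifting the moved bits left by 2
}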

Sure, here's how:

int y = (int)_pext_u32( (unsigned int)some16bitInt, 0x5555 );

Unfortunately for you, this instruction is from the BMI2 set and requires a relatively recent CPU: Intel Haswell or newer, or AMD Excavator or newer. But where it's supported, it's very fast.

The most flexible bit manipulation on x86 (indeed, on almost any CPU) is an indexed read from memory. It can do completely arbitrary mappings in constant time, typically 1-4 cycles (assuming the memory is cached).

Since you're only talking about 8 bits, and you can easily put the bits you want into the lower 8 bits of a register, albeit in the wrong order, you can just use a lookup table.

unsigned pack_even_bits16_table(unsigned x) {  // x = ?a?b?c?d ?e?f?g?h
  size_t m1 = x & 0x55;         //  m1 = 0e0f0g0h
  size_t m2 = (x >> 7) & 0xAA;  //  m2 = a0b0c0d0
  return map[m1 + m2];          // sum = aebfcgdh
}

where the map is

const unsigned char map[256] = {
    0,   1,   16,  17,  2,   3,   18,  19,  32,  33,  48,  49,  34,  35,  50,  51,
    4,   5,   20,  21,  6,   7,   22,  23,  36,  37,  52,  53,  38,  39,  54,  55,
    64,  65,  80,  81,  66,  67,  82,  83,  96,  97,  112, 113, 98,  99,  114, 115,
    68,  69,  84,  85,  70,  71,  86,  87,  100, 101, 116, 117, 102, 103, 118, 119,
    8,   9,   24,  25,  10,  11,  26,  27,  40,  41,  56,  57,  42,  43,  58,  59,
    12,  13,  28,  29,  14,  15,  30,  31,  44,  45,  60,  61,  46,  47,  62,  63,
    72,  73,  88,  89,  74,  75,  90,  91,  104, 105, 120, 121, 106, 107, 122, 123,
    76,  77,  92,  93,  78,  79,  94,  95,  108, 109, 124, 125, 110, 111, 126, 127,
    128, 129, 144, 145, 130, 131, 146, 147, 160, 161, 176, 177, 162, 163, 178, 179,
    132, 133, 148, 149, 134, 135, 150, 151, 164, 165, 180, 181, 166, 167, 182, 183,
    192, 193, 208, 209, 194, 195, 210, 211, 224, 225, 240, 241, 226, 227, 242, 243,
    196, 197, 212, 213, 198, 199, 214, 215, 228, 229, 244, 245, 230, 231, 246, 247,
    136, 137, 152, 153, 138, 139, 154, 155, 168, 169, 184, 185, 170, 171, 186, 187,
    140, 141, 156, 157, 142, 143, 158, 159, 172, 173, 188, 189, 174, 175, 190, 191,
    200, 201, 216, 217, 202, 203, 218, 219, 232, 233, 248, 249, 234, 235, 250, 251,
    204, 205, 220, 221, 206, 207, 222, 223, 236, 237, 252, 253, 238, 239, 254, 255,
};
