简体   繁体   中英

Bit hack: Expanding bits

I am trying to convert a uint16_t input to a uint32_t bit mask. One bit in the input toggles two bits in the output bit mask. Here is an example converting a 4-bit input to an 8-bit bit mask:

Input    Output
ABCDb -> AABB CCDDb

A,B,C,D are individual bits

Example outputs:

0000b -> 0000 0000b
0001b -> 0000 0011b
0010b -> 0000 1100b
0011b -> 0000 1111b
....
1100b -> 1111 0000b
1101b -> 1111 0011b
1110b -> 1111 1100b
1111b -> 1111 1111b

Is there a bithack-y way to achieve this behavior?

Interleaving bits by Binary Magic Numbers contained the clue:

uint32_t expand_bits(uint16_t bits)
{
    uint32_t x = bits;

    x = (x | (x << 8)) & 0x00FF00FF;
    x = (x | (x << 4)) & 0x0F0F0F0F;
    x = (x | (x << 2)) & 0x33333333;
    x = (x | (x << 1)) & 0x55555555;

    return x | (x << 1);
}

The first four steps consecutively interleave the source bits in groups of 8, 4, 2, 1 bits with zero bits, resulting in 00AB00CD after the first step, 0A0B0C0D after the second step, and so on. The last step then duplicates each even bit (containing an original source bit) into the neighboring odd bit, thereby achieving the desired bit arrangement.

A number of variants are possible. The last step can also be coded as x + (x << 1) or 3 * x . The | operators in the first four steps can be replaced by ^ operators. The masks can also be modified as some bits are naturally zero and don't need to be cleared. On some processors short masks may be incorporated into machine instructions as immediates, reducing the effort for constructing and / or loading the mask constants. It may also be advantageous to increase instruction-level parallelism for out-of-order processors and optimize for those with shift-add or integer-multiply-add instructions. One code variant incorporating various of these ideas is:

uint32_t expand_bits (uint16_t bits)
{
    uint32_t x = bits;

    x = (x ^ (x << 8)) & ~0x0000FF00;
    x = (x ^ (x << 4)) & ~0x00F000F0;
    x = x ^ (x << 2);
    x = ((x & 0x22222222) << 1) + (x & 0x11111111);
    x = (x << 1) + x;

    return x;
}

The easiest way to map a 4-bit input to an 8-bit output is with a 16 entry table. So then it's just a matter of extracting 4 bits at a time from the uint16_t , doing a table lookup, and inserting the 8-bit value into the output.

uint32_t expandBits( uint16_t input )
{
    uint32_t table[16] = {
        0x00, 0x03, 0x0c, 0x0f,
        0x30, 0x33, 0x3c, 0x3f,
        0xc0, 0xc3, 0xcc, 0xcf,
        0xf0, 0xf3, 0xfc, 0xff
    };

    uint32_t output;
    output  = table[(input >> 12) & 0xf] << 24;
    output |= table[(input >>  8) & 0xf] << 16;
    output |= table[(input >>  4) & 0xf] <<  8;
    output |= table[ input        & 0xf];
    return output;
}

This provides a decent compromise between performance and readability. It doesn't have quite the performance of cmaster's over-the-top lookup solution, but it's certainly more understandable than thndrwrks' magical mystery solution. As such, it provides a technique that can be applied to a much larger variety of problems, ie use a small lookup table to solve a larger problem.

In case you want to get some estimate of relative speeds, some community wiki test code. Adjust as needed.

void f_cmp(uint32_t (*f1)(uint16_t x), uint32_t (*f2)(uint16_t x)) {
  uint16_t x = 0;
  do {
    uint32_t y1 = (*f1)(x);
    uint32_t y2 = (*f2)(x);
    if (y1 != y2) {
      printf("%4x %8lX %8lX\n", x, (unsigned long) y1, (unsigned long) y2);
    }
  } while (x++ != 0xFFFF);
}

void f_time(uint32_t (*f1)(uint16_t x)) {
  f_cmp(expand_bits, f1);
  clock_t t1 = clock();
  volatile uint32_t y1 = 0;
  unsigned n = 1000;
  for (unsigned i = 0; i < n; i++) {
    uint16_t x = 0;
    do {
      y1 += (*f1)(x);
    } while (x++ != 0xFFFF);
  }
  clock_t t2 = clock();
  printf("%6llu %6llu: %.6f %lX\n", (unsigned long long) t1,
          (unsigned long long) t2, 1.0 * (t2 - t1) / CLOCKS_PER_SEC / n,
          (unsigned long) y1);
  fflush(stdout);
}

int main(void) {
  f_time(expand_bits);
  f_time(expandBits);
  f_time(remask);
  f_time(javey);
  f_time(thndrwrks_expand);
  // now in the other order
  f_time(thndrwrks_expand);
  f_time(javey);
  f_time(remask);
  f_time(expandBits);
  f_time(expand_bits);
  return 0;
}

Results

     0    280: 0.000280 FE0C0000 // fast
   280    702: 0.000422 FE0C0000
   702   1872: 0.001170 FE0C0000
  1872   3026: 0.001154 FE0C0000
  3026   4399: 0.001373 FE0C0000 // slow

  4399   5740: 0.001341 FE0C0000
  5740   6879: 0.001139 FE0C0000
  6879   8034: 0.001155 FE0C0000
  8034   8470: 0.000436 FE0C0000
  8486   8751: 0.000265 FE0C0000

Here's a working implementation:

uint32_t remask(uint16_t x)
{
    uint32_t i;
    uint32_t result = 0;
    for (i=0;i<16;i++) {
        uint32_t mask = (uint32_t)x & (1U << i);
        result |= mask << (i);
        result |= mask << (i+1);
    }
    return result;
}

On each iteration of the loop, the bit in question from the uint16_t is masked out and stored.

That bit is then shifted by its bit position and ORed into the result, then shifted again by its bit position plus 1 and ORed into the result.

A simple loop. Maybe not bit-hacky enough?

uint32_t thndrwrks_expand(uint16_t x) {
  uint32_t mask = 3;
  uint32_t y = 0;
  while (x) {
    if (x&1) y |= mask;
    x >>= 1;
    mask <<= 2;
  }
  return y;
}

Tried another that is twice as fast. Still 655/272 as slow as expand_bits() . Appears to be fastest 16 loop iteration solution.

uint32_t thndrwrks_expand(uint16_t x) {
  uint32_t y = 0;
  for (uint16_t mask = 0x8000; mask; mask >>= 1) {
    y <<= 1;
    y |= x&mask;
  }
  y *= 3;
  return y;
}

If your concern is performance and simplicity, you are likely best of with a big lookup table (64k entries of 4 bytes each). With that, you can pretty much use any algorithm you like to generate the table, lookup will just be a single memory access.

If that table is too big for your liking, you can split it. For instance, you can use a 8 bit lookup table with 256 entries of 2 bytes each. With that you can perform the entire operation with just two lookups. Bonus is, that this approach allows for type-punning tricks to avoid the hassle of splitting the address with bit operations:

//Implementation defined behavior ahead:
//Works correctly for both little and big endian machines,
//however, results will be wrong on a PDP11...
uint32_t getMask(uint16_t input) {
    assert(sizeof(uint16_t) == 2);
    assert(sizeof(uint32_t) == 4);
    static const uint16_t lookupTable[256] = { 0x0000, 0x0003, 0x000c, 0x000f, ... };

    unsigned char* inputBytes = (unsigned char*)&input;    //legal because we type-pun to char, but the order of the bytes is implementation defined
    char outputBytes[4];
    uint16_t* outputShorts = (uint16_t*)outputBytes;    //legal because we type-pun from char, but the order of the shorts is implementation defined
    outputShorts[0] = lookupTable[inputBytes[0]];
    outputShorts[1] = lookupTable[inputBytes[1]];
    uint32_t output;
    memcpy(&output, outputBytes, 4);    //can't type-pun directly from uint16 to uint32_t due to strict aliasing rules
    return output;
}

The code above works around strict aliasing rules by casting only to/from char , which is an explicit exception to the strict aliasing rules. It also works around the effects of little/big-endian byte order by building the result in the same order as the input was split. However, it still exposes implementation defined behavior: A machine with a byte order of 1, 0, 3, 2 , or other middle endian orders , will silently produce wrong results (there have actually been such CPUs like the PDP11 ...).

Of course, you can split the lookup table even further, but I doubt that would do you any good.

Try this, where input16 is the uint16_t input mask:

uint32_t input32 = (uint32_t) input16;
uint32_t result = 0;
uint32_t i;
for(i=0; i<16; i++)
{
    uint32_t bit_at_i = (input32 & (((uint32_t)1) << i)) >> i;
    result |= ((bit_at_i << (i*2)) | (bit_at_i << ((i*2)+1)));
}
// result is now the 32 bit expanded mask

My solution is meant to run on mainstream x86 PCs and be simple and generic. I did not write this to compete for the fastest and/or shortest implementation. It is just another way to solve the problem submitted by OP.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define BITS_TO_EXPAND (4U)
#define SIZE_MAX (256U)

static bool expand_uint(unsigned int *toexpand,unsigned int *expanded);

int main(void)
{
    unsigned int in = 12;
    unsigned int out = 0;
    bool success;
    char buff[SIZE_MAX];

    success = expand_uint(&in,&out);
    if(false == success)
    {
        (void) puts("Error: expand_uint failed");
        return EXIT_FAILURE;
    }
    (void) snprintf(buff, (size_t) SIZE_MAX,"%u expanded is %u\n",in,out);
    (void) fputs(buff,stdout);
    return EXIT_SUCCESS;
}
/*
** It expands an unsigned int so that every bit in a nibble is copied twice
** in the resultant number. It returns true on success, false otherwise.
*/
static bool expand_uint(unsigned int *toexpand,unsigned int *expanded)
{
    unsigned int i;
    unsigned int shifts = 0;
    unsigned int mask;

    if(NULL == toexpand || NULL == expanded)
    {
        return false;
    }
    *expanded = 0;
    for(i = 0; i < BIT_TO_EXPAND; i++)
    {
        mask = (*toexpand >> i) & 1;
        *expanded |= (mask << shifts);
        ++shifts;
        *expanded |= (mask << shifts);
        ++shifts;
    }
    return true;
}

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM