Bit hack: Expanding bits

I am trying to convert a uint16_t input to a uint32_t bit mask. One bit in the input toggles two bits in the output bit mask. Here is an example converting a 4-bit input to an 8-bit bit mask:

Input    Output
ABCDb -> AABB CCDDb

A,B,C,D are individual bits

Example outputs:

0000b -> 0000 0000b
0001b -> 0000 0011b
0010b -> 0000 1100b
0011b -> 0000 1111b
1100b -> 1111 0000b
1101b -> 1111 0011b
1110b -> 1111 1100b
1111b -> 1111 1111b

Is there a bithack-y way to achieve this behavior?

Interleaving bits by Binary Magic Numbers contained the clue:

uint32_t expand_bits(uint16_t bits)
{
    uint32_t x = bits;

    x = (x | (x << 8)) & 0x00FF00FF;
    x = (x | (x << 4)) & 0x0F0F0F0F;
    x = (x | (x << 2)) & 0x33333333;
    x = (x | (x << 1)) & 0x55555555;

    return x | (x << 1);
}

The first four steps consecutively interleave the source bits in groups of 8, 4, 2, 1 bits with zero bits, resulting in 00AB00CD after the first step, 0A0B0C0D after the second step, and so on. The last step then duplicates each even bit (containing an original source bit) into the neighboring odd bit, thereby achieving the desired bit arrangement.
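To make the intermediate states concrete, here is a minimal sketch of the same idea scaled down to the 4-bit example from the question. The name expand_nibble is made up for illustration; only the last two interleave steps are needed because the input has just four bits.

```c
#include <stdint.h>

/* Hypothetical 4-bit version of expand_bits(): ABCD -> AABBCCDD. */
uint8_t expand_nibble(uint8_t bits)
{
    uint8_t x = bits & 0x0F;              /* 0000ABCD */

    x = (uint8_t)((x | (x << 2)) & 0x33); /* 00AB00CD */
    x = (uint8_t)((x | (x << 1)) & 0x55); /* 0A0B0C0D */

    return (uint8_t)(x | (x << 1));       /* AABBCCDD */
}
```

For the input 1101b the intermediate values are 0011 0001b and 0101 0001b, and the result is 1111 0011b, which matches the table in the question.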

A number of variants are possible. The last step can also be coded as x + (x << 1) or 3 * x. The | operators in the first four steps can be replaced by ^ operators. The masks can also be modified, as some bits are naturally zero and don't need to be cleared. On some processors short masks may be incorporated into machine instructions as immediates, reducing the effort for constructing and/or loading the mask constants. It may also be advantageous to increase instruction-level parallelism for out-of-order processors and optimize for those with shift-add or integer-multiply-add instructions. One code variant incorporating several of these ideas is:

uint32_t expand_bits (uint16_t bits)
{
    uint32_t x = bits;

    x = (x ^ (x << 8)) & ~0x0000FF00;
    x = (x ^ (x << 4)) & ~0x00F000F0;
    x = x ^ (x << 2);
    x = ((x & 0x22222222) << 1) + (x & 0x11111111);
    x = (x << 1) + x;

    return x;
}
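Since the input space is only 65536 values, a variant like this can be checked exhaustively against a plain bit-by-bit reference. The sketch below is one way to do that; expand_reference, expand_variant and count_mismatches are illustrative names, and the variant is copied in so the example stands alone.

```c
#include <stdint.h>

/* Straightforward reference: copy input bit i to output bits 2i and 2i+1. */
uint32_t expand_reference(uint16_t bits)
{
    uint32_t out = 0;
    for (unsigned i = 0; i < 16; i++) {
        if (bits & (1u << i)) {
            out |= (uint32_t)3 << (2 * i);
        }
    }
    return out;
}

/* The variant from the text, copied so that this sketch is self-contained. */
uint32_t expand_variant(uint16_t bits)
{
    uint32_t x = bits;

    x = (x ^ (x << 8)) & ~0x0000FF00;
    x = (x ^ (x << 4)) & ~0x00F000F0;
    x = x ^ (x << 2);
    x = ((x & 0x22222222) << 1) + (x & 0x11111111);
    x = (x << 1) + x;

    return x;
}

/* Returns the number of 16-bit inputs on which the two functions disagree. */
unsigned count_mismatches(void)
{
    unsigned mismatches = 0;
    uint32_t in = 0;
    do {
        if (expand_reference((uint16_t)in) != expand_variant((uint16_t)in)) {
            mismatches++;
        }
    } while (in++ != 0xFFFF);
    return mismatches;
}
```

A result of zero from count_mismatches() means the relaxed masks and the ^ and + substitutions did not change the output for any input.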

The easiest way to map a 4-bit input to an 8-bit output is with a 16-entry table. So then it's just a matter of extracting 4 bits at a time from the uint16_t, doing a table lookup, and inserting the 8-bit value into the output.

uint32_t expandBits( uint16_t input )
{
    // table[n] is the 8-bit expansion of the 4-bit value n
    static const uint32_t table[16] = {
        0x00, 0x03, 0x0c, 0x0f,
        0x30, 0x33, 0x3c, 0x3f,
        0xc0, 0xc3, 0xcc, 0xcf,
        0xf0, 0xf3, 0xfc, 0xff
    };

    uint32_t output;
    output  = table[(input >> 12) & 0xf] << 24;
    output |= table[(input >>  8) & 0xf] << 16;
    output |= table[(input >>  4) & 0xf] <<  8;
    output |= table[ input        & 0xf];
    return output;
}

This provides a decent compromise between performance and readability. It doesn't have quite the performance of cmaster's over-the-top lookup solution, but it's certainly more understandable than thndrwrks' magical mystery solution. As such, it provides a technique that can be applied to a much larger variety of problems, i.e. use a small lookup table to solve a larger problem.
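As a hedged illustration of that "small table, larger problem" pattern, the sketch below fills the 16-entry table at run time instead of typing it by hand, then walks the input one nibble at a time. The names build_nibble_table and expand_small_table are hypothetical, and the lazy initialization is not thread-safe.

```c
#include <stdint.h>

/* Hypothetical helper: fill table[n] with the 8-bit expansion of nibble n. */
void build_nibble_table(uint8_t table[16])
{
    for (unsigned n = 0; n < 16; n++) {
        uint8_t out = 0;
        for (unsigned bit = 0; bit < 4; bit++) {
            if (n & (1u << bit)) {
                out |= (uint8_t)(3u << (2 * bit));
            }
        }
        table[n] = out;
    }
}

/* Expand a 16-bit value one nibble at a time, lowest nibble first. */
uint32_t expand_small_table(uint16_t input)
{
    static uint8_t table[16];
    static int table_ready = 0;

    if (!table_ready) {              /* built once, on the first call */
        build_nibble_table(table);
        table_ready = 1;
    }

    uint32_t output = 0;
    for (unsigned nibble = 0; nibble < 4; nibble++) {
        uint32_t index = (input >> (4 * nibble)) & 0xFu;
        output |= (uint32_t)table[index] << (8 * nibble);
    }
    return output;
}
```

Changing the inner rule in build_nibble_table is enough to reuse the same lookup loop for a different per-nibble mapping.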

In case you want to get some estimate of relative speeds, here is some community wiki test code. Adjust as needed.

#include <stdint.h>
#include <stdio.h>
#include <time.h>

// Print every input on which f1 and f2 disagree.
void f_cmp(uint32_t (*f1)(uint16_t x), uint32_t (*f2)(uint16_t x)) {
  uint16_t x = 0;
  do {
    uint32_t y1 = (*f1)(x);
    uint32_t y2 = (*f2)(x);
    if (y1 != y2) {
      printf("%4x %8lX %8lX\n", x, (unsigned long) y1, (unsigned long) y2);
    }
  } while (x++ != 0xFFFF);
}

// Check f1 against expand_bits(), then time n passes over all 65536 inputs.
void f_time(uint32_t (*f1)(uint16_t x)) {
  f_cmp(expand_bits, f1);
  clock_t t1 = clock();
  volatile uint32_t y1 = 0;
  unsigned n = 1000;
  for (unsigned i = 0; i < n; i++) {
    uint16_t x = 0;
    do {
      y1 += (*f1)(x);
    } while (x++ != 0xFFFF);
  }
  clock_t t2 = clock();
  printf("%6llu %6llu: %.6f %lX\n", (unsigned long long) t1,
          (unsigned long long) t2, 1.0 * (t2 - t1) / CLOCKS_PER_SEC / n,
          (unsigned long) y1);
}

int main(void) {
  // f_time(...) once per implementation under test (calls not shown)
  // now in the other order
  return 0;
}

Results

     0    280: 0.000280 FE0C0000 // fast
   280    702: 0.000422 FE0C0000
   702   1872: 0.001170 FE0C0000
  1872   3026: 0.001154 FE0C0000
  3026   4399: 0.001373 FE0C0000 // slow

  4399   5740: 0.001341 FE0C0000
  5740   6879: 0.001139 FE0C0000
  6879   8034: 0.001155 FE0C0000
  8034   8470: 0.000436 FE0C0000
  8486   8751: 0.000265 FE0C0000

Here's a working implementation:

uint32_t remask(uint16_t x)
{
    uint32_t i;
    uint32_t result = 0;
    for (i=0;i<16;i++) {
        uint32_t mask = (uint32_t)x & (1U << i);   // isolate bit i
        result |= mask << (i);                     // copy to bit 2*i
        result |= mask << (i+1);                   // copy to bit 2*i+1
    }
    return result;
}

On each iteration of the loop, the bit in question from the uint16_t is masked out and stored.

That bit is then shifted by its bit position and ORed into the result, then shifted again by its bit position plus 1 and ORed into the result.

A simple loop. Maybe not bit-hacky enough?

uint32_t thndrwrks_expand(uint16_t x) {
  uint32_t mask = 3;
  uint32_t y = 0;
  while (x) {
    if (x&1) y |= mask;
    x >>= 1;
    mask <<= 2;
  }
  return y;
}

Tried another that is twice as fast. It still takes 655/272 as long as expand_bits(). Appears to be the fastest solution that uses 16 loop iterations.

uint32_t thndrwrks_expand(uint16_t x) {
  uint32_t y = 0;
  for (uint16_t mask = 0x8000; mask; mask >>= 1) {
    y <<= 1;
    y |= x&mask;
  }
  y *= 3;
  return y;
}

If your concern is performance and simplicity, you are likely best off with a big lookup table (64k entries of 4 bytes each). With that, you can use pretty much any algorithm you like to generate the table; the lookup will be just a single memory access.
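A minimal sketch of that big-table idea follows. The names big_table, init_big_table and expand_big_table are made up for illustration, and the generator is a plain loop chosen only because, as noted, any correct algorithm will do.

```c
#include <stdint.h>

/* 65536 entries * 4 bytes = 256 KiB. */
static uint32_t big_table[65536];

/* Must be called once before expand_big_table(). */
void init_big_table(void)
{
    for (uint32_t in = 0; in <= 0xFFFF; in++) {
        uint32_t out = 0;
        for (unsigned bit = 0; bit < 16; bit++) {
            if (in & (1u << bit)) {
                out |= (uint32_t)3 << (2 * bit);
            }
        }
        big_table[in] = out;
    }
}

/* The lookup itself is a single indexed load. */
uint32_t expand_big_table(uint16_t input)
{
    return big_table[input];
}
```

Whether this beats the shift-and-mask version in practice depends on how well the 256 KiB table stays in cache, so it is worth measuring rather than assuming.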

If that table is too big for your liking, you can split it. For instance, you can use an 8-bit lookup table with 256 entries of 2 bytes each. With that you can perform the entire operation with just two lookups. A bonus is that this approach allows for type-punning tricks to avoid the hassle of splitting the address with bit operations:

#include <assert.h>
#include <stdint.h>
#include <string.h>

//Implementation defined behavior ahead:
//Works correctly for both little and big endian machines,
//however, results will be wrong on a PDP11...
uint32_t getMask(uint16_t input) {
    assert(sizeof(uint16_t) == 2);
    assert(sizeof(uint32_t) == 4);
    static const uint16_t lookupTable[256] = { 0x0000, 0x0003, 0x000c, 0x000f, ... };

    unsigned char* inputBytes = (unsigned char*)&input;    //legal because we type-pun to char, but the order of the bytes is implementation defined
    uint16_t outputShorts[2];    //filled in the same order as the input bytes are read; the order of the shorts is implementation defined
    outputShorts[0] = lookupTable[inputBytes[0]];
    outputShorts[1] = lookupTable[inputBytes[1]];
    uint32_t output;
    memcpy(&output, outputShorts, sizeof output);    //can't type-pun directly from uint16_t to uint32_t due to strict aliasing rules
    return output;
}

The code above works around the strict aliasing rules by going only through char (a cast to unsigned char* and memcpy), which is an explicit exception to those rules. It also works around the effects of little/big-endian byte order by building the result in the same order as the input was split. However, it still exposes implementation-defined behavior: a machine with a byte order of 1, 0, 3, 2, or another middle-endian order, will silently produce wrong results (there have actually been such CPUs, like the PDP11...).

Of course, you can split the lookup table even further, but I doubt that would do you any good.
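For comparison with the type-punning version above, here is a sketch of the same two-lookup scheme that splits the input with a mask and a shift, so the result does not depend on byte order. The names byte_table, init_byte_table and expand_two_lookups are hypothetical, and the table is generated at run time here rather than written out as a constant initializer.

```c
#include <stdint.h>

/* 256 entries of 2 bytes each: byte_table[b] is the 16-bit expansion of byte b. */
static uint16_t byte_table[256];

/* Must be called once before expand_two_lookups(). */
void init_byte_table(void)
{
    for (unsigned in = 0; in < 256; in++) {
        uint16_t out = 0;
        for (unsigned bit = 0; bit < 8; bit++) {
            if (in & (1u << bit)) {
                out |= (uint16_t)(3u << (2 * bit));
            }
        }
        byte_table[in] = out;
    }
}

/* Two lookups, combined with shifts so byte order never matters. */
uint32_t expand_two_lookups(uint16_t input)
{
    uint32_t low  = byte_table[input & 0xFF];
    uint32_t high = byte_table[input >> 8];
    return (high << 16) | low;
}
```

The cost relative to the type-punning version is one mask and two shifts, which a compiler will often reduce to very little.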

Try this, where input16 is the uint16_t input mask:

uint32_t input32 = (uint32_t) input16;
uint32_t result = 0;
uint32_t i;
for(i=0; i<16; i++)
{
    uint32_t bit_at_i = (input32 & (((uint32_t)1) << i)) >> i;
    result |= ((bit_at_i << (i*2)) | (bit_at_i << ((i*2)+1)));
}
// result is now the 32 bit expanded mask

My solution is meant to run on mainstream x86 PCs and be simple and generic. I did not write this to compete for the fastest and/or shortest implementation. It is just another way to solve the problem submitted by OP.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define BITS_TO_EXPAND (4U)
#define BUFF_SIZE (256U)

static bool expand_uint(unsigned int *toexpand,unsigned int *expanded);

int main(void)
{
    unsigned int in = 12;
    unsigned int out = 0;
    bool success;
    char buff[BUFF_SIZE];

    success = expand_uint(&in,&out);
    if(false == success)
    {
        (void) puts("Error: expand_uint failed");
        return EXIT_FAILURE;
    }
    (void) snprintf(buff, (size_t) BUFF_SIZE,"%u expanded is %u\n",in,out);
    (void) fputs(buff,stdout);
    return EXIT_SUCCESS;
}

/*
** It expands an unsigned int so that every bit in a nibble is copied twice
** in the resultant number. It returns true on success, false otherwise.
*/
static bool expand_uint(unsigned int *toexpand,unsigned int *expanded)
{
    unsigned int i;
    unsigned int shifts = 0;
    unsigned int mask;

    if(NULL == toexpand || NULL == expanded)
    {
        return false;
    }
    *expanded = 0;
    for(i = 0; i < BITS_TO_EXPAND; i++)
    {
        mask = (*toexpand >> i) & 1;
        *expanded |= (mask << shifts);   /* first copy of bit i */
        ++shifts;
        *expanded |= (mask << shifts);   /* second copy of bit i */
        ++shifts;
    }
    return true;
}
