
Most efficient way to find the index of the only '1' bit in a char variable (in C)

This is an interview question:
You are given a char variable named ch, and you know that in its binary form exactly one of its eight bits is equal to '1'. That is, the only possible values for ch are: 0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80.
Given the variable ch, I need to write the most efficient code to get the index of that '1' bit. For example: if ch == 0x1, the result is 0; if ch == 0x4, the result is 2.

The obvious way is to use switch-case, but I need something more efficient.
Is there any bit manipulation you can do here for efficient implementation?

An unsigned char variable is supposedly only 8 bits wide. To encode the position of the bit we need only 3 bits. That means we can build a 24-bit "table" that contains all 8 possible 3-bit answers in their natural order:

111 110 101 100 011 010 001 000 = 0xFAC688

If your variable ch is known to contain only one 1 bit, then it is a power of 2. Dividing something by ch will right-shift the original value by the index of your 1 bit. So, if we divide the above "table" by your ch three times the answer will get shifted to the lowest 3 bits of the result

unsigned position = (0xFAC688 / ch / ch / ch) & 0x7;

End of story. The above could probably be rewritten more efficiently, while preserving the general principle.
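A minimal sketch of the expression above wrapped in a function. The name bit_index_packed and the assert-based precondition are illustrative assumptions, not part of the original answer.

```c
#include <assert.h>

/* Hypothetical helper name; assumes ch has exactly one bit set. */
static unsigned bit_index_packed(unsigned char ch)
{
    assert(ch != 0 && (ch & (ch - 1)) == 0);  /* precondition: power of 2 */
    /* 0xFAC688 packs the 3-bit answers 7..0. Dividing by ch three times
       shifts right by 3 * index, leaving the answer in the low 3 bits. */
    return (0xFAC688u / ch / ch / ch) & 0x7u;
}
```

For example, bit_index_packed(0x10) divides by 16 three times (a right shift by 12), which leaves 100b = 4 in the low three bits.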


Note, that this is basically the same principle that's used in the approaches based on De Bruijn sequences. However, the purpose of De Bruijn sequence is to pack the index table in situations when the original "unpacked" table (like my table above) does not fit into an integer. As an "unpleasant" side effect, De Bruijn sequence reorders the index table, breaking the original natural sequence of indices. This requires extra re-mapping efforts to extract the proper result from the De Bruijn sequence.

With only 24 bits we don't have this problem here, which means that there's no need to involve De Bruijn and its accompanying tricks.

On the other hand, a packed table requires a shorter shift, which will simplify (and thus optimize) the calculation of the divisor to achieve the desired shift's length. In case of De Bruijn sequence, there's no need to calculate the divisor at all - your ch is already it. So, De Bruijn sequence might easily end up being more efficient.

Well, if ch has a single bit set, then the count of 1 bits in ch-1 is the index of that bit. Ideally, you'd want to find that without looping or branching, since branches are expensive, so I'd write that something like this:

int index = ((unsigned char)ch)-1;
index = ((index & 0xAA)>>1)+(index & 0x55);  //sums of pairs of bits
index = ((index & 0xCC)>>2)+(index & 0x33);  //sums of 4s of bits
index = ((index & 0xF0)>>4)+(index & 0x0F);  //sum of 8 bits

There is also an extremely clever answer using fewer operations at the cost of a multiplication and a lookup:

int index = indexMap[((((int)(unsigned char)ch)*DEBRUIJN)>>16)&7];

The bits in DEBRUIJN must be a De Bruijn sequence ( https://en.wikipedia.org/wiki/De_Bruijn_sequence ), ensuring that lookup index will be different for every value of ch . indexMap maps those lookup indexes to the results you want.

Note also that, following @rici's comment, indexMap is so small that you can pack it into a single int.
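The answer leaves DEBRUIJN and indexMap undefined. One possible concrete choice, shown here purely as an assumption, is the 8-bit De Bruijn sequence 00011101 (0x1D) pre-shifted left by 11 bits so that the ">> 16" picks out a 3-bit window. The function name debruijn_index is hypothetical, and the sketch assumes int is at least 32 bits wide.

```c
#include <assert.h>

/* Assumed constants for illustration only: 0x1D is the 8-bit De Bruijn
   sequence 00011101, shifted left by 11 so that (ch * DEBRUIJN) >> 16
   equals (ch * 0x1D) >> 5. */
#define DEBRUIJN (0x1D << 11)

/* Maps each 3-bit window value back to the bit index that produced it. */
static const unsigned char indexMap[8] = { 0, 1, 6, 2, 7, 5, 4, 3 };

static int debruijn_index(unsigned char ch)
{
    assert(ch != 0 && (ch & (ch - 1)) == 0);  /* precondition: power of 2 */
    return indexMap[((((int)(unsigned char)ch) * DEBRUIJN) >> 16) & 7];
}
```

Other De Bruijn constants work too, but each needs its own indexMap.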

The type char may be either signed or unsigned (implementation-defined behavior). In order to safely operate on the value 0x80 we should operate explicitly with unsigned char data.
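A small sketch of that precaution: convert the plain char to unsigned char once, at the boundary, and do all arithmetic on the unsigned value. The function name index_from_char is hypothetical, and the loop is just a placeholder for whichever method is used afterwards.

```c
#include <assert.h>

/* Hypothetical wrapper: convert a plain char to unsigned char before any
   arithmetic, so that 0x80 is treated as 128 even where char is signed. */
static unsigned index_from_char(char ch)
{
    unsigned char u = (unsigned char)ch;
    assert(u != 0 && (u & (u - 1)) == 0);  /* precondition: power of 2 */
    unsigned index = 0;
    while (u >>= 1)   /* count how many shifts it takes to reach zero */
        ++index;
    return index;
}
```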

I assume that there are no special functions available that give us the bit position more or less directly, such as ffs() (find first set), clz() (count leading zeros), or popcount() (population count), and that we are to determine the bit position using just standard ISO C.

One approach is to spread each bit position in ch to a separate nibble (a four-bit group), then perform an in-register table lookup, where each table element comprises one nibble in a 32-bit int .

The expansion can be accomplished by squaring the input twice, which moves bit [i] to bit [4*i]. Code below then uses a special trick to allow extraction of the table element with a multiply and a right shift, where the multiply moves the desired table entry into bits [31:28] of the intermediate result. Note that the table is specified in a readable way and equates to the constant 0x01234567 , a substitution every reasonable compiler will make.

Compiler Explorer (Godbolt) shows that most of the execution time cost of uchar_bitpos() is three dependent integer multiplies plus a couple of other instructions.

This code assumes 8-bit char and 32-bit int . For better portability unsigned char variables could be turned into uint8_t variables and unsigned int variables could be turned into uint32_t variables.

#include <stdio.h>
#include <stdlib.h>

int uchar_bitpos (unsigned char ch)
{
    unsigned int ch_pow2, ch_pow4;
    const unsigned int table =
        ((0 << 28) | (1 << 24) | (2 << 20) | (3 << 16) | 
         (4 << 12) | (5 <<  8) | (6 <<  4) | (7 <<  0));
    ch_pow2 = ch * ch;
    ch_pow4 = ch_pow2 * ch_pow2;
    return (ch_pow4 * table) >> 28;
}

int main (void)
{
    unsigned char a = 0x80;
    do {
        printf ("a = %2x   bitpos=%d\n", a, uchar_bitpos (a));
        a = a / 2;
    } while (a);
    return EXIT_SUCCESS;
}

The output of the above program should look as follows:

a = 80   bitpos=7
a = 40   bitpos=6
a = 20   bitpos=5
a = 10   bitpos=4
a =  8   bitpos=3
a =  4   bitpos=2
a =  2   bitpos=1
a =  1   bitpos=0
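The portability suggestion above (fixed-width types instead of unsigned char and unsigned int) might look like the following sketch. The name u8_bitpos is an assumption, and the extra cast to uint32_t before the shift is added here to keep the wrap-around at 32 bits even if int happens to be wider.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical fixed-width variant of uchar_bitpos(). */
static int u8_bitpos(uint8_t ch)
{
    assert(ch != 0 && (ch & (ch - 1)) == 0);  /* precondition: power of 2 */
    const uint32_t table = UINT32_C(0x01234567);
    uint32_t ch_pow2 = (uint32_t)ch * ch;       /* bit i moves to bit 2*i */
    uint32_t ch_pow4 = ch_pow2 * ch_pow2;       /* bit i moves to bit 4*i */
    /* The multiply shifts the table left by 4*i bits, so nibble i ends up
       in bits [31:28]. */
    return (int)((uint32_t)(ch_pow4 * table) >> 28);
}
```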

"write the most efficient code to get the index of that '1' bit."

The most efficient code would be to somehow map the value of ch to its bit index, ie:

0x01 -> 0
0x02 -> 1
0x04 -> 2
0x08 -> 3
...

Naive Mapping Table

The simplest and most naive solution requires a lookup in a mapping table covering all possible values of ch. For 8-bit numbers (char) we need a table with 2^8 = 256 elements:

char naive_table[256];

naive_table[0x01] = 0;
naive_table[0x02] = 1;
naive_table[0x04] = 2;
naive_table[0x08] = 3;
naive_table[0x10] = 4;
naive_table[0x20] = 5;
naive_table[0x40] = 6;
naive_table[0x80] = 7;

The lookup in this table is also simple:

index = naive_table[ch];
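As a variation, the same table can be filled in at compile time with C99 designated initializers instead of run-time assignments. This is only a sketch; the names naive_table_init and naive_index are assumptions.

```c
#include <assert.h>

/* Entries not listed are zero-initialized; only the eight powers of two
   carry a meaningful value. */
static const unsigned char naive_table_init[256] = {
    [0x01] = 0, [0x02] = 1, [0x04] = 2, [0x08] = 3,
    [0x10] = 4, [0x20] = 5, [0x40] = 6, [0x80] = 7,
};

static int naive_index(unsigned char ch)
{
    assert(ch != 0 && (ch & (ch - 1)) == 0);  /* precondition: power of 2 */
    return naive_table_init[ch];
}
```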

Hash Function + Mapping Table

The previous solution is simple and fast, but most of the elements of naive_table are wasted. Taking into account that ch is a power of two, for any n-bit number there are just n possible indexes.

So, instead of using a mapping table with 2^8 elements, we could use a table with just 8 elements and a hash function which would map the value of ch to a unique index of the mapping table.

The perfect candidate for such a hash function would be a function using the de Bruijn sequence. There is a paper "Using de Bruijn Sequences to Index a 1 in a Computer Word" which states:

A length-n de Bruijn sequence, where n is an exact power of 2, is a cyclic sequence of n 0's and 1's such that every 0-1 sequence of length lg n occurs exactly once as a contiguous substring.

For example, a length-8 de Bruijn sequence is 00011101. Each 3-bit number occurs exactly once as a contiguous substring: starting from the leftmost 3 bits and moving a 3-bit window right one bit at a time, we have 000, 001, 011, 111, 110, 101, 010 (wrapping around), 100 (also wrapping around).

The hash function is computed by: h(x)=(x * deBruijn)>>(n - lg n)
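The sliding-window example from the quoted passage can be reproduced with a few lines of C. This is a sketch for checking the claim by hand; the function name debruijn_window is an assumption.

```c
#include <assert.h>

/* Returns the 3-bit window of the cyclic 8-bit sequence 00011101 (0x1D)
   that starts `start` positions from the left (start in 0..7). */
static unsigned debruijn_window(unsigned start)
{
    assert(start < 8);
    unsigned seq = 0x1Du;
    /* Rotate left by `start` within 8 bits, then keep the top 3 bits. */
    unsigned rotated = ((seq << start) | (seq >> (8 - start))) & 0xFFu;
    return rotated >> 5;
}
```

Calling it with start = 0 through 7 yields 000, 001, 011, 111, 110, 101, 010, 100, i.e. each 3-bit value exactly once.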

So, let us try this hash function to get a unique index in our compact lookup table:

h(ch) = ((ch * 00011101b) >> (8 - 3)) & 0x7
h(ch) = ((ch * 29) >> 5) & 0x7

Let us calculate the hashes for all values of ch and make sure the hash function works as expected, ie all the hashes are unique:

ch    h(ch)
0x01  ((1 * 29) >> 5) & 0x7 = 0
0x02  ((2 * 29) >> 5) & 0x7 = 1
0x04  ((4 * 29) >> 5) & 0x7 = 3
0x08  ((8 * 29) >> 5) & 0x7 = 7
0x10  ((16 * 29) >> 5) & 0x7 = 6
0x20  ((32 * 29) >> 5) & 0x7 = 5
0x40  ((64 * 29) >> 5) & 0x7 = 2
0x80  ((128 * 29) >> 5) & 0x7 = 4

So the hash function works fine and produces unique hashes for each power of two value of ch .
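The uniqueness check in the table above can also be done mechanically. The following sketch uses hypothetical helper names (hash8, hashes_are_unique) and simply records which hash values have been seen.

```c
#include <assert.h>

/* The hash from the text: multiply by 00011101b (29), keep bits 5..7. */
static unsigned hash8(unsigned char ch)
{
    assert(ch != 0 && (ch & (ch - 1)) == 0);  /* precondition: power of 2 */
    return ((ch * 29u) >> 5) & 0x7u;
}

/* Returns 1 if the eight powers of two map to eight distinct hash values. */
static int hashes_are_unique(void)
{
    unsigned seen = 0;   /* bit h is set once hash value h has occurred */
    for (unsigned ch = 1; ch <= 0x80; ch <<= 1) {
        unsigned bit = 1u << hash8((unsigned char)ch);
        if (seen & bit)
            return 0;    /* collision */
        seen |= bit;
    }
    return seen == 0xFFu;
}
```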

Now let us create a compact mapping table using the hash values from the table above:

char compact_table[8];

compact_table[0] = 0;
compact_table[1] = 1;
compact_table[3] = 2;
compact_table[7] = 3;
compact_table[6] = 4;
compact_table[5] = 5;
compact_table[2] = 6;
compact_table[4] = 7;

Now for the lookup we use a hash value as an index:

h = ((ch * 29) >> 5) & 0x7;
index = compact_table[h];

Hash Function + Bit String

The previous version is nearly perfect: there are no more wasted elements in the mapping table. But since all the indexes are within 0-7 (ie just 3-bit values), there is still a room for improvement. Let us use a bit string instead of the mapping table so the most significant bits of each element are not wasted.

First, let us create such a bit string using all the values of ch and the hash values from the previous version:

ch    h(ch)  index
0x01  0      0 (000b)
0x02  1      1 (001b)
0x04  3      2 (010b)
0x08  7      3 (011b)
0x10  6      4 (100b)
0x20  5      5 (101b)
0x40  2      6 (110b)
0x80  4      7 (111b)

Now let us order this table by the hash value:

ch    h(ch)  index
0x01  0      0 (000b)
0x02  1      1 (001b)
0x40  2      6 (110b)
0x04  3      2 (010b)
0x80  4      7 (111b)
0x20  5      5 (101b)
0x10  6      4 (100b)
0x08  7      3 (011b)

So the bit string will be a reversed concatenation of those 3-bit indexes:

011 100 101 111 010 110 001 000 = 0x72f588

Now let us look up the index in this bit string just like we did previously. Note that each index occupies 3 bits, so we need to multiply the hash value by 3 to get the shift amount:

h = ((ch * 29) >> 5) & 0x7; // just like before
bit_string = 0x72f588;
index = (bit_string >> (h * 3)) & 0x7;

Or in short:

index = (0x72f588 >> ((((ch * 29) >> 5) & 0x7) * 3)) & 0x7;

There are no divisions/modulos/conditions in the code, so it should perform fast on any CPU.

The proof-of-concept code:

#include <stdio.h>

int main(void)
{
    unsigned char ch;
    /* ch wraps to 0 after 0x80, which ends the loop */
    for (ch = 1; ch; ch <<= 1) {
        int index = (0x72f588 >> ((((ch * 29) >> 5) & 7) * 3)) & 7;
        printf("ch = 0x%02x index = %d\n", ch, index);
    }
    return 0;
}

A fast and quite portable solution is:

int charindex(unsigned char c){
    union {   /* Assume both float and int are 32 bits, assume IEEE 754 floating point. */
        int i;
        float f;
    } x;
    x.f = (float)c;
    return (x.i >> 23) - 127;
}

Note that many processors have hardware support for counting the number of leading or trailing zeros of an integer. With gcc it is easy to access these particular instructions: gcc has the builtin function __builtin_ctz() which is probably more efficient than charindex on platforms with suitable hardware support.
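A sketch of the builtin-based variant mentioned above. __builtin_ctz is a GCC/Clang extension; the function name ctz_index and the portable fallback branch are assumptions added for illustration.

```c
#include <assert.h>

static int ctz_index(unsigned char c)
{
    assert(c != 0);   /* __builtin_ctz(0) has an undefined result */
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_ctz((unsigned)c);   /* count trailing zero bits */
#else
    int index = 0;    /* portable fallback: linear scan from bit 0 */
    while (!((c >> index) & 1))
        ++index;
    return index;
#endif
}
```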

If efficiency is measured in lines of code, a linear search through the bits could be the answer.

short bit=0;
const char one=1;
while(!((ch >> bit) & one)) ++bit;

Of course error checking is probably a good idea so you could also add a check to make sure you are still in a valid bit.

short bit=0;
const char one=1;
/* test bit 0 first; bit == 8 afterwards means no bit was set */
while(bit < 8 && !((ch >> bit) & one)) ++bit;

It definitely isn't as computationally efficient, and it would fail to detect when more than one bit was set so the switch case is still probably the way to go for correctness.

This version has fewer jumps in the assembly than a switch-case does, so it may be more efficient at computing the bit. A result of 8 means no bit was set.

short bit=
    ch&0x1?0:
    (ch&0x2?1:
    (ch&0x4?2:
    (ch&0x8?3:
    (ch&0x10?4:
    (ch&0x20?5:
    (ch&0x40?6:
    (ch&0x80?7:8)))))));

You could also skip checking the last bit and assume that, if nothing else matches, bit 7 is set, which saves one comparison.

short bit=
    ch&0x1?0:
    (ch&0x2?1:
    (ch&0x4?2:
    (ch&0x8?3:
    (ch&0x10?4:
    (ch&0x20?5:
    (ch&0x40?6:7))))));

A few methods that are not going to be hyper-efficient (depending on your definition of efficiency).

Loop and shift method.

int ch = 32;
int i;
for ( i=1;ch >>i ; i++) 
  printf("%i %i \n",i, ch>>i);
printf("Final index:%i\n",i-1);

Calling math.h log2

int l=log2((double)ch);
printf("math log2:%i\n",l);

More efficient: For a single lookup it is probably difficult to beat AnT's version. But for repeated lookups, a lookup table might perform better.

int ltable[256]= { -1 };  /* note: only ltable[0] is -1; all other entries are zero-initialized */

void initTable()
{
  ltable[0x01]=0;
  ltable[0x02]=1;
  ltable[0x04]=2;
  ltable[0x08]=3;
  ltable[0x10]=4;
  ltable[0x20]=5;
  ltable[0x40]=6;
  ltable[0x80]=7;
}

int lookup(size_t ch)
{
  return  ltable[ch];
}

Table init ASM

init():
  push rbp
  mov rbp, rsp
  mov DWORD PTR ltable[rip+4], 0
  mov DWORD PTR ltable[rip+8], 1
  mov DWORD PTR ltable[rip+16], 2
  mov DWORD PTR ltable[rip+32], 3
  mov DWORD PTR ltable[rip+64], 4
  mov DWORD PTR ltable[rip+128], 5
  mov DWORD PTR ltable[rip+256], 6
  mov DWORD PTR ltable[rip+512], 7
  nop
  pop rbp
  ret

Table lookup ASM

lookup(unsigned long):
  push rbp
  mov rbp, rsp
  mov QWORD PTR [rbp-8], rdi
  mov rax, QWORD PTR [rbp-8]
  mov eax, DWORD PTR ltable[0+rax*4]
  pop rbp
  ret

Outputs

 1 16 
 2 8 
 3 4 
 4 2 
 5 1 
 Final index:5
 math log2:5
 Lookup[32]=>5

You can use a binary search technique here to reduce the number of comparisons from 7 to 3.

assert(n != 0 && (n & (n - 1)) == 0);  /* exactly one bit set */
if(n & 0x0F) {
    if(n & 0x03){
        if(n & 0x01){
            idx = 0;
        }
        else{
            idx = 1;
        }
    }else{
        if(n & 0x04){
            idx = 2;
        }
        else{
            idx = 3;
        }
    }
}else{
    if(n & 0x30){
        if(n & 0x10){
            idx = 4;
        }
        else{
            idx = 5;
        }
    }else{
        if(n & 0x40){
            idx = 6;
        }
        else{
            idx = 7;
        }
    }
}

Some architectures contain efficient (single instruction) implementation of popcount , available in C-compilers through intrinsics or __builtin_popcount() .

If this is the case, it will be hard to beat popcount(x - 1) , which will first convert the single set bit (1 << n) to a run of bits from (1 << (n-1)) .. 1, or 0 when x==1, then count the number of ones, which is the index of the original n .

Some comments point out "Bit Scan Forward"; however, at least on x86 architectures, that is inferior to popcount. Always know your HW...
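A sketch of the popcount(x - 1) idea. __builtin_popcount is a GCC/Clang extension; the function name popcount_index and the portable fallback loop are assumptions added for illustration.

```c
#include <assert.h>

static int popcount_index(unsigned char x)
{
    assert(x != 0 && (x & (x - 1)) == 0);  /* precondition: power of 2 */
    /* x - 1 has a 1 in every position below the single set bit of x. */
    unsigned below = (unsigned)x - 1u;
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_popcount(below);
#else
    int count = 0;
    while (below) {          /* clear the lowest set bit each iteration */
        below &= below - 1;
        ++count;
    }
    return count;
#endif
}
```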

If you have only one bit set to 1 , that means it is a power of 2 . You can directly get the index by taking log of ch . You have to use 2 based log of course.

The simplest solution might not be the fastest, but only profiling against other solutions will let you determine that, and only for a given architecture and compiler.

Here is a very simple solution:

#include <math.h>

int leadingbit(unsigned char c) {
    return log2(c);
}

Here is a solution with a lookup table:

int leadingbit(unsigned char c) {
#define D(x) ((x) + !(x))   /* divisor: 1 when x == 0, avoiding division by zero */
#define N(x) ((076543210 / D(x) / D(x) / D(x)) & 7)
#define N8(x) N(x), N(x+1), N(x+2), N(x+3), N(x+4), N(x+5), N(x+6), N(x+7)
#define N32(x) N8(x), N8(x+8), N8(x+16), N8(x+24)
    static unsigned char table[256] = {
        N32(0), N32(32), N32(64), N32(96), N32(128), N32(160), N32(192), N32(224),
    };
#undef D
#undef N
#undef N8
#undef N32
    return table[c];
}

Here is one inspired by Matt Timmermans without a memory reference:

int leadingbit(unsigned char c) {
    int n = c - 1;
    n = ((n & 0xAA) >> 1) + (n & 0x55);  //sums of pairs of bits
    n = ((n & 0xCC) >> 2) + (n & 0x33);  //sums of 4s of bits
    return ((n >> 4) + n) & 7;
}

Here is one using the non-portable __builtin_clz() function (count leading zeroes), available in GCC and Clang:

#include <limits.h>

int leadingbit(unsigned char c) {
    return CHAR_BIT * sizeof(unsigned) - 1 - __builtin_clz((unsigned)c);
}

Note that all of the above assume that c is a power of 2 , the behavior for other values is potentially undefined. You can check that c is a power of 2 with a simple expression:

if (c && !(c & (c - 1))) {
    /* c is a power of 2 */
}
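Combining that check with one of the methods gives a version that reports invalid input instead of returning garbage. This is only a sketch: the name checked_bit_index and the -1 error value are assumptions, and the packed-table division stands in for any of the methods above.

```c
#include <assert.h>

/* Hypothetical wrapper: returns the bit index, or -1 if c is not a power of 2. */
static int checked_bit_index(unsigned char c)
{
    if (!(c && !(c & (c - 1))))
        return -1;                       /* zero, or more than one bit set */
    /* 0xFAC688 is the same packed table as the octal constant 076543210. */
    int index = (int)((0xFAC688u / c / c / c) & 7u);
    assert(index >= 0 && index < 8);
    return index;
}
```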
