
“Counting number of bits set” - why this algorithm?

I came across this algorithm submitted here: How does this algorithm to count the number of set bits in a 32-bit integer work?

I benchmarked the version submitted by "yeer" in that link, since the two algorithms given there look more or less the same.

I also wrote the traditional (and, I assumed, slower) method to check how much the performance improves.

yeer's code:

unsigned int number = 0xFFFFFFFF;

/* each line merges adjacent counts; the field width doubles every round */
number = (number & 0x55555555) + ((number >> 1) & 0x55555555);  /* 2-bit fields  */
number = (number & 0x33333333) + ((number >> 2) & 0x33333333);  /* 4-bit fields  */
number = (number & 0x0F0F0F0F) + ((number >> 4) & 0x0F0F0F0F);  /* bytes         */
number = (number & 0x00FF00FF) + ((number >> 8) & 0x00FF00FF);  /* 16-bit halves */
number = (number & 0x0000FFFF) + ((number >> 16) & 0x0000FFFF); /* whole word    */

printf("%u", number);  /* %u rather than %d: number is unsigned */

traditional way:

unsigned int number = 0xFFFFFFFF;
unsigned char i = 0;
unsigned char count = 0;

/* test each of the 32 bit positions in turn */
for (i = 0; i < 32; i++)
{
    if ((number >> i) & 1)
        count++;
}

printf("%d", count);

To my surprise, the second snippet beats yeer's method.

For input value 0xFF (variable declared as unsigned char): traditional = 0.047s, yeer's = 0.063s.
For input value 0xFFFFFFFF (variable declared as unsigned int): traditional = 0.141s, yeer's = 0.141s.

What's so special about the other algorithm?

I used the Code::Blocks IDE to run both programs.

I ran a simple benchmark of the different snippets. Each snippet was executed 100 million times. The code was compiled with gcc with no optimization flags. Each case was executed a couple of times to make sure the result was not distorted unduly by other system activity; the execution times varied by less than 10% when re-run.

As the timings below show, the algorithm submitted by yeer is much faster than the loop-based versions.

Test driver:

int main(int argc, char *argv[]) {
    int i, result;
    /* note: result is never read afterwards; this only works as a benchmark
       because the code is compiled without optimization (see the variant below) */
    for (i = 0; i < 100000000; i++) {
        result = SWAR(i);
    }
    return 0;
}
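One caveat: because result is assigned but never used, an optimizing compiler would be free to delete the whole loop and time nothing at all. A minimal variant that survives optimization (my sketch, not from the original answer; it assumes the same SWAR() signature):

#include <stdio.h>

int SWAR(unsigned int i);  /* link against any of the Code 1-4 variants */

int main(void) {
    unsigned int i;
    unsigned long long sum = 0;  /* accumulate so the calls cannot be optimized away */
    for (i = 0; i < 100000000u; i++) {
        sum += (unsigned int)SWAR(i);
    }
    printf("checksum: %llu\n", sum);  /* using the result forces the loop to run */
    return 0;
}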

Code 1, the optimized yeer code:

int SWAR(unsigned int i) {
    i = i - ((i >> 1) & 0x55555555);                 /* count bits per 2-bit field */
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333);  /* count bits per 4-bit field */
    return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;  /* per-byte counts, summed */
}

Time: 0.772s
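The last line packs two steps into one: after (i + (i >> 4)) & 0x0F0F0F0F, each byte of i holds that byte's bit count (a value from 0 to 8), and multiplying by 0x01010101 adds all four bytes into the most significant byte, which >> 24 then extracts. A quick self-check of just that step (my sketch, not part of the original answer):

#include <stdio.h>

int main(void) {
    /* suppose the four per-byte counts after masking are 3, 5, 0 and 8 */
    unsigned int i = 0x03050008;
    /* the multiply makes the top byte 0x03 + 0x05 + 0x00 + 0x08 = 0x10 = 16;
       no carries can disturb it because the counts sum to at most 32 */
    printf("%u\n", (i * 0x01010101) >> 24);  /* prints 16 */
    return 0;
}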

Code 2, the un-optimized version:

int SWAR(unsigned int number) {
    /* the same code as in the question */
    number = (number & 0x55555555) + ((number >> 1) & 0x55555555);
    number = (number & 0x33333333) + ((number >> 2) & 0x33333333);
    number = (number & 0x0F0F0F0F) + ((number >> 4) & 0x0F0F0F0F);
    number = (number & 0x00FF00FF) + ((number >> 8) & 0x00FF00FF);
    number = (number & 0x0000FFFF) + ((number >> 16) & 0x0000FFFF);
    return number;
}

Time: 1.241s
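To watch the five rounds converge, add printf("%08X\n", number); after each assignment; for the input 0xFFFFFFFF the output is (a trace I added for illustration, easy to reproduce):

AAAAAAAA   after round 1: sixteen 2-bit fields, each holding 2
44444444   after round 2: eight 4-bit fields, each holding 4
08080808   after round 3: four bytes, each holding 8
00100010   after round 4: two 16-bit halves, each holding 16
00000020   after round 5: the final count, 0x20 = 32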

Code 3, bit counter without if:

int SWAR(unsigned int number) {
    int i, count = 0;
    /* branch-free: add the tested bit directly instead of branching on it */
    for (i = 0; i < 32; i++) {
        count += (number >> i) & 1;
    }
    return count;
}

Time: 8.921s

Code 4, bit counter with if:

int SWAR(unsigned int number) {
    int i, count = 0;
    /* one conditional branch per bit; this is the question's approach */
    for (i = 0; i < 32; i++) {
        if ((number >> i) & 1) {
            count++;
        }
    }
    return count;
}

Time: 21.058s

The first method has no branching and on most systems compiles to roughly 15 machine instructions (5 adds, 5 shifts and 5 ANDs).

Your method executes typically around 128 instructions (32 iterations of roughly 4 instructions each), and even with branch prediction most CPUs will have to flush their pipeline at least once when the loop condition is mispredicted, adding some 130 CPU cycles before a result is available.

I would suggest you run each version millions of times on random data; then you will see the difference.

Try a loop of a million iterations over an array that has been filled with data from rand(), as sketched below.

Your measured times are probably dominated by other overhead and have little to do with your code itself, which executes in microseconds.
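A minimal sketch of such a harness (my addition, not from either answer): it fills an array from rand() and times yeer's SWAR() from Code 1 using clock() from <time.h>:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 1000000

/* yeer's optimized version, Code 1 above */
static int SWAR(unsigned int i) {
    i = i - ((i >> 1) & 0x55555555);
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
    return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}

int main(void) {
    static unsigned int data[N];
    unsigned long long sum = 0;   /* consume the results so the loop is kept */
    int k;
    clock_t start, end;

    srand(42);                    /* fixed seed keeps runs comparable */
    for (k = 0; k < N; k++)
        data[k] = (unsigned int)rand();

    start = clock();
    for (k = 0; k < N; k++)
        sum += (unsigned int)SWAR(data[k]);
    end = clock();

    printf("checksum %llu, time %.3fs\n",
           sum, (double)(end - start) / CLOCKS_PER_SEC);
    return 0;
}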
