Is there a more efficient way of expanding a char to an uint64_t?

Question

I want to inflate an unsigned char to an uint64_t by repeating each bit 8 times. Eg

char -> uint64_t
0x00 -> 0x00
0x01 -> 0xFF
0x02 -> 0xFF00
0x03 -> 0xFFFF
0xAA -> 0xFF00FF00FF00FF00

I currently have the following implementation, using bit shifts to test if a bit is set, to accomplish this:

#include <stdint.h>
#include <inttypes.h>   

#define BIT_SET(var, pos) ((var) & (1 << (pos)))

static uint64_t inflate(unsigned char a)
{
    uint64_t MASK = 0xFF;
    uint64_t result = 0;
    for (int i = 0; i < 8; i++) {
        if (BIT_SET(a, i))
            result |= (MASK << (8 * i));    
    }

    return result;
}

However, I'm fairly new to C, so this fiddling with individual bits makes me a little vary that there might be a better (ie more efficient) way of doing this.

EDIT TO ADD
Ok, so after trying out the table lookup solution, here are the results. However, keep in mind that I didn't test the routine directly, but rather as part of bigger function (a multiplication of binary matrices to be precise), so this might have affected how the results turned out. So, on my computer, when multiplying a million 8x8 matrices, and compiled with:

  gcc -O2 -Wall -std=c99 foo.c

I got

./a.out original
real    0m0.127s
user    0m0.124s
sys     0m0.000s

./a.out table_lookup
real    0m0.012s
user    0m0.012s
sys     0m0.000s

So at least on my machine (a virtual machine 64 bit Linux Mint I should mention), the table lookup approach seems to provide a roughly 10-times speed-up, so I will accept that as the answer.

Answer 1

If you're looking for efficiency use a lookup table: a static array of 256 entries, each already holding the required result. You can use your code above to generate it.

Answer 2

In selected architectures (SSE,Neon) there are fast vector operations that can speed up this task or are designed to do this. Without special instructions the suggested look up table approach is both the fastest and most portable.

If the 2k size is an issue, parallel vector arithmetic operations can be simulated:

static uint64_t inflate_parallel(unsigned char a) {
  uint64_t vector = a * 0x0101010101010101ULL;
  // replicate the word all over qword
  // A5 becomes A5 A5 A5 A5 A5 A5 A5 A5
  vector &= 0x8040201008040201;  // becomes 80 00 20 00 00 04 00 01 <-- 
  vector += 0x00406070787c7e7f;  // becomes 80 40 80 70 78 80 7e 80
                                 // MSB is correct
  vector = (vector >> 7) & 0x0101010101010101ULL;  // LSB is correct
  return vector * 255;                             // all bits correct
}

EDIT : 2^31 iterations, (four time unroll to mitigate loop evaluation)

time ./parallel            time ./original            time ./lookup
real        0m2.038s       real       0m14.161s       real      0m1.436s
user        0m2.030s       user       0m14.120s       user      0m1.430s
sys         0m0.000s       sys        0m0.000s        sys       0m0.000s

That's about 7x speedup, while the lookup table gives ~10x speedup

Answer 3

You should profile what your code does, before worrying about optimising it.

On my compiler locally, your code gets entirely inlined, unrolled and turned into 8 constant test + or instructions when the value is unknown, and turned into a constant when the value is known at compile time. I could probably marginally improve it by removing a few branches, but the compiler is doing a reasonable job on its own.

Optimising the loop is then a bit pointless. A table lookup might be more efficient, but would probably prevent the compiler from making optimisations itself.

Answer 4

If you're willing to spend 256 * 8 = 2kB of memory on this (ie become less efficient in terms of memory, but more efficient in terms of CPU cycles needed), the most efficient way would be to pre-compute a lookup table:

static uint64_t inflate(unsigned char a) {
    static const uint64_t charToUInt64[256] = {
        0x0000000000000000, 0x00000000000000FF, 0x000000000000FF00, 0x000000000000FFFF,
        // ...
    };

    return charToUInt64[a];
}

Answer 5

The desired functionality can be achieved by moving each bit of the source into the lsb of the appropriate target byte (0 → 0, 1 → 8, 2 → 16, ...., 7 → 56), then expanding each lsb to cover the whole byte, which is easily done by multiplying with 0xff (255). Instead of moving bits into place individually using shifts, then combining the results, we can use an integer multiply to shift multiple bits in parallel. To prevent self-overlap, we can move only the least-significant seven source bits in this fashion, but need to move the source msb separately with a shift.

This leads to the following ISO-C99 implementation:

#include <stdint.h>

/* expand each bit in input into one byte in output */
uint64_t fast_inflate (uint8_t a)
{
    const uint64_t spread7 = (1ULL << 42) | (1ULL << 35) | (1ULL << 28) | (1ULL << 21) | 
                             (1ULL << 14) | (1ULL <<  7) | (1UL <<   0);
    const uint64_t byte_lsb = (1ULL << 56) | (1ULL << 48) | (1ULL << 40) | (1ULL << 32) |
                              (1ULL << 24) | (1ULL << 16) | (1ULL <<  8) | (1ULL <<  0);
    uint64_t r;
    /* spread bits to lsbs of each byte */
    r = (((uint64_t)(a & 0x7f) * spread7) + ((uint64_t)a << 49));
    /* extract the lsbs of all bytes */
    r = r & byte_lsb;
    /* fill each byte with its lsb */
    r = r * 0xff;
    return r;
}

#define BIT_SET(var, pos) ((var) & (1 << (pos)))
static uint64_t inflate(unsigned char a)
{
    uint64_t MASK = 0xFF;
    uint64_t result = 0;
    for (int i = 0; i < 8; i++) {
        if (BIT_SET(a, i))
            result |= (MASK << (8 * i));    
    }
    return result;
}

#include <stdio.h>
#include <stdlib.h>

int main (void)
{
    uint8_t a = 0;
    do {
        uint64_t res = fast_inflate (a);
        uint64_t ref = inflate (a);
        if (res != ref) {
            printf ("error @ %02x: fast_inflate = %016llx  inflate = %016llx\n", 
                    a, res, ref);
            return EXIT_FAILURE;
        }
        a++;
    } while (a);
    printf ("test passed\n");
    return EXIT_SUCCESS;
}

Most x64 compilers will compile fast_inflate() in straightforward manner. For example, my Intel compiler Version 13.1.3.198, when building with /Ox , generates the 11-instruction sequence below. Note that the final multiply with 0xff is actually implemented as a shift and subtract sequence.

fast_inflate    PROC 
        mov       rdx, 040810204081H
        movzx     r9d, cl
        and       ecx, 127
        mov       r8, 0101010101010101H
        imul      rdx, rcx
        shl       r9, 49
        add       r9, rdx
        and       r9, r8
        mov       rax, r9
        shl       rax, 8
        sub       rax, r9
        ret

Answer 6

Here is one more method using only simple arithmetics:

uint64_t inflate_chqrlie(uint8_t value) {
    uint64_t x = value;
    x = (x | (x << 28));
    x = (x | (x << 14));
    x = (x | (x <<  7)) & 0x0101010101010101ULL;
    x = (x << 8) - x;
    return x;
}

Another very efficient and concise one by phuclv using multiplication and mask:

static uint64_t inflate_phuclv(uint8_t b) {
    uint64_t MAGIC = 0x8040201008040201ULL;
    uint64_t MASK  = 0x8080808080808080ULL;
    return ((MAGIC * b) & MASK) >> 7;
}

And another with a small lookup table:

static uint32_t const lut_4_32[16] = {
    0x00000000, 0x000000FF, 0x0000FF00, 0x0000FFFF, 
    0x00FF0000, 0x00FF00FF, 0x00FFFF00, 0x00FFFFFF, 
    0xFF000000, 0xFF0000FF, 0xFF00FF00, 0xFF00FFFF, 
    0xFFFF0000, 0xFFFF00FF, 0xFFFFFF00, 0xFFFFFFFF, 
};

static uint64_t inflate_lut32(uint8_t b) {
    return lut_4_32[b & 15] | ((uint64_t)lut_4_32[b >> 4] << 32);
}

I wrote a benchmarking program to determine relative performance of the different approaches on my system (x86_64-apple-darwin16.7.0, Apple LLVM version 9.0.0 (clang-900.0.39.2, clang -O3).

The results show that my function inflate_chqrlie is faster than naive approaches but slower than other elaborate versions, all of which are beaten hands down by inflate_lut64 using a 2KB the lookup table in cache optimal situations.

The function inflate_lut32 , using a much smaller lookup table (64 bytes instead of 2KB) is not as fast as inflate_lut64 , but seems a good compromise for 32-bit architectures as it is still much faster than all other alternatives.

64-bit benchmark:

             inflate: 0, 848.316ms
        inflate_Curd: 0, 845.424ms
     inflate_chqrlie: 0, 371.502ms
 fast_inflate_njuffa: 0, 288.669ms
   inflate_parallel1: 0, 242.827ms
   inflate_parallel2: 0, 315.105ms
   inflate_parallel3: 0, 363.379ms
   inflate_parallel4: 0, 304.051ms
   inflate_parallel5: 0, 301.205ms
      inflate_phuclv: 0, 109.130ms
       inflate_lut32: 0, 197.178ms
       inflate_lut64: 0, 25.160ms

32-bit benchmark:

             inflate: 0, 1451.464ms
        inflate_Curd: 0, 955.509ms
     inflate_chqrlie: 0, 385.036ms
 fast_inflate_njuffa: 0, 463.212ms
   inflate_parallel1: 0, 468.070ms
   inflate_parallel2: 0, 570.107ms
   inflate_parallel3: 0, 511.741ms
   inflate_parallel4: 0, 601.892ms
   inflate_parallel5: 0, 506.695ms
      inflate_phuclv: 0, 192.431ms
       inflate_lut32: 0, 140.968ms
       inflate_lut64: 0, 28.776ms

Here is the code:

#include <stdio.h>
#include <stdint.h>
#include <time.h>

static uint64_t inflate(unsigned char a) {
#define BIT_SET(var, pos) ((var) & (1 << (pos)))
    uint64_t MASK = 0xFF;
    uint64_t result = 0;
    for (int i = 0; i < 8; i++) {
        if (BIT_SET(a, i))
            result |= (MASK << (8 * i));
    }

    return result;
}

static uint64_t inflate_Curd(unsigned char a) {
    uint64_t mask = 0xFF;
    uint64_t result = 0;
    for (int i = 0; i < 8; i++) {
        if (a & 1)
            result |= mask;
        mask <<= 8;
        a >>= 1;
    }
    return result;
}

uint64_t inflate_chqrlie(uint8_t value) {
    uint64_t x = value;
    x = (x | (x << 28));
    x = (x | (x << 14));
    x = (x | (x <<  7)) & 0x0101010101010101ULL;
    x = (x << 8) - x;
    return x;
}

uint64_t fast_inflate_njuffa(uint8_t a) {
    const uint64_t spread7 = (1ULL << 42) | (1ULL << 35) | (1ULL << 28) | (1ULL << 21) |
        (1ULL << 14) | (1ULL <<  7) | (1UL <<   0);
    const uint64_t byte_lsb = (1ULL << 56) | (1ULL << 48) | (1ULL << 40) | (1ULL << 32) |
        (1ULL << 24) | (1ULL << 16) | (1ULL <<  8) | (1ULL <<  0);
    uint64_t r;
    /* spread bits to lsbs of each byte */
    r = (((uint64_t)(a & 0x7f) * spread7) + ((uint64_t)a << 49));
    /* extract the lsbs of all bytes */
    r = r & byte_lsb;
    /* fill each byte with its lsb */
    r = r * 0xff;
    return r;
}

// Aki Suuihkonen: 1.265
static uint64_t inflate_parallel1(unsigned char a) {
    uint64_t vector = a * 0x0101010101010101ULL;
    // replicate the word all over qword
    // A5 becomes A5 A5 A5 A5 A5 A5 A5 A5
    vector &= 0x8040201008040201;  // becomes 80 00 20 00 00 04 00 01 <--
    vector += 0x00406070787c7e7f;  // becomes 80 40 80 70 78 80 7e 80
    // MSB is correct
    vector = (vector >> 7) & 0x0101010101010101ULL;  // LSB is correct
    return vector * 255;                             // all bits correct
}

// By seizet and then combine: 1.583
static uint64_t inflate_parallel2(unsigned char a) {
    uint64_t vector1 = a * 0x0002000800200080ULL;
    uint64_t vector2 = a * 0x0000040010004001ULL;
    uint64_t vector = (vector1 & 0x0100010001000100ULL) | (vector2 & 0x0001000100010001ULL);
    return vector * 255;
}

// Stay in 32 bits as much as possible: 1.006
static uint64_t inflate_parallel3(unsigned char a) {
    uint32_t vector1 = (( (a & 0x0F)       * 0x00204081) & 0x01010101) * 255;
    uint32_t vector2 = ((((a & 0xF0) >> 4) * 0x00204081) & 0x01010101) * 255;
    return (((uint64_t)vector2) << 32) | vector1;
}

// Do the common computation in 64 bits: 0.915
static uint64_t inflate_parallel4(unsigned char a) {
    uint32_t vector1 =  (a & 0x0F)       * 0x00204081;
    uint32_t vector2 = ((a & 0xF0) >> 4) * 0x00204081;
    uint64_t vector = (vector1 | (((uint64_t)vector2) << 32)) & 0x0101010101010101ULL;
    return vector * 255;
}

// Some computation is done in 64 bits a little sooner: 0.806
static uint64_t inflate_parallel5(unsigned char a) {
    uint32_t vector1 = (a & 0x0F) * 0x00204081;
    uint64_t vector2 = (a & 0xF0) * 0x002040810000000ULL;
    uint64_t vector = (vector1 | vector2) & 0x0101010101010101ULL;
    return vector * 255;
}

static uint64_t inflate_phuclv(uint8_t b) {
    uint64_t MAGIC = 0x8040201008040201ULL;
    uint64_t MASK  = 0x8080808080808080ULL;
    return ((MAGIC * b) & MASK) >> 7;
}

static uint32_t const lut_4_32[16] = {
    0x00000000, 0x000000FF, 0x0000FF00, 0x0000FFFF, 
    0x00FF0000, 0x00FF00FF, 0x00FFFF00, 0x00FFFFFF, 
    0xFF000000, 0xFF0000FF, 0xFF00FF00, 0xFF00FFFF, 
    0xFFFF0000, 0xFFFF00FF, 0xFFFFFF00, 0xFFFFFFFF, 
};

static uint64_t inflate_lut32(uint8_t b) {
    return lut_4_32[b & 15] | ((uint64_t)lut_4_32[b >> 4] << 32);
}

static uint64_t lut_8_64[256];

static uint64_t inflate_lut64(uint8_t b) {
    return lut_8_64[b];
}

#define ITER  1000000

int main() {
    clock_t t;
    uint64_t x;

    for (int b = 0; b < 256; b++)
        lut_8_64[b] = inflate((uint8_t)b);

#define TEST(func)  do {                                \
        t = clock();                                    \
        x = 0;                                          \
        for (int i = 0; i < ITER; i++) {                \
            for (int b = 0; b < 256; b++)               \
                x ^= func((uint8_t)b);                  \
        }                                               \
        t = clock() - t;                                \
        printf("%20s: %llu, %.3fms\n",                  \
               #func, x, t * 1000.0 / CLOCKS_PER_SEC);  \
       } while (0)

    TEST(inflate);
    TEST(inflate_Curd);
    TEST(inflate_chqrlie);
    TEST(fast_inflate_njuffa);
    TEST(inflate_parallel1);
    TEST(inflate_parallel2);
    TEST(inflate_parallel3);
    TEST(inflate_parallel4);
    TEST(inflate_parallel5);
    TEST(inflate_phuclv);
    TEST(inflate_lut32);
    TEST(inflate_lut64);

    return 0;
}

Answer 7

Two minor optimizations:
One for testing the bits in the input (a will be destroyed but this doesn't matter)
The other for shifting the mask.

static uint64_t inflate(unsigned char a)
{
    uint64_t mask = 0xFF;
    uint64_t result = 0;
    for (int i = 0; i < 8; i++) {
        if (a & 1)
            result |= mask;
        mask <<= 8;    
        a >>= 1;
    }

    return result;
}

Maybe you can also replace the 'for (int i = 0; i < 8; i++)'-loop by a 'while (a)'-loop. This works, however, only if the right shift a >>=1 works unsigned (As much as I know C standard allows the compiler to do it signed or unsigned). Otherwise you will have an infinite loop in some cases.

EDIT:
To see the result I compiled both variants with gcc -std=c99 -S source.c . A quick glance at the resulting assembler outputs shows that the optimization shown above yields ca. 1/3 viewer instructions, most of them inside the loop.

Answer 8

Variations on the same theme as @Aki answer. Some of them are better here, but it may depend on your compiler and target machines (they should be more suitable for superscalar processor that Aki's function even if they do more work as there is less data dependencies)

// Aki Suuihkonen: 1.265
static uint64_t inflate_parallel1(unsigned char a) {
  uint64_t vector = a * 0x0101010101010101ULL;
  vector &= 0x8040201008040201;
  vector += 0x00406070787c7e7f;
  vector = (vector >> 7) & 0x0101010101010101ULL; 
  return vector * 255;
}

// By seizet and then combine: 1.583
static uint64_t inflate_parallel2(unsigned char a) {
    uint64_t vector1 = a * 0x0002000800200080ULL;
    uint64_t vector2 = a * 0x0000040010004001ULL;
    uint64_t vector = (vector1 & 0x0100010001000100ULL) | (vector2 & 0x0001000100010001ULL);
    return vector * 255;
}

// Stay in 32 bits as much as possible: 1.006
static uint64_t inflate_parallel3(unsigned char a) {
    uint32_t vector1 = (( (a & 0x0F)       * 0x00204081) & 0x01010101) * 255;
    uint32_t vector2 = ((((a & 0xF0) >> 4) * 0x00204081) & 0x01010101) * 255;
    return (((uint64_t)vector2) << 32) | vector1;
}

// Do the common computation in 64 bits: 0.915
static uint64_t inflate_parallel4(unsigned char a) {
    uint32_t vector1 =  (a & 0x0F)       * 0x00204081;
    uint32_t vector2 = ((a & 0xF0) >> 4) * 0x00204081;
    uint64_t vector = (vector1 | (((uint64_t)vector2) << 32)) & 0x0101010101010101ULL;
    return vector * 255;
}

// Some computation is done in 64 bits a little sooner: 0.806
static uint64_t inflate_parallel5(unsigned char a) {
    uint32_t vector1 = (a & 0x0F) * 0x00204081;
    uint64_t vector2 = (a & 0xF0) * 0x002040810000000ULL;
    uint64_t vector = (vector1 | vector2) & 0x0101010101010101ULL;
    return vector * 255;
}

Is there a more efficient way of expanding a char to an uint64_t?

Question

8 answers

solution1
7 ACCPTED 2013-01-11 09:53:12

solution2
5 2013-01-11 10:39:16

solution3
3 2013-01-11 10:03:25

solution4
2 2013-01-11 09:59:02

solution5
2 2019-03-08 07:38:43

solution6
1 2019-03-08 09:09:54

solution7
0 2013-01-11 09:57:30

solution8
0 2013-01-11 12:51:42

Is there a more efficient way of expanding a char to an uint64_t?

Question

8 answers

solution1 7 ACCPTED 2013-01-11 09:53:12

solution2 5 2013-01-11 10:39:16

solution3 3 2013-01-11 10:03:25

solution4 2 2013-01-11 09:59:02

solution5 2 2019-03-08 07:38:43

solution6 1 2019-03-08 09:09:54

solution7 0 2013-01-11 09:57:30

solution8 0 2013-01-11 12:51:42

solution1
7 ACCPTED 2013-01-11 09:53:12

solution2
5 2013-01-11 10:39:16

solution3
3 2013-01-11 10:03:25

solution4
2 2013-01-11 09:59:02

solution5
2 2019-03-08 07:38:43

solution6
1 2019-03-08 09:09:54

solution7
0 2013-01-11 09:57:30

solution8
0 2013-01-11 12:51:42