
Is there a more efficient way of expanding a char to an uint64_t?

I want to inflate an unsigned char to an uint64_t by repeating each bit 8 times, e.g.:

char -> uint64_t
0x00 -> 0x00
0x01 -> 0xFF
0x02 -> 0xFF00
0x03 -> 0xFFFF
0xAA -> 0xFF00FF00FF00FF00

I currently have the following implementation, using bit shifts to test if a bit is set, to accomplish this:

#include <stdint.h>
#include <inttypes.h>   

#define BIT_SET(var, pos) ((var) & (1 << (pos)))

static uint64_t inflate(unsigned char a)
{
    uint64_t MASK = 0xFF;
    uint64_t result = 0;
    for (int i = 0; i < 8; i++) {
        if (BIT_SET(a, i))
            result |= (MASK << (8 * i));    
    }

    return result;
} 

However, I'm fairly new to C, so this fiddling with individual bits makes me a little wary that there might be a better (i.e. more efficient) way of doing this.

EDIT TO ADD
OK, so after trying out the table lookup solution, here are the results. However, keep in mind that I didn't test the routine directly, but rather as part of a bigger function (a multiplication of binary matrices, to be precise), so this might have affected how the results turned out. So, on my computer, when multiplying a million 8x8 matrices, compiled with:

  gcc -O2 -Wall -std=c99 foo.c

I got:

./a.out original
real    0m0.127s
user    0m0.124s
sys     0m0.000s

./a.out table_lookup
real    0m0.012s
user    0m0.012s
sys     0m0.000s

So at least on my machine (a 64-bit Linux Mint virtual machine, I should mention), the table lookup approach seems to provide a roughly 10-times speed-up, so I will accept that as the answer.

If you're looking for efficiency, use a lookup table: a static array of 256 entries, each already holding the required result. You can use your code above to generate it.
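
For instance, a minimal sketch of that idea, using the question's inflate() routine as the generator (the names init_inflate_table and inflate_lut are just placeholders, not part of the original answer):

static uint64_t inflate_table[256];

/* fill the table once at startup, e.g. at the top of main() */
static void init_inflate_table(void)
{
    for (int i = 0; i < 256; i++)
        inflate_table[i] = inflate((unsigned char)i);   /* the original bit-testing routine */
}

static uint64_t inflate_lut(unsigned char a)
{
    return inflate_table[a];    /* one load instead of an 8-iteration loop */
}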

On selected architectures (SSE, NEON) there are fast vector operations that can speed up this task or are designed to do this. Without special instructions, the suggested lookup table approach is both the fastest and the most portable.
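
For illustration, a minimal SSE2 sketch of the vector idea (an assumption of this write-up, not code from the original answer; it requires a 64-bit x86 target with SSE2): broadcast the input byte, AND it with per-byte bit masks, and compare for equality so each byte becomes 0xFF or 0x00.

#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

static uint64_t inflate_sse2(unsigned char a)
{
    /* byte i of the mask vector holds the single bit 1 << i (upper 8 bytes unused) */
    const __m128i bits = _mm_set_epi8(0, 0, 0, 0, 0, 0, 0, 0,
                                      (char)0x80, 0x40, 0x20, 0x10,
                                      0x08, 0x04, 0x02, 0x01);
    __m128i v = _mm_set1_epi8((char)a);      /* broadcast the input byte to all lanes */
    v = _mm_and_si128(v, bits);              /* isolate bit i in byte i */
    v = _mm_cmpeq_epi8(v, bits);             /* 0xFF where the bit was set, 0x00 otherwise */
    return (uint64_t)_mm_cvtsi128_si64(v);   /* the low 8 bytes are the result */
}

Whether this beats the lookup table depends on the surrounding code; as noted above, the table tends to win as long as it stays in cache.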

If the 2 kB table size is an issue, parallel vector arithmetic can be simulated:

static uint64_t inflate_parallel(unsigned char a) {
  uint64_t vector = a * 0x0101010101010101ULL;
  // replicate the word all over qword
  // A5 becomes A5 A5 A5 A5 A5 A5 A5 A5
  vector &= 0x8040201008040201;  // becomes 80 00 20 00 00 04 00 01 <-- 
  vector += 0x00406070787c7e7f;  // becomes 80 40 80 70 78 80 7e 80
                                 // MSB is correct
  vector = (vector >> 7) & 0x0101010101010101ULL;  // LSB is correct
  return vector * 255;                             // all bits correct
}

EDIT: 2^31 iterations (unrolled four times to mitigate loop overhead):

time ./parallel            time ./original            time ./lookup
real        0m2.038s       real       0m14.161s       real      0m1.436s
user        0m2.030s       user       0m14.120s       user      0m1.430s
sys         0m0.000s       sys        0m0.000s        sys       0m0.000s

That's about a 7x speedup, while the lookup table gives a ~10x speedup.

You should profile what your code does before worrying about optimising it.

On my compiler locally, your code gets entirely inlined, unrolled, and turned into 8 constant test-plus-or instructions when the value is unknown, and into a single constant when the value is known at compile time. I could probably marginally improve it by removing a few branches, but the compiler is doing a reasonable job on its own.

Optimising the loop is then a bit pointless. A table lookup might be more efficient, but would probably prevent the compiler from making these optimisations itself.

If you're willing to spend 256 * 8 = 2 kB of memory on this (i.e. become less efficient in terms of memory, but more efficient in terms of CPU cycles needed), the most efficient way would be to pre-compute a lookup table:

static uint64_t inflate(unsigned char a) {
    static const uint64_t charToUInt64[256] = {
        0x0000000000000000, 0x00000000000000FF, 0x000000000000FF00, 0x000000000000FFFF,
        // ...
    };

    return charToUInt64[a];
}

The desired functionality can be achieved by moving each bit of the source into the lsb of the appropriate target byte (0 → 0, 1 → 8, 2 → 16, ..., 7 → 56), then expanding each lsb to cover the whole byte, which is easily done by multiplying by 0xff (255). Instead of moving the bits into place individually with shifts and then combining the results, we can use an integer multiply to shift multiple bits in parallel. To prevent self-overlap, we can move only the seven least-significant source bits in this fashion, but need to move the source msb separately with a shift.

This leads to the following ISO-C99 implementation:

#include <stdint.h>

/* expand each bit in input into one byte in output */
uint64_t fast_inflate (uint8_t a)
{
    const uint64_t spread7 = (1ULL << 42) | (1ULL << 35) | (1ULL << 28) | (1ULL << 21) | 
                             (1ULL << 14) | (1ULL <<  7) | (1UL <<   0);
    const uint64_t byte_lsb = (1ULL << 56) | (1ULL << 48) | (1ULL << 40) | (1ULL << 32) |
                              (1ULL << 24) | (1ULL << 16) | (1ULL <<  8) | (1ULL <<  0);
    uint64_t r;
    /* spread bits to lsbs of each byte */
    r = (((uint64_t)(a & 0x7f) * spread7) + ((uint64_t)a << 49));
    /* extract the lsbs of all bytes */
    r = r & byte_lsb;
    /* fill each byte with its lsb */
    r = r * 0xff;
    return r;
}

#define BIT_SET(var, pos) ((var) & (1 << (pos)))
static uint64_t inflate(unsigned char a)
{
    uint64_t MASK = 0xFF;
    uint64_t result = 0;
    for (int i = 0; i < 8; i++) {
        if (BIT_SET(a, i))
            result |= (MASK << (8 * i));    
    }
    return result;
}

#include <stdio.h>
#include <stdlib.h>

int main (void)
{
    uint8_t a = 0;
    do {
        uint64_t res = fast_inflate (a);
        uint64_t ref = inflate (a);
        if (res != ref) {
            printf ("error @ %02x: fast_inflate = %016llx  inflate = %016llx\n", 
                    a, res, ref);
            return EXIT_FAILURE;
        }
        a++;
    } while (a);
    printf ("test passed\n");
    return EXIT_SUCCESS;
}

Most x64 compilers will compile fast_inflate() in a straightforward manner. For example, my Intel compiler, version 13.1.3.198, when building with /Ox, generates the 11-instruction sequence below. Note that the final multiply by 0xff is actually implemented as a shift-and-subtract sequence (r * 255 == (r << 8) - r).

fast_inflate    PROC 
        mov       rdx, 040810204081H
        movzx     r9d, cl
        and       ecx, 127
        mov       r8, 0101010101010101H
        imul      rdx, rcx
        shl       r9, 49
        add       r9, rdx
        and       r9, r8
        mov       rax, r9
        shl       rax, 8
        sub       rax, r9
        ret

Here is one more method using only simple arithmetic:

uint64_t inflate_chqrlie(uint8_t value) {
    uint64_t x = value;
    /* the three shifts combine to offsets 0, 7, 14, ..., 49, so a copy of
       bit i lands at position 8*i; the mask keeps exactly those copies */
    x = (x | (x << 28));
    x = (x | (x << 14));
    x = (x | (x <<  7)) & 0x0101010101010101ULL;
    /* fill each byte from its lsb: x * 255 == (x << 8) - x */
    x = (x << 8) - x;
    return x;
}

Another very efficient and concise one, by phuclv, using multiplication and a mask:

static uint64_t inflate_phuclv(uint8_t b) {
    uint64_t MAGIC = 0x8040201008040201ULL;
    uint64_t MASK  = 0x8080808080808080ULL;
    return ((MAGIC * b) & MASK) >> 7;
}

And another with a small lookup table:

static uint32_t const lut_4_32[16] = {
    0x00000000, 0x000000FF, 0x0000FF00, 0x0000FFFF, 
    0x00FF0000, 0x00FF00FF, 0x00FFFF00, 0x00FFFFFF, 
    0xFF000000, 0xFF0000FF, 0xFF00FF00, 0xFF00FFFF, 
    0xFFFF0000, 0xFFFF00FF, 0xFFFFFF00, 0xFFFFFFFF, 
};

static uint64_t inflate_lut32(uint8_t b) {
    return lut_4_32[b & 15] | ((uint64_t)lut_4_32[b >> 4] << 32);
}

I wrote a benchmarking program to determine the relative performance of the different approaches on my system (x86_64-apple-darwin16.7.0, Apple LLVM version 9.0.0 (clang-900.0.39.2), clang -O3).

The results show that my function inflate_chqrlie is faster than the naive approaches but slower than other elaborate versions, all of which are beaten hands down by inflate_lut64, which uses a 2 kB lookup table, in cache-optimal situations.

The function inflate_lut32, using a much smaller lookup table (64 bytes instead of 2 kB), is not as fast as inflate_lut64, but seems a good compromise for 32-bit architectures as it is still much faster than all the other alternatives.

64-bit benchmark:

             inflate: 0, 848.316ms
        inflate_Curd: 0, 845.424ms
     inflate_chqrlie: 0, 371.502ms
 fast_inflate_njuffa: 0, 288.669ms
   inflate_parallel1: 0, 242.827ms
   inflate_parallel2: 0, 315.105ms
   inflate_parallel3: 0, 363.379ms
   inflate_parallel4: 0, 304.051ms
   inflate_parallel5: 0, 301.205ms
      inflate_phuclv: 0, 109.130ms
       inflate_lut32: 0, 197.178ms
       inflate_lut64: 0, 25.160ms

32-bit benchmark:

             inflate: 0, 1451.464ms
        inflate_Curd: 0, 955.509ms
     inflate_chqrlie: 0, 385.036ms
 fast_inflate_njuffa: 0, 463.212ms
   inflate_parallel1: 0, 468.070ms
   inflate_parallel2: 0, 570.107ms
   inflate_parallel3: 0, 511.741ms
   inflate_parallel4: 0, 601.892ms
   inflate_parallel5: 0, 506.695ms
      inflate_phuclv: 0, 192.431ms
       inflate_lut32: 0, 140.968ms
       inflate_lut64: 0, 28.776ms

Here is the code:

#include <stdio.h>
#include <stdint.h>
#include <time.h>

static uint64_t inflate(unsigned char a) {
#define BIT_SET(var, pos) ((var) & (1 << (pos)))
    uint64_t MASK = 0xFF;
    uint64_t result = 0;
    for (int i = 0; i < 8; i++) {
        if (BIT_SET(a, i))
            result |= (MASK << (8 * i));
    }

    return result;
}

static uint64_t inflate_Curd(unsigned char a) {
    uint64_t mask = 0xFF;
    uint64_t result = 0;
    for (int i = 0; i < 8; i++) {
        if (a & 1)
            result |= mask;
        mask <<= 8;
        a >>= 1;
    }
    return result;
}

uint64_t inflate_chqrlie(uint8_t value) {
    uint64_t x = value;
    x = (x | (x << 28));
    x = (x | (x << 14));
    x = (x | (x <<  7)) & 0x0101010101010101ULL;
    x = (x << 8) - x;
    return x;
}

uint64_t fast_inflate_njuffa(uint8_t a) {
    const uint64_t spread7 = (1ULL << 42) | (1ULL << 35) | (1ULL << 28) | (1ULL << 21) |
        (1ULL << 14) | (1ULL <<  7) | (1UL <<   0);
    const uint64_t byte_lsb = (1ULL << 56) | (1ULL << 48) | (1ULL << 40) | (1ULL << 32) |
        (1ULL << 24) | (1ULL << 16) | (1ULL <<  8) | (1ULL <<  0);
    uint64_t r;
    /* spread bits to lsbs of each byte */
    r = (((uint64_t)(a & 0x7f) * spread7) + ((uint64_t)a << 49));
    /* extract the lsbs of all bytes */
    r = r & byte_lsb;
    /* fill each byte with its lsb */
    r = r * 0xff;
    return r;
}

// Aki Suuihkonen: 1.265
static uint64_t inflate_parallel1(unsigned char a) {
    uint64_t vector = a * 0x0101010101010101ULL;
    // replicate the word all over qword
    // A5 becomes A5 A5 A5 A5 A5 A5 A5 A5
    vector &= 0x8040201008040201;  // becomes 80 00 20 00 00 04 00 01 <--
    vector += 0x00406070787c7e7f;  // becomes 80 40 80 70 78 80 7e 80
    // MSB is correct
    vector = (vector >> 7) & 0x0101010101010101ULL;  // LSB is correct
    return vector * 255;                             // all bits correct
}

// By seizet and then combine: 1.583
static uint64_t inflate_parallel2(unsigned char a) {
    uint64_t vector1 = a * 0x0002000800200080ULL;
    uint64_t vector2 = a * 0x0000040010004001ULL;
    uint64_t vector = (vector1 & 0x0100010001000100ULL) | (vector2 & 0x0001000100010001ULL);
    return vector * 255;
}

// Stay in 32 bits as much as possible: 1.006
static uint64_t inflate_parallel3(unsigned char a) {
    uint32_t vector1 = (( (a & 0x0F)       * 0x00204081) & 0x01010101) * 255;
    uint32_t vector2 = ((((a & 0xF0) >> 4) * 0x00204081) & 0x01010101) * 255;
    return (((uint64_t)vector2) << 32) | vector1;
}

// Do the common computation in 64 bits: 0.915
static uint64_t inflate_parallel4(unsigned char a) {
    uint32_t vector1 =  (a & 0x0F)       * 0x00204081;
    uint32_t vector2 = ((a & 0xF0) >> 4) * 0x00204081;
    uint64_t vector = (vector1 | (((uint64_t)vector2) << 32)) & 0x0101010101010101ULL;
    return vector * 255;
}

// Some computation is done in 64 bits a little sooner: 0.806
static uint64_t inflate_parallel5(unsigned char a) {
    uint32_t vector1 = (a & 0x0F) * 0x00204081;
    uint64_t vector2 = (a & 0xF0) * 0x002040810000000ULL;
    uint64_t vector = (vector1 | vector2) & 0x0101010101010101ULL;
    return vector * 255;
}

static uint64_t inflate_phuclv(uint8_t b) {
    uint64_t MAGIC = 0x8040201008040201ULL;
    uint64_t MASK  = 0x8080808080808080ULL;
    return ((MAGIC * b) & MASK) >> 7;
}

static uint32_t const lut_4_32[16] = {
    0x00000000, 0x000000FF, 0x0000FF00, 0x0000FFFF, 
    0x00FF0000, 0x00FF00FF, 0x00FFFF00, 0x00FFFFFF, 
    0xFF000000, 0xFF0000FF, 0xFF00FF00, 0xFF00FFFF, 
    0xFFFF0000, 0xFFFF00FF, 0xFFFFFF00, 0xFFFFFFFF, 
};

static uint64_t inflate_lut32(uint8_t b) {
    return lut_4_32[b & 15] | ((uint64_t)lut_4_32[b >> 4] << 32);
}

static uint64_t lut_8_64[256];

static uint64_t inflate_lut64(uint8_t b) {
    return lut_8_64[b];
}

#define ITER  1000000

int main() {
    clock_t t;
    uint64_t x;

    for (int b = 0; b < 256; b++)
        lut_8_64[b] = inflate((uint8_t)b);

#define TEST(func)  do {                                \
        t = clock();                                    \
        x = 0;                                          \
        for (int i = 0; i < ITER; i++) {                \
            for (int b = 0; b < 256; b++)               \
                x ^= func((uint8_t)b);                  \
        }                                               \
        t = clock() - t;                                \
        printf("%20s: %llu, %.3fms\n",                  \
               #func, x, t * 1000.0 / CLOCKS_PER_SEC);  \
       } while (0)

    TEST(inflate);
    TEST(inflate_Curd);
    TEST(inflate_chqrlie);
    TEST(fast_inflate_njuffa);
    TEST(inflate_parallel1);
    TEST(inflate_parallel2);
    TEST(inflate_parallel3);
    TEST(inflate_parallel4);
    TEST(inflate_parallel5);
    TEST(inflate_phuclv);
    TEST(inflate_lut32);
    TEST(inflate_lut64);

    return 0;
}

Two minor optimizations:
One for testing the bits in the input (a will be destroyed, but this doesn't matter).
The other for shifting the mask.

static uint64_t inflate(unsigned char a)
{
    uint64_t mask = 0xFF;
    uint64_t result = 0;
    for (int i = 0; i < 8; i++) {
        if (a & 1)
            result |= mask;
        mask <<= 8;    
        a >>= 1;
    }

    return result;
} 

Maybe you can also replace the for (int i = 0; i < 8; i++) loop with a while (a) loop. This works, however, only if the right shift a >>= 1 works unsigned (as far as I know, the C standard allows the compiler to do it signed or unsigned). Otherwise you will have an infinite loop in some cases.
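
A minimal sketch of that variant (assuming the same unsigned char parameter as above; the name inflate_while is just a placeholder):

static uint64_t inflate_while(unsigned char a)
{
    uint64_t mask = 0xFF;
    uint64_t result = 0;
    while (a) {                 /* stop as soon as no set bits remain */
        if (a & 1)
            result |= mask;
        mask <<= 8;
        a >>= 1;
    }
    return result;
}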

EDIT:
To see the result, I compiled both variants with gcc -std=c99 -S source.c. A quick glance at the resulting assembler output shows that the optimization shown above yields roughly one third fewer instructions, most of them inside the loop.

Variations on the same theme as @Aki's answer. Some of them are better here, but it may depend on your compiler and target machine (they should be more suitable for superscalar processors than Aki's function, even if they do more work, as there are fewer data dependencies):

// Aki Suuihkonen: 1.265
static uint64_t inflate_parallel1(unsigned char a) {
  uint64_t vector = a * 0x0101010101010101ULL;
  vector &= 0x8040201008040201;
  vector += 0x00406070787c7e7f;
  vector = (vector >> 7) & 0x0101010101010101ULL; 
  return vector * 255;
}

// By seizet and then combine: 1.583
static uint64_t inflate_parallel2(unsigned char a) {
    uint64_t vector1 = a * 0x0002000800200080ULL;
    uint64_t vector2 = a * 0x0000040010004001ULL;
    uint64_t vector = (vector1 & 0x0100010001000100ULL) | (vector2 & 0x0001000100010001ULL);
    return vector * 255;
}

// Stay in 32 bits as much as possible: 1.006
static uint64_t inflate_parallel3(unsigned char a) {
    uint32_t vector1 = (( (a & 0x0F)       * 0x00204081) & 0x01010101) * 255;
    uint32_t vector2 = ((((a & 0xF0) >> 4) * 0x00204081) & 0x01010101) * 255;
    return (((uint64_t)vector2) << 32) | vector1;
}

// Do the common computation in 64 bits: 0.915
static uint64_t inflate_parallel4(unsigned char a) {
    uint32_t vector1 =  (a & 0x0F)       * 0x00204081;
    uint32_t vector2 = ((a & 0xF0) >> 4) * 0x00204081;
    uint64_t vector = (vector1 | (((uint64_t)vector2) << 32)) & 0x0101010101010101ULL;
    return vector * 255;
}

// Some computation is done in 64 bits a little sooner: 0.806
static uint64_t inflate_parallel5(unsigned char a) {
    uint32_t vector1 = (a & 0x0F) * 0x00204081;
    uint64_t vector2 = (a & 0xF0) * 0x002040810000000ULL;
    uint64_t vector = (vector1 | vector2) & 0x0101010101010101ULL;
    return vector * 255;
}
