Bit hack vs conditional statement inside loop

I have a CRC calculation function that has the following in its inner loop:

if (uMsgByte & 0x80) crc ^= *pChkTableOffset; pChkTableOffset++;
if (uMsgByte & 0x40) crc ^= *pChkTableOffset; pChkTableOffset++;
if (uMsgByte & 0x20) crc ^= *pChkTableOffset; pChkTableOffset++;
if (uMsgByte & 0x10) crc ^= *pChkTableOffset; pChkTableOffset++;
if (uMsgByte & 0x08) crc ^= *pChkTableOffset; pChkTableOffset++;
if (uMsgByte & 0x04) crc ^= *pChkTableOffset; pChkTableOffset++;
if (uMsgByte & 0x02) crc ^= *pChkTableOffset; pChkTableOffset++;
if (uMsgByte & 0x01) crc ^= *pChkTableOffset; pChkTableOffset++;

Profiling has revealed that a lot of time is spent on these statements, and I was wondering if I could get some gain by replacing the conditionals with 'bit hacks'. I tried the following, but got no speed improvement:

crc ^= *pChkTableOffset++ & (!(uMsgByte & 0x80) - 1);
crc ^= *pChkTableOffset++ & (!(uMsgByte & 0x40) - 1);
crc ^= *pChkTableOffset++ & (!(uMsgByte & 0x20) - 1);
crc ^= *pChkTableOffset++ & (!(uMsgByte & 0x10) - 1);
crc ^= *pChkTableOffset++ & (!(uMsgByte & 0x08) - 1);
crc ^= *pChkTableOffset++ & (!(uMsgByte & 0x04) - 1);
crc ^= *pChkTableOffset++ & (!(uMsgByte & 0x02) - 1);
crc ^= *pChkTableOffset++ & (!(uMsgByte & 0x01) - 1);

Should this be faster on a recent x86 CPU, or is there a better way to implement these 'bit hacks'?

I can't say for sure which is faster, but they are definitely different. Which one wins depends a lot on exactly which processor make and model is being used, since processors behave differently on [presumably unpredictable] branches. To complicate things further, different processors also behave differently for "dependent calculations".

I converted the posted code into this (which makes the generated code about half as long, but otherwise identical at a conceptual level):

// Unsigned parameter types match the mangled names (_Z5func1jPh) and the
// zero-extending loads (movzbl) in the listing below.
int func1(unsigned int uMsgByte, unsigned char* pChkTableOffset)
{
    int crc = 0;
    if (uMsgByte & 0x80) crc ^= *pChkTableOffset; pChkTableOffset++;
    if (uMsgByte & 0x40) crc ^= *pChkTableOffset; pChkTableOffset++;
    if (uMsgByte & 0x20) crc ^= *pChkTableOffset; pChkTableOffset++;
    if (uMsgByte & 0x10) crc ^= *pChkTableOffset; pChkTableOffset++;

    return crc;
}

int func2(unsigned int uMsgByte, unsigned char* pChkTableOffset)
{
    int crc = 0;

    crc ^= *pChkTableOffset++ & (!(uMsgByte & 0x80) - 1);
    crc ^= *pChkTableOffset++ & (!(uMsgByte & 0x40) - 1);
    crc ^= *pChkTableOffset++ & (!(uMsgByte & 0x20) - 1);
    crc ^= *pChkTableOffset++ & (!(uMsgByte & 0x10) - 1);

    return crc;
}

And compiled with clang++ -S -O2:

func1:

_Z5func1jPh:                            # @_Z5func1jPh
        xorl    %eax, %eax
        testb   %dil, %dil
        jns     .LBB0_2
        movzbl  (%rsi), %eax
.LBB0_2:                                # %if.end
        testb   $64, %dil
        je      .LBB0_4
        movzbl  1(%rsi), %ecx
        xorl    %ecx, %eax
.LBB0_4:                                # %if.end.6
        testb   $32, %dil
        je      .LBB0_6
        movzbl  2(%rsi), %ecx
        xorl    %ecx, %eax
.LBB0_6:                                # %if.end.13
        testb   $16, %dil
        je      .LBB0_8
        movzbl  3(%rsi), %ecx
        xorl    %ecx, %eax
.LBB0_8:                                # %if.end.20

func2:

_Z5func2jPh:                            # @_Z5func2jPh
        movzbl  (%rsi), %eax
        movl    %edi, %ecx
        shll    $24, %ecx
        sarl    $31, %ecx
        andl    %eax, %ecx
        movzbl  1(%rsi), %eax
        movl    %edi, %edx
        shll    $25, %edx
        sarl    $31, %edx
        andl    %edx, %eax
        xorl    %ecx, %eax
        movzbl  2(%rsi), %ecx
        movl    %edi, %edx
        shll    $26, %edx
        sarl    $31, %edx
        andl    %ecx, %edx
        movzbl  3(%rsi), %ecx
        shll    $27, %edi
        sarl    $31, %edi
        andl    %ecx, %edi
        xorl    %edx, %edi
        xorl    %edi, %eax

As you can see, the compiler generates branches for the first version and uses logical operations for the second version, with a few more instructions per case.

I could write some code to benchmark each version, but I guarantee that the result will vary greatly between different versions of x86 processors.
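For anyone who wants to measure on their own machine, a minimal timing harness might look like the sketch below. All names in it (step_branch, step_mask, run, random_message, Timing) are hypothetical and not from the original code, and the mask is written with a negation rather than the `!(...) - 1` form, which should be equivalent.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Branching variant, in the spirit of the question's first snippet.
inline std::uint32_t step_branch(std::uint8_t b, std::uint32_t crc, const std::uint32_t* t)
{
    for (int i = 0; i < 8; ++i)
        if (b & (0x80u >> i)) crc ^= t[i];
    return crc;
}

// Branchless variant: the mask is all ones when the bit is set, else zero.
inline std::uint32_t step_mask(std::uint8_t b, std::uint32_t crc, const std::uint32_t* t)
{
    for (int i = 0; i < 8; ++i)
        crc ^= t[i] & (0u - ((b >> (7 - i)) & 1u));
    return crc;
}

struct Timing {
    std::uint32_t result;   // returned so the loop cannot be optimised away
    double ns_per_byte;
};

// Run one pass over the message with the given step function and time it.
template <typename Step>
Timing run(Step step, const std::vector<std::uint8_t>& msg, const std::uint32_t* table)
{
    assert(!msg.empty());
    const auto start = std::chrono::steady_clock::now();
    std::uint32_t crc = 0;
    for (std::uint8_t b : msg) crc = step(b, crc, table);
    const auto stop = std::chrono::steady_clock::now();
    const double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    return { crc, ns / static_cast<double>(msg.size()) };
}

// Pseudo-random bytes make the branches hard to predict.
inline std::vector<std::uint8_t> random_message(std::size_t n, unsigned seed = 42)
{
    std::mt19937 gen(seed);
    std::vector<std::uint8_t> msg(n);
    for (auto& b : msg) b = static_cast<std::uint8_t>(gen() & 0xFFu);
    return msg;
}
```

Calling run(step_branch, msg, table) and run(step_mask, msg, table) on the same message and comparing ns_per_byte gives a rough indication only; a single pass like this is sensitive to cache state, frequency scaling and compiler flags.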

I'm not sure if this is a common CRC calculation, but most CRC calculations have optimised versions that compute the same result faster than this, using tables and other "clever stuff".

Interested to see if a human could beat an optimising compiler, I wrote your algorithm in two ways:

Here you express intent as if you were writing machine code:

std::uint32_t foo1(std::uint8_t uMsgByte,
                   std::uint32_t crc,
                   const std::uint32_t* pChkTableOffset)
{
    if (uMsgByte & 0x80) crc ^= *pChkTableOffset; pChkTableOffset++;
    if (uMsgByte & 0x40) crc ^= *pChkTableOffset; pChkTableOffset++;
    if (uMsgByte & 0x20) crc ^= *pChkTableOffset; pChkTableOffset++;
    if (uMsgByte & 0x10) crc ^= *pChkTableOffset; pChkTableOffset++;
    if (uMsgByte & 0x08) crc ^= *pChkTableOffset; pChkTableOffset++;
    if (uMsgByte & 0x04) crc ^= *pChkTableOffset; pChkTableOffset++;
    if (uMsgByte & 0x02) crc ^= *pChkTableOffset; pChkTableOffset++;
    if (uMsgByte & 0x01) crc ^= *pChkTableOffset; pChkTableOffset++;

    return crc;
}

Here I express intent in a more algorithmic way...

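std::uint32_t foo2(std::uint8_t uMsgByte,
                   std::uint32_t crc,
                   const std::uint32_t* pChkTableOffset)
{
    // i < 8 so that all eight bits are tested; the listing further down appears
    // to have been generated with a bound of 7, so its foo2 stops after the 0x02 test.
    for (int i = 0 ; i < 8 ; ++i) {
        if (uMsgByte & (0x01 << (7-i)))
            crc ^= pChkTableOffset[i];
    }

    return crc;
}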

Then I compiled using g++ -O3 and the result was...

essentially the same object code in both functions: the same unrolled test/branch/xor sequence

Moral of the story: select the correct algorithm, avoid repetition, write elegant code and let the optimiser do its thing.

Here's the proof:

__Z4foo1hjPKj:                          ## @_Z4foo1hjPKj
## BB#0:
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset %rbp, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register %rbp
    testb   $-128, %dil
    je  LBB0_2
## BB#1:
    xorl    (%rdx), %esi
    testb   $64, %dil
    je  LBB0_4
## BB#3:
    xorl    4(%rdx), %esi
    testb   $32, %dil
    je  LBB0_6
## BB#5:
    xorl    8(%rdx), %esi
    testb   $16, %dil
    je  LBB0_8
## BB#7:
    xorl    12(%rdx), %esi
    testb   $8, %dil
    je  LBB0_10
## BB#9:
    xorl    16(%rdx), %esi
    testb   $4, %dil
    je  LBB0_12
## BB#11:
    xorl    20(%rdx), %esi
    testb   $2, %dil
    je  LBB0_14
## BB#13:
    xorl    24(%rdx), %esi
    testb   $1, %dil
    je  LBB0_16
## BB#15:
    xorl    28(%rdx), %esi
    movl    %esi, %eax
    popq    %rbp

    .globl  __Z4foo2hjPKj
    .align  4, 0x90
__Z4foo2hjPKj:                          ## @_Z4foo2hjPKj
## BB#0:
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset %rbp, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register %rbp
    testb   $-128, %dil
    je  LBB1_2
## BB#1:
    xorl    (%rdx), %esi
    testb   $64, %dil
    je  LBB1_4
## BB#3:
    xorl    4(%rdx), %esi
    testb   $32, %dil
    je  LBB1_6
## BB#5:
    xorl    8(%rdx), %esi
    testb   $16, %dil
    je  LBB1_8
## BB#7:
    xorl    12(%rdx), %esi
    testb   $8, %dil
    je  LBB1_10
## BB#9:
    xorl    16(%rdx), %esi
    testb   $4, %dil
    je  LBB1_12
## BB#11:
    xorl    20(%rdx), %esi
    testb   $2, %dil
    je  LBB1_14
## BB#13:
    xorl    24(%rdx), %esi
    movl    %esi, %eax
    popq    %rbp

It would be interesting to see whether the compiler performs as well with the version of the code that uses logical operations rather than conditional statements.

Given:

std::uint32_t logical1(std::uint8_t uMsgByte,
                       std::uint32_t crc,
                       const std::uint32_t* pChkTableOffset)
{
    crc ^= *pChkTableOffset++ & (!(uMsgByte & 0x80) - 1);
    crc ^= *pChkTableOffset++ & (!(uMsgByte & 0x40) - 1);
    crc ^= *pChkTableOffset++ & (!(uMsgByte & 0x20) - 1);
    crc ^= *pChkTableOffset++ & (!(uMsgByte & 0x10) - 1);
    crc ^= *pChkTableOffset++ & (!(uMsgByte & 0x8) - 1);
    crc ^= *pChkTableOffset++ & (!(uMsgByte & 0x4) - 1);
    crc ^= *pChkTableOffset++ & (!(uMsgByte & 0x2) - 1);
    crc ^= *pChkTableOffset++ & (!(uMsgByte & 0x1) - 1);

    return crc;
}

the resulting machine code is:

8 lots of:

    movl    %edi, %eax     ; get uMsgByte into eax
    shll    $24, %eax      ; shift it left 24 bits so that bit 7 is in the sign bit
    sarl    $31, %eax      ; arithmetic shift right to copy the sign bit into all other bits
    andl    (%rdx), %eax   ; and the result with the value from the table
    xorl    %esi, %eax     ; exclusive-or into crc

So the short answer is yes: it performs very well (eliding the redundant increments of pChkTableOffset).
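To see why the shift pair in that listing is the same thing as `!(uMsgByte & 0x80) - 1`, the mask can be written out in a few equivalent ways and compared exhaustively. This is only an illustrative sketch; the function names are made up for the example.

```cpp
#include <cassert>
#include <cstdint>

// The form used in the question: !(...) is 1 or 0, so subtracting 1 gives
// 0 (bit clear) or -1, i.e. all ones (bit set).
constexpr std::uint32_t mask_not(std::uint8_t byte, std::uint8_t bit)
{
    return static_cast<std::uint32_t>(!(byte & bit) - 1);
}

// Negating a 0/1 value gives the same mask using only unsigned arithmetic.
constexpr std::uint32_t mask_neg(std::uint8_t byte, std::uint8_t bit)
{
    return 0u - static_cast<std::uint32_t>((byte & bit) != 0);
}

// What the listing does for bit 7: move it into the sign position (shll $24),
// then smear it across the word with an arithmetic right shift (sarl $31).
// The unsigned-to-signed conversion and the right shift of a negative value
// are implementation-defined before C++20; mainstream compilers behave as
// shown here.
inline std::uint32_t mask_shift_bit7(std::uint8_t byte)
{
    const auto moved = static_cast<std::int32_t>(static_cast<std::uint32_t>(byte) << 24);
    return static_cast<std::uint32_t>(moved >> 31);
}

// Exhaustively check that the three forms agree for every byte value.
inline void verify_masks()
{
    for (int v = 0; v < 256; ++v) {
        const auto b = static_cast<std::uint8_t>(v);
        assert(mask_shift_bit7(b) == mask_not(b, 0x80));
        for (int k = 0; k < 8; ++k) {
            const auto bit = static_cast<std::uint8_t>(1u << k);
            assert(mask_not(b, bit) == mask_neg(b, bit));
        }
    }
}
```

Whichever spelling is used, an optimising compiler is free to pick the instruction sequence it prefers, so the source-level choice is mostly about readability.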

Is it faster? Who knows. Probably not measurably: both versions read from the same small table. The compiler is much better placed than you are to work out whether avoiding branches pays off on the architecture it is optimising for.

Is it more elegant and readable? For myself, no. It's the kind of code I used to write when:

  • C was still a young language
  • processors were simple enough that I could do a better job of optimising
  • processors were so slow that I had to

None of these apply any more.

If this checksum is indeed a CRC, there is a much more efficient way to implement it.

Assuming it's a CRC16:

Header:

class CRC16
{
public:
    CRC16(const unsigned short poly);
    unsigned short CalcCRC(unsigned char * pbuf, int len);

private:
    unsigned short CRCTab[256];
    unsigned long SwapBits(unsigned long swap, int bits);
};

Implementation:

CRC16::CRC16(const unsigned short poly)
{
    // Build the lookup table: one precomputed CRC value per possible byte.
    for(int i = 0; i < 256; i++) {
        CRCTab[i] = SwapBits(i, 8) << 8;
        for(int j = 0; j < 8; j++)
            CRCTab[i] = (CRCTab[i] << 1) ^ ((CRCTab[i] & 0x8000) ? poly : 0);
        CRCTab[i] = SwapBits(CRCTab[i], 16);
    }
}

// Reverse the order of the lowest `bits` bits of `swap`.
unsigned long CRC16::SwapBits(unsigned long swap, int bits)
{
    unsigned long r = 0;
    for(int i = 0; i < bits; i++) {
        if(swap & 1) r |= 1 << (bits - i - 1);
        swap >>= 1;
    }
    return r;
}

unsigned short CRC16::CalcCRC(unsigned char * pbuf, int len)
{
    unsigned short r = 0;
    while(len--) r = (r >> 8) ^ CRCTab[(r & 0xFF) ^ *(pbuf++)];
    return r;
}

As you can see, each byte of the message is used only once, instead of 8 times.
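To make the "once instead of 8 times" point concrete, here is a small free-function sketch that puts a bit-at-a-time reference next to a table-driven loop so the two can be compared. It assumes the CRC-16/ARC parameters (polynomial 0x8005 in its reflected form 0xA001, initial value 0), which may not be the CRC the question actually uses, and it builds the table directly from the reflected polynomial rather than via SwapBits. The names are illustrative.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed polynomial: 0x8005 reflected (the CRC-16/ARC parameters).
constexpr std::uint16_t kPolyReflected16 = 0xA001;

// Bit-at-a-time reference: eight shift/xor steps per message byte.
inline std::uint16_t crc16_bitwise(const std::uint8_t* p, std::size_t len)
{
    assert(p != nullptr || len == 0);
    std::uint16_t crc = 0;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; ++k)
            crc = static_cast<std::uint16_t>((crc >> 1) ^ ((crc & 1u) ? kPolyReflected16 : 0u));
    }
    return crc;
}

// Build the 256-entry table once: entry i is the CRC of the single byte i.
inline std::array<std::uint16_t, 256> make_crc16_table()
{
    std::array<std::uint16_t, 256> t{};
    for (unsigned i = 0; i < 256; ++i) {
        const auto byte = static_cast<std::uint8_t>(i);
        t[i] = crc16_bitwise(&byte, 1);
    }
    return t;
}

// Table-driven version: one lookup per message byte.
inline std::uint16_t crc16_table(const std::array<std::uint16_t, 256>& t,
                                 const std::uint8_t* p, std::size_t len)
{
    assert(p != nullptr || len == 0);
    std::uint16_t crc = 0;
    while (len--)
        crc = static_cast<std::uint16_t>((crc >> 8) ^ t[(crc & 0xFFu) ^ *p++]);
    return crc;
}
```

Under these assumed parameters both functions should give 0xBB3D for the ASCII string "123456789", the usual check value for CRC-16/ARC.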

There is a similar implementation for CRC8.
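As a rough idea of what the 8-bit case could look like, the sketch below assumes the plain CRC-8 parameters (polynomial 0x07, MSB-first, initial value 0). These parameters and all names are assumptions for illustration, not taken from the question.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed polynomial x^8 + x^2 + x + 1.
constexpr std::uint8_t kPoly8 = 0x07;

// Bit-at-a-time reference, most significant bit first.
inline std::uint8_t crc8_bitwise(const std::uint8_t* p, std::size_t len)
{
    assert(p != nullptr || len == 0);
    std::uint8_t crc = 0;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; ++k)
            crc = static_cast<std::uint8_t>((crc << 1) ^ ((crc & 0x80u) ? kPoly8 : 0u));
    }
    return crc;
}

// Entry i is the CRC of the single byte i.
inline std::array<std::uint8_t, 256> make_crc8_table()
{
    std::array<std::uint8_t, 256> t{};
    for (unsigned i = 0; i < 256; ++i) {
        const auto byte = static_cast<std::uint8_t>(i);
        t[i] = crc8_bitwise(&byte, 1);
    }
    return t;
}

// With an 8-bit register the whole state is the table index, so each message
// byte costs one xor and one lookup.
inline std::uint8_t crc8_table(const std::array<std::uint8_t, 256>& t,
                               const std::uint8_t* p, std::size_t len)
{
    assert(p != nullptr || len == 0);
    std::uint8_t crc = 0;
    while (len--)
        crc = t[crc ^ *p++];
    return crc;
}
```

Because the register is only one byte wide, there is no shifted remainder to fold back in, which is why the inner loop is even shorter than the CRC16 one.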

Out of interest, extending alain's excellent suggestion of precomputing the CRC table, it occurs to me that this class can be modified to take advantage of C++14's constexpr:

#include <iostream>
#include <utility>
#include <string>

class CRC16
{
private:

    // the storage for the CRC table, to be computed at compile time
    unsigned short CRCTab[256];

    // private template-expanded constructor allows folded calls to SwapBits at compile time
    template<std::size_t... Is>
    constexpr CRC16(const unsigned short poly, std::integer_sequence<std::size_t, Is...>)
    : CRCTab { SwapBits(Is, 8) << 8 ... }
    {}

    // swap bits at compile time
    static constexpr unsigned long SwapBits(unsigned long swap, int bits)
    {
        unsigned long r = 0;
        for(int i = 0; i < bits; i++) {
            if(swap & 1) r |= 1 << (bits - i - 1);
            swap >>= 1;
        }
        return r;
    }

public:

    // public constexpr defers to private template expansion...
    constexpr CRC16(const unsigned short poly)
    : CRC16(poly, std::make_index_sequence<256>())
    {
        //... and then modifies the table - at compile time
        for(int i = 0; i < 256; i++) {
            for(int j = 0; j < 8; j++)
                CRCTab[i] = (CRCTab[i] << 1) ^ ((CRCTab[i] & 0x8000) ? poly : 0);
            CRCTab[i] = SwapBits(CRCTab[i], 16);
        }
    }

    // made const so that we can instantiate constexpr CRC16 objects
    unsigned short CalcCRC(const unsigned char * pbuf, int len) const
    {
        unsigned short r = 0;
        while(len--) r = (r >> 8) ^ CRCTab[(r & 0xFF) ^ *(pbuf++)];
        return r;
    }
};

int main()
{
    // create my constexpr CRC16 object at compile time
    constexpr CRC16 crctab(1234);

    // calculate the CRC of something...
    using namespace std;
    auto s = "hello world"s;

    auto crc = crctab.CalcCRC(reinterpret_cast<const unsigned char*>(s.data()), s.size());

    cout << crc << endl;

    return 0;
}

Then the constructor of CRC16(1234) pleasingly boils down to this:

    .short  0                       ## 0x0
    .short  9478                    ## 0x2506
    .short  18956                   ## 0x4a0c
    .short  28426                   ## 0x6f0a
    .short  601                     ## 0x259
    .short  10079                   ## 0x275f
    .short  18517                   ## 0x4855
    .short  27987                   ## 0x6d53
... etc.

and the calculation of the CRC of the entire string becomes this:

        leaq    __ZZ4mainE6crctab(%rip), %rdi ; <- referencing const data :)
        movzwl  (%rdi,%rdx,2), %edx
        jmp     LBB0_8
        xorl    %edx, %edx
        jmp     LBB0_11
        xorl    %edx, %edx
LBB0_8:                                 ## %.lr.ph.i.preheader.split
        testl   %esi, %esi
        je      LBB0_11
## BB#9:
        leaq    __ZZ4mainE6crctab(%rip), %rsi
        .align  4, 0x90
LBB0_10:                                ## %.lr.ph.i
                                        ## =>This Inner Loop Header: Depth=1
        movzwl  %dx, %edi
        movzbl  %dh, %edx  # NOREX
        movzbl  %dil, %edi
        movzbl  (%rcx), %ebx
        xorq    %rdi, %rbx
        xorw    (%rsi,%rbx,2), %dx
        movzwl  %dx, %edi
        movzbl  %dh, %edx  # NOREX
        movzbl  %dil, %edi
        movzbl  1(%rcx), %ebx
        xorq    %rdi, %rbx
        xorw    (%rsi,%rbx,2), %dx
        addq    $2, %rcx
        addl    $-2, %eax
        jne     LBB0_10
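The compile-time table can also be sketched without the index-sequence constructor, by filling a small aggregate inside a C++14 constexpr function and checking an entry with static_assert. This variant is not from the answer above: it assumes the reflected polynomial 0xA001 (CRC-16/ARC) instead of 1234, and the names Crc16Table, make_crc16_table, kArcTable and crc16 are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// A plain aggregate, so that a constexpr function can fill it in C++14.
struct Crc16Table {
    std::uint16_t entry[256];
};

// Build the table with ordinary loops; evaluated at compile time when the
// result initialises a constexpr object.
constexpr Crc16Table make_crc16_table(std::uint16_t reflected_poly)
{
    Crc16Table t{};
    for (unsigned i = 0; i < 256; ++i) {
        std::uint16_t v = static_cast<std::uint16_t>(i);
        for (int k = 0; k < 8; ++k)
            v = static_cast<std::uint16_t>((v >> 1) ^ ((v & 1u) ? reflected_poly : 0u));
        t.entry[i] = v;
    }
    return t;
}

// Assumed polynomial: 0x8005 reflected (the CRC-16/ARC parameters).
constexpr Crc16Table kArcTable = make_crc16_table(0xA001);

// If this compiles, the table really was computed by the compiler.
static_assert(kArcTable.entry[1] == 0xC0C1, "unexpected table entry");

// Run-time lookup loop over the compile-time table.
inline std::uint16_t crc16(const Crc16Table& t, const std::uint8_t* p, std::size_t len)
{
    assert(p != nullptr || len == 0);
    std::uint16_t crc = 0;
    while (len--)
        crc = static_cast<std::uint16_t>((crc >> 8) ^ t.entry[(crc & 0xFFu) ^ *p++]);
    return crc;
}
```

The trade-off is mostly stylistic: the loop-based builder is shorter to read, while the class above keeps the table and the calculation together in one type.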

