
How to optimize C++/C code for a large number of integers

I have written the code below. The code checks the first bit of every byte. If the first bit of a byte is equal to 0, it concatenates that value with the previous bytes and stores the result in a variable var1. Here pos points to the bytes of an integer. An integer in my implementation is a uint64_t and can occupy up to 8 bytes.

uint64_t func(char* data)
{
    uint64_t var1 = 0; int i=0;
    while ((data[i] >> 7) == 0) 
    {
        var1 = (var1 << 7) | (data[i]);
        i++;
    }
    return var1;
}

I am calling func() repeatedly, trillions of times for trillions of integers, so it runs slowly. Is there a way to optimize this code?

EDIT: Thanks to Joe Z; it is indeed a form of uleb128 unpacking.

I have only tested this minimally; I am happy to fix glitches in it. With modern processors, you want to bias your code heavily toward easily predicted branches. And, if you can safely read the next 10 bytes of input, there's nothing to be saved by guarding their reads with conditional branches. That leads me to the following code:

// fast uleb128 decode
// assumes you can read all 10 bytes at *data safely.
// assumes standard uleb128 format, with LSB first, and 
// ... bit 7 indicating "more data in next byte"

uint64_t unpack( const uint8_t *const data )
{
    uint64_t value = ((data[0] & 0x7F   ) <<  0)
                   | ((data[1] & 0x7F   ) <<  7)
                   | ((data[2] & 0x7F   ) << 14)
                   | ((data[3] & 0x7F   ) << 21)
                   | ((data[4] & 0x7Full) << 28)
                   | ((data[5] & 0x7Full) << 35)
                   | ((data[6] & 0x7Full) << 42)
                   | ((data[7] & 0x7Full) << 49)
                   | ((data[8] & 0x7Full) << 56)
                   | ((data[9] & 0x7Full) << 63);

    if ((data[0] & 0x80) == 0) value &= 0x000000000000007Full; else
    if ((data[1] & 0x80) == 0) value &= 0x0000000000003FFFull; else
    if ((data[2] & 0x80) == 0) value &= 0x00000000001FFFFFull; else
    if ((data[3] & 0x80) == 0) value &= 0x000000000FFFFFFFull; else
    if ((data[4] & 0x80) == 0) value &= 0x00000007FFFFFFFFull; else
    if ((data[5] & 0x80) == 0) value &= 0x000003FFFFFFFFFFull; else
    if ((data[6] & 0x80) == 0) value &= 0x0001FFFFFFFFFFFFull; else
    if ((data[7] & 0x80) == 0) value &= 0x00FFFFFFFFFFFFFFull; else
    if ((data[8] & 0x80) == 0) value &= 0x7FFFFFFFFFFFFFFFull;

    return value;
}

The basic idea is that small values are common (and so most of the if statements won't be reached), but assembling the 64-bit value that needs to be masked is something that can be efficiently pipelined. With a good branch predictor, I think the above code should work pretty well. You might also try removing the else keywords (without changing anything else) to see if that makes a difference. Branch predictors are subtle beasts, and the exact character of your data also matters. If nothing else, you should be able to see that the else keywords are optional from a logic standpoint, and are there only to guide the compiler's code generation and provide an avenue for optimizing the hardware's branch predictor behavior.

Ultimately, whether or not this approach is effective depends on the distribution of your dataset. If you try out this function, I would be interested to know how it turns out. This particular function focuses on standard uleb128, where the value gets sent LSB first, and bit 7 == 1 means that the data continues.
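To experiment with the decoder above, you need input in the same format. Here is a minimal standard-uleb128 encoder sketch (the function name is illustrative, not from any library): low 7 bits first, bit 7 set on every byte except the last.

```cpp
#include <cstdint>
#include <cstddef>

// Encode value as standard uleb128 into out; returns the number of
// bytes written (at most 10 for a uint64_t).
size_t uleb128_encode(uint64_t value, uint8_t* out)
{
    size_t n = 0;
    do {
        uint8_t byte = value & 0x7F; // low 7 bits of the remaining value
        value >>= 7;
        if (value != 0)
            byte |= 0x80;            // bit 7 set: more bytes follow
        out[n++] = byte;
    } while (value != 0);
    return n;
}
```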

There are SIMD approaches, but none of them lend themselves readily to 7-bit data.

Also, if you can mark this function inline in a header, that may also help. It all depends on how many places it gets called from, and whether those places are in a different source file. In general, though, inlining when possible is highly recommended.

Your code is problematic:

uint64_t func(const unsigned char* pos)
{
    uint64_t var1 = 0; int i=0;
    while ((pos[i] >> 7) == 0) 
    {
        var1 = (var1 << 7) | (pos[i]);
        i++;
    }
    return var1;    
}

First, a minor thing: i should be unsigned.

Second: You don't check that you don't read beyond the boundary of pos. E.g., if all values in your pos array are 0, then you will reach pos[size], where size is the size of the array, and hence invoke undefined behaviour. You should pass the size of the array to the function and check that i is smaller than it.

Third: If pos[i] has its most significant bit equal to zero for i = 0, ..., k with k > 10, then the earlier work gets discarded (as you push the old values out of var1).

The third point actually helps us:

uint64_t func(const unsigned char* pos, size_t size)
{
    size_t i(0);
    while ( i < size && (pos[i] >> 7) == 0 )
    {
       ++i;
    }
    // At this point, i is either equal to size or
    // i is the index of the first pos value you don't want to use.
    // Therefore we want to use the values
    // pos[i-10], pos[i-9], ..., pos[i-1]
    // if i is less than 10, we obviously need to ignore some of the values
    const size_t start = (i >= 10) ? (i - 10) : 0;
    uint64_t var1 = 0;
    for ( size_t j(start); j < i; ++j )
    {
       var1 <<= 7;
       var1 += pos[j];
    }
    return var1; 
}

In conclusion: we separated the logic and got rid of all discarded entries. The speed-up depends on the actual data you have. If lots of entries are discarded, then this approach saves you a lot of writes to var1.

Another thing: usually, if one function is called massively often, the best optimization you can make is to call it less. Perhaps you can come up with an additional condition that makes the call to this function unnecessary.

Keep in mind that if you actually use 10 values, the first value ends up being truncated.

64 bits means that only 9 values can be represented with their full 7 bits of information, leaving exactly one bit for the tenth. You might want to switch to uint128_t.
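As a sketch of that suggestion: on GCC and Clang, the extension type unsigned __int128 can hold all 70 bits of ten 7-bit groups, so nothing is truncated. (unsigned __int128 is not standard C++, and the function name here is illustrative; it keeps the original's big-endian concatenation.)

```cpp
#include <cstdint>

// Concatenate the low 7 bits of n bytes, most significant group first,
// into a 128-bit accumulator. 10 groups need 70 bits, which no longer
// overflows as it would in a uint64_t.
unsigned __int128 unpack128(const uint8_t* data, int n)
{
    unsigned __int128 v = 0;
    for (int i = 0; i < n; ++i)
        v = (v << 7) | (data[i] & 0x7F);
    return v;
}
```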

A small optimization would be:

while ((pos[i] & 0x80) == 0) 

A bitwise AND is generally faster than a shift. This of course depends on the platform, and it's also possible that the compiler will do this optimization itself.

Can you change the encoding?

Google came across the same problem, and Jeff Dean describes a really cool solution on slide 55 of his presentation:

The basic idea is that reading the first bit of several bytes is poorly supported on modern architectures. Instead, take 8 of these bits and pack them as a single byte preceding the data. The prefix byte is then used to index into a 256-item lookup table, which holds masks describing how to extract numbers from the rest of the data.

I believe this is how protocol buffers are currently encoded.
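The best-known instance of this idea is group varint, where the prefix byte carries a two-bit length tag for each of four 32-bit values. Here is a minimal sketch (names are illustrative, not from any real library); a tuned implementation would replace the inner length computation and byte loop with the 256-entry table of masks and offsets described above.

```cpp
#include <cstdint>
#include <cstddef>

// Decode one group of four 32-bit values. The prefix byte holds four
// 2-bit tags; tag value t means the corresponding number occupies t+1
// bytes, stored least significant byte first. No data byte is ever
// tested for a continuation bit. Returns the number of bytes consumed.
size_t group_varint_decode(const uint8_t* in, uint32_t out[4])
{
    const uint8_t tag = in[0];
    const uint8_t* p = in + 1;
    for (int i = 0; i < 4; ++i) {
        int len = ((tag >> (2 * i)) & 3) + 1;   // 1..4 bytes per value
        uint32_t v = 0;
        for (int b = len - 1; b >= 0; --b)      // assemble LSB-first bytes
            v = (v << 8) | p[b];
        out[i] = v;
        p += len;
    }
    return (size_t)(p - in);
}
```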

Can you change your encoding? As you've discovered, using one bit in each byte to indicate whether another byte follows really sucks for processing efficiency.

A better way to do it is to model it on UTF-8, which encodes the length of the full int into the first byte:

0xxxxxxx // one byte with 7 bits of data
110xxxxx 10xxxxxx // two bytes with 11 bits of data
1110xxxx 10xxxxxx 10xxxxxx // three bytes with 16 bits of data
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx // four bytes with 21 bits of data
// etc.

But UTF-8 has special properties to make it easier to distinguish from ASCII. This bloats the data, and you don't care about ASCII, so you'd modify it to look like this:

0xxxxxxx // one byte with 7 bits of data
10xxxxxx xxxxxxxx // two bytes with 14 bits of data.
110xxxxx xxxxxxxx xxxxxxxx // three bytes with 21 bits of data
1110xxxx xxxxxxxx xxxxxxxx xxxxxxxx // four bytes with 28 bits of data
// etc.

This has the same compression level as your method (up to 64 bits = 9 bytes), but is significantly easier for a CPU to process.

From this you can build a lookup table for the first byte which gives you a mask and a length:

// byte_counts[255] contains the number of additional
// bytes if the first byte has a value of 255.
uint8_t const byte_counts[256]; // a global constant.

// byte_masks[255] contains a mask for the useful bits in
// the first byte, if the first byte has a value of 255.
uint8_t const byte_masks[256]; // a global constant.
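One plausible way to fill these two tables (a sketch; the struct wrapper exists only to give them an initializer): for this encoding, the number of leading 1 bits in the first byte is the number of additional bytes, and the remaining low bits of the first byte are data.

```cpp
#include <cstdint>

// Build byte_counts (additional bytes) and byte_masks (useful bits in
// the first byte) for every possible first-byte value.
struct Tables {
    uint8_t byte_counts[256];
    uint8_t byte_masks[256];
    Tables() {
        for (int b = 0; b < 256; ++b) {
            int n = 0;                              // count leading 1 bits
            while (n < 8 && (b & (0x80 >> n)) != 0)
                ++n;
            byte_counts[b] = (uint8_t)n;            // n extra bytes follow
            byte_masks[b]  = (uint8_t)(0xFF >> (n + 1)); // data bits left
        }
    }
};
```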

And then to decode:

// the resulting value.
uint64_t v = 0;

// mask off the data bits in the first byte.
v = *data & byte_masks[*data];

// read in the rest.
switch(byte_counts[*data])
{
    case 3: v = v << 8 | *++data;
    case 2: v = v << 8 | *++data;
    case 1: v = v << 8 | *++data;
    case 0: return v;
    default:
        // If you're on VC++, this'll make it take one less branch.
        // Better make sure you've got all the valid inputs covered, though!
        __assume(0);
}

No matter the size of the integer, this hits only one branch point: the switch, which will likely be compiled into a jump table. You can potentially optimize it even further for ILP by not letting each case fall through.
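A sketch of that last suggestion (the function name is hypothetical): each case assembles the whole value with constant shifts instead of falling through, so its loads and ORs are independent of one another and can issue in parallel. Only lengths 0 to 3 are shown, mirroring the table above; real code would cover lengths up to 8.

```cpp
#include <cstdint>

// first_masked is the first byte with byte_masks already applied;
// len is the additional-byte count from byte_counts.
uint64_t decode_no_fallthrough(const uint8_t* data, uint8_t first_masked, int len)
{
    uint64_t v = first_masked;
    switch (len) {
    case 0: return v;
    case 1: return (v << 8)  |  (uint64_t)data[1];
    case 2: return (v << 16) | ((uint64_t)data[1] << 8)  | data[2];
    case 3: return (v << 24) | ((uint64_t)data[1] << 16)
                             | ((uint64_t)data[2] << 8)  | data[3];
    default: return v; // invalid input for this sketch
    }
}
```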

First, rather than shifting, you can do a bitwise test on the relevant bit. Second, you can use a pointer rather than indexing (but the compiler should do this optimization itself). Thus:

uint64_t
readUnsignedVarLength( unsigned char const* pos )
{
    uint64_t results = 0;
    while ( (*pos & 0x80) == 0 ) {
        results = (results << 7) | *pos;
        ++ pos;
    }
    return results;
}

At least, this corresponds to what your code does. For variable-length encoding of unsigned integers, it is incorrect, since 1) variable-length encodings are little-endian, and your code is big-endian, and 2) your code doesn't or in the high-order byte. Finally, the Wiki page suggests that you've got the test inverted. (I know this format mainly from BER encoding and Google protocol buffers, both of which set bit 7 to indicate that another byte will follow.)

The routine I use is:

uint64_t
readUnsignedVarLen( unsigned char const* source )
{
    int shift = 0;
    uint64_t results = 0;
    uint8_t tmp = *source ++;
    while ( ( tmp & 0x80 ) != 0 ) {
        results |= (uint64_t)( tmp & 0x7F ) << shift;
        shift += 7;
        tmp = *source ++;
    }
    return results | ((uint64_t)tmp << shift);
}

For the rest, this wasn't written with performance in mind, but I doubt that you could do significantly better. An alternative solution would be to pick up all of the bytes first, then process them in reverse order:

uint64_t
readUnsignedVarLen( unsigned char const* source )
{
    unsigned char buffer[10];
    unsigned char* p = std::begin( buffer );
    while ( p != std::end( buffer ) && (*source & 0x80) != 0 ) {
        *p = *source & 0x7F;
        ++ p;
        ++ source;
    }
    assert( p != std::end( buffer ) );
    *p = *source;
    ++ p;
    uint64_t results = 0;
    while ( p != std::begin( buffer ) ) {
        -- p;
        results = (results << 7) + *p;
    }
    return results;
}

The necessity of checking for buffer overrun will likely make this slightly slower, but on some architectures, shifting by a constant is significantly faster than shifting by a variable, so it could be faster on them.

Globally, however, don't expect miracles. The motivation for using variable-length integers is to reduce data size, at a cost in runtime for decoding and encoding.
