
Bitstream Optimizations

I have a program which reads a lot of data from a bitstream. My way of doing this doesn't seem to be efficient: when I run performance tests, most of the time is spent in the read function. This is my read function:

uint32_t bitstream::read(bitstream::size_type n) {
    uint32_t a = data[pos / 32];
    uint32_t b = data[(pos + n - 1) / 32];
    uint32_t shift = pos & 31;

    a >>= shift;
    b <<= 32 - shift;
    uint32_t mask = (uint32_t)(((uint64_t)1 << n) - 1);
    uint32_t ret = (a | b) & mask;

    pos += n;
    return ret;
}

How can I further optimize this? My profiler says most of the time in this function is spent computing ret.

Edit:

Regarding internals, this is how I set data:

bitstream::bitstream(const std::string &dat) : size( dat.size()*8 ) {
    // data has the type std::vector<uint32_t>
    data.resize((dat.size() + 3) / 4 + 1);
    memcpy(&data[0], dat.c_str(), dat.size());
}

Are you always reading the same number of bits, or does it vary?

If you are, then you could try writing a function that reads only that many bits: with n a constant, the compiler may be able to make some more aggressive optimisations. (And if n is always 1, you could write a much simpler read method.)
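To illustrate the idea, here is a hypothetical sketch where the read width is a template parameter (the struct name and layout are illustrative, not the asker's actual class). The mask becomes a compile-time constant, and the compiler can specialize the shifts per width:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative sketch: N as a template parameter lets the compiler fold
// the mask at compile time and specialize the function per read width.
struct fixed_bitstream {
    std::vector<uint32_t> data;
    uint32_t pos = 0;

    template <unsigned N>
    uint32_t read() {
        static_assert(N >= 1 && N <= 32, "N must be in 1..32");
        uint32_t shift = pos & 31;
        uint32_t a = data[pos / 32] >> shift;
        // Only fetch and shift the second word when shift != 0; when
        // shift == 0 the first word already holds all requested bits,
        // and this also avoids the undefined shift by 32.
        uint32_t b = shift ? data[(pos + N - 1) / 32] << (32 - shift) : 0;
        pos += N;
        return (a | b) & (uint32_t)(((uint64_t)1 << N) - 1);
    }
};
```

A call like `s.read<8>()` then compiles down to a fixed shift-and-mask sequence with no per-call mask computation.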

The answer mostly depends on the CPU architecture and the compiler you use, not on the language. If your CPU is narrower than 32 bits, does a bad job at right shifting, or the compiler's bit-shifting subroutines are naively implemented, you are out of luck in general. You could sacrifice vast amounts of program memory and write out all cases explicitly (i.e. switch()-ing on the combination of pos modulo 32 and n), or you could try to do the compiler's job by short-circuiting the shifts with uint16_t and uint8_t unions.

What you could do very cheaply in your code is use a precomputed class-const array for mask instead of calculating it on every call.

You could try keeping a buffer of 64 bits in an uint64_t, reading another 32-bit word as soon as it falls below 32 bits. This would probably help if you often read sizes much smaller than 32 bits.

If pos is a multiple of 32, then shift is 0 and b is shifted left by 32 bits. For a 32-bit type that is actually undefined behaviour in C++ (on x86 the shift count is taken modulo 32, so b is not even cleared; the code only happens to work because a and b are the same word there). The right shift of a by 0 is likewise a no-op. You should handle this case separately, both to avoid the undefined shift and to skip pointless operations.
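One way to sketch that fix, written here as a free function mirroring the original read (signature is illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch: branch on shift so the second word is never shifted by 32.
// When shift == 0 the first word already contains all requested bits.
uint32_t read_bits(const std::vector<uint32_t>& data,
                   uint32_t& pos, uint32_t n) {
    uint32_t shift = pos & 31;
    uint32_t a = data[pos / 32] >> shift;
    if (shift != 0)  // second word can only contribute bits in this case
        a |= data[(pos + n - 1) / 32] << (32 - shift);
    pos += n;
    return a & (uint32_t)(((uint64_t)1 << n) - 1);
}
```

On aligned reads this skips two shifts and an OR entirely, and the branch is well predicted when read positions are mostly aligned or mostly unaligned.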

Furthermore, you can try using a mask table to eliminate one shift operation; you only need a uint32_t array of 32 entries for that.
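For example, such a table could be built once like this (using 33 entries here so that n can index it directly, including n == 32, without an n - 1 adjustment):

```cpp
#include <cassert>
#include <cstdint>
#include <array>

// Precomputed mask table: mask_for[n] has the low n bits set.
// Built once at startup; read() then does `(a | b) & mask_for[n]`
// with no shift or 64-bit arithmetic for the mask.
static const std::array<uint32_t, 33> mask_for = [] {
    std::array<uint32_t, 33> m{};
    m[0] = 0;
    for (unsigned n = 1; n <= 32; ++n)
        m[n] = 0xFFFFFFFFu >> (32 - n);
    return m;
}();
```

The table fits in 132 bytes, so it stays resident in L1 cache across calls.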

Most modern Intel CPUs have two ALU units; asking them to do three shifts in a row and then compute a result that depends on the outcome of those shifts using even more ALU operations will limit your throughput.

Finally, if the code will be executed on CPUs with BMI capability, you can use the BEXTR instruction or intrinsic to extract len bits from src starting at position start.
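A sketch of how that might look with the `_bextr_u32` intrinsic. The intrinsic requires compiling with BMI enabled (e.g. -mbmi on GCC/Clang), so this wrapper falls back to a portable version with the same semantics when BMI is not available:

```cpp
#include <cassert>
#include <cstdint>
#ifdef __BMI__
#include <immintrin.h>  // _bextr_u32, needs -mbmi on GCC/Clang
#endif

// bextr(src, start, len): extract len bits of src beginning at bit
// position start. Assumes len in 0..32 and start in 0..31 for the
// fallback path; the hardware instruction handles these cases the
// same way.
uint32_t bextr(uint32_t src, uint32_t start, uint32_t len) {
#ifdef __BMI__
    return _bextr_u32(src, start, len);
#else
    if (start >= 32 || len == 0) return 0;
    uint64_t shifted = (uint64_t)src >> start;
    return (uint32_t)(shifted & (((uint64_t)1 << len) - 1));
#endif
}
```

With BEXTR, the shift-then-mask pair in the original read collapses into one instruction per word, although the two-word splice for reads straddling a word boundary is still needed.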

For more info about bit manipulation instructions see http://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets .
