
C++ fast division/mod by 10^x

My program does a lot of integer division by 10^x and integer modulo by powers of 10.

For example:

unsigned __int64 a = 12345;
a = a / 100;
....

or:

unsigned __int64 a = 12345;
a = a % 1000;
....

If I use the right bit shift >> , I get division/modulo by 2^x, which is not what I want.

Is there any way to speed up the integer division and mod operations in my program?

Short Answer: NO

Long Answer: NO.

Explanation:
The compiler is already optimizing statements like this for you.
If there is a technique for implementing this quicker than an integer division, then the compiler already knows about it and will apply it (assuming you turn optimizations on).

If you also provide the appropriate architecture flags, the compiler may even know about fast architecture-specific instructions that provide a nice trick for the operation; otherwise it will apply the best trick for the generic architecture it was compiled for.

In short, the compiler will beat the human 99.9999999% of the time at any optimization trick (try it; remember to add the optimization and architecture flags). So the best you can normally do is equal the compiler.

If by some miracle you discover a method that has not already been found by the assembly boffins who work closely with the compiler's backend team, then please let them know, and the next version of the popular compilers will be updated with the 'unknown (google)' division-by-10 optimization trick.

From http://www.hackersdelight.org/divcMore.pdf

unsigned divu10(unsigned n) {
    unsigned q, r;
    q = (n >> 1) + (n >> 2);
    q = q + (q >> 4);
    q = q + (q >> 8);
    q = q + (q >> 16);
    q = q >> 3;
    r = n - q * 10;
    return q + ((r + 6) >> 4);
}

This is great for environments that lack any div operation, and it's only about 2x slower than native division on my i7 (optimizations off, naturally).

Here's a slightly faster version of the algorithm, though there are still some nasty rounding errors with negative numbers.

static signed Div10(signed n)
{
    n = (n >> 1) + (n >> 2);
    n += n < 0 ? 9 : 2;
    n = n + (n >> 4);
    n = n + (n >> 8);
    n = n + (n >> 16);
    n = n >> 3;
    return n;
}

Since this method is for 32-bit integer precision, you can optimize away most of these shifts if you're working in an 8-bit or 16-bit environment.

On a different note, it might make more sense to just write a proper version of Div#n# in assembler. Compilers can't always predict the end result as efficiently (though in most cases they do it rather well), so if you're running in a low-level microchip environment, consider a hand-written asm routine.

#define BitWise_Div10(result, n) {      \
    /*;n = (n >> 1) + (n >> 2);*/           \
    __asm   mov     eax, dword ptr[n]       \
    __asm   mov     ecx, eax                \
    __asm   sar     eax,1                   \
    __asm   sar     ecx,2                   \
    __asm   add     ecx,eax                 \
    /*;n += n < 0 ? 9 : 2;*/                \
    __asm   xor     eax,eax                 \
    __asm   setns   al                      \
    __asm   dec     eax                     \
    __asm   and     eax,7                   \
    __asm   add     eax,2                   \
    __asm   add     ecx,eax                 \
    /*;n = n + (n >> 4);*/                  \
    __asm   mov     eax,ecx                 \
    __asm   sar     eax,4                   \
    __asm   add     ecx,eax                 \
    /*;n = n + (n >> 8);*/                  \
    __asm   mov     eax,ecx                 \
    __asm   sar     eax,8                   \
    __asm   add     ecx,eax                 \
    /*;n = n + (n >> 16);*/                 \
    __asm   mov     eax,ecx                 \
    __asm   sar     eax,10h                 \
    __asm   add     eax,ecx                 \
    /*;return n >> 3;}*/                    \
    __asm   sar     eax,3                   \
    __asm   mov     dword ptr[result], eax  \
}

Usage:

int x = 12399;
int r;
BitWise_Div10(r, x); // r = x / 10
// r == 1239

Again, just a note. This is better used on chips that indeed have really bad division. On modern processors and modern compilers, divisions are often optimized out in very clever ways.

You can also take a look at the libdivide project. It is designed to speed up integer division in the general case, when the divisor is only known at run time.

Unless your architecture supports binary-coded decimal, you will get nothing but a lot of assembly mess.

Short Answer: THAT DEPENDS.

Long Answer:

Yes, it is very possible IF you can use things that the compiler cannot automatically deduce. In my experience this is quite rare, though; most compilers are pretty good at vectorizing nowadays. Much also depends on how you model your data and how willing you are to create incredibly complex code. For most users, I wouldn't recommend going to the trouble in the first place.

To give you an example, here's the implementation of x / 10 where x is a signed integer (this is actually what the compiler will generate):

int eax = value * 0x66666667;
int edx = ([high 32 bits of the multiplication] >> 2); // NOTE: use an arithmetic shift here!
int result = edx + ((unsigned)edx >> 31);              // NOTE: logical shift; adds 1 only for negative edx

If you disassemble your compiled C++ code and you used a constant for the '10', the assembly will reflect the above. If you didn't use a constant, it'll generate an idiv , which is much slower.

Knowing that your memory is aligned, or knowing that your code can be vectorized, can be very beneficial. Do note that this requires you to store your data in a way that makes this possible.

For example, if you want to calculate the sum-of-div/10's of all integers, you can do something like this:

    __m256i ctr = _mm256_set_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    ctr = _mm256_add_epi32(_mm256_set1_epi32(INT32_MIN), ctr);

    __m256i sumdiv = _mm256_set1_epi32(0);
    const __m256i magic = _mm256_set1_epi32(0x66666667);
    const int shift = 2;

    // Show that this is correct:
    for (long long int i = INT32_MIN; i <= INT32_MAX; i += 8)
    {
        // Compute the overflow values
        __m256i ovf1 = _mm256_srli_epi64(_mm256_mul_epi32(ctr, magic), 32);
        __m256i ovf2 = _mm256_mul_epi32(_mm256_srli_epi64(ctr, 32), magic);

        // blend the overflows together again
        __m256i rem = _mm256_srai_epi32(_mm256_blend_epi32(ovf1, ovf2, 0xAA), shift);

        // calculate the div value
        __m256i div = _mm256_add_epi32(rem, _mm256_srli_epi32(rem, 31));

        // do something with the result; increment the counter
        sumdiv = _mm256_add_epi32(sumdiv, div);
        ctr = _mm256_add_epi32(ctr, _mm256_set1_epi32(8));
    }

    int sum = 0;
    for (int i = 0; i < 8; ++i) { sum += sumdiv.m256i_i32[i]; }
    std::cout << sum << std::endl;

If you benchmark both implementations, you will find that on an Intel Haswell processor, you'll get these results:

  • idiv: 1.4 GB/s
  • compiler optimized: 4 GB/s
  • AVX2 instructions: 16 GB/s

For other powers of 10 and unsigned division, I recommend reading the paper.

In fact, you don't need to do anything. The compiler is smart enough to optimize multiplications and divisions by constants; you can find many examples of the generated code online.

You can even do a fast divide by 5 and then shift right by 1.

If the divisor is an explicit compile-time constant (i.e., if the x in 10^x is a compile-time constant), there's absolutely no point in using anything other than the language-provided / and % operators. If there were a meaningful way to speed them up for explicit powers of 10, any self-respecting compiler would know how to do it and would do it for you.

The only situation in which you might think about a "custom" implementation (aside from a dumb compiler) is when x is a run-time value. In that case you'd need some kind of decimal-shift and decimal-and analogue. On a binary machine a speedup is probably possible, but I doubt you'll achieve anything practically meaningful. (If the numbers were stored in binary-coded decimal format it would be easy, but in "normal" cases, no.)

If your runtime is genuinely dominated by 10^x-related operations, you could just use a base-10 integer representation in the first place.

In most situations, I'd expect the slowdown of all other integer operations (and the reduced precision or potential extra memory use) to count for more than the faster 10^x operations.
