
C++ fast division/mod by 10^x

In my program I do a lot of integer division by 10^x and integer modulo by powers of 10.

For example:

unsigned __int64 a = 12345;
a = a / 100;
....

or:

unsigned __int64 a = 12345;
a = a % 1000;
....

If I use the right bit shift >>, I get division by 2^x, which is not what I want.

Is there any way I can speed up integer division and mod operations in my program?

Short Answer: NO

Long Answer: NO.

Explanation:
The compiler is already optimizing statements like this for you.
If there is a technique for implementing this more quickly than an integer division, the compiler already knows about it and will apply it (assuming you turn on optimizations).

If you also provide the appropriate architecture flags, the compiler may even know about fast architecture-specific instructions that provide a nice trick for the operation; otherwise it will apply the best trick for the generic architecture it was compiled for.

In short, the compiler will beat the human 99.9999999% of the time at any optimization trick (try it, and remember to add the optimization and architecture flags). So the best you can normally do is equal the compiler.

If by some miracle you discover a method that has not already been found by the assembly boffins who work closely with the backend compiler team, then please let them know, and the next version of the popular compilers will be updated with the 'unknown (google)' division by 10 optimization trick.
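To make the compiler's trick concrete, here is a minimal sketch for 32-bit unsigned values. The function names are made up for illustration, and the constant 0xCCCCCCCD with a shift of 35 is the commonly used reciprocal for dividing a 32-bit unsigned value by 10; the exact instructions a particular compiler emits may differ.

```cpp
#include <cstdint>

// What you write: plain division by a compile-time constant.
constexpr std::uint32_t div10_plain(std::uint32_t n) {
    return n / 10;
}

// Roughly what an optimizing compiler turns it into: multiply by
// ceil(2^35 / 10) = 0xCCCCCCCD in 64-bit arithmetic, then shift right by 35.
constexpr std::uint32_t div10_mulshift(std::uint32_t n) {
    return static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(n) * 0xCCCCCCCDu) >> 35);
}
```

Both functions return the same quotient for every 32-bit input, which is why hand-writing the second form rarely gains anything over n / 10.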

From http://www.hackersdelight.org/divcMore.pdf

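unsigned divu10(unsigned n) {
    unsigned q, r;
    q = (n >> 1) + (n >> 2);   // q = 0.75 * n (approximately)
    q = q + (q >> 4);          // each step refines q towards 0.8 * n
    q = q + (q >> 8);
    q = q + (q >> 16);
    q = q >> 3;                // q is now n / 10, possibly slightly too small
    r = n - q*10;              // remainder implied by the estimate
    return q + ((r + 6) >> 4); // add 1 if the remainder is 10 or more
}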

This is great for environments that lack any div operation, and it's only ~2x slower than native division on my i7 (optimizations off, naturally).
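If you want to convince yourself that the shift-and-add routine agrees with ordinary division before relying on it, a small check like the following can help. This is only a sketch: divu10 is restated with fixed-width types, and divu10_matches is a hypothetical helper name, not something from the linked paper.

```cpp
#include <cstdint>

// divu10 from above, restated with a fixed 32-bit type.
std::uint32_t divu10(std::uint32_t n) {
    std::uint32_t q = (n >> 1) + (n >> 2);
    q = q + (q >> 4);
    q = q + (q >> 8);
    q = q + (q >> 16);
    q = q >> 3;
    std::uint32_t r = n - q * 10;
    return q + ((r + 6) >> 4);
}

// Hypothetical helper: returns true if divu10 agrees with n / 10
// for every n in the inclusive range [first, last].
bool divu10_matches(std::uint32_t first, std::uint32_t last) {
    // A 64-bit counter avoids wrap-around when last is UINT32_MAX.
    for (std::uint64_t n = first; n <= last; ++n) {
        const std::uint32_t v = static_cast<std::uint32_t>(n);
        if (divu10(v) != v / 10) {
            return false;
        }
    }
    return true;
}
```

Calling divu10_matches(0, UINT32_MAX) walks the whole 32-bit range; smaller ranges are enough for a quick sanity check.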

Here's a slightly faster version of the algorithm, though it still has some nasty rounding errors with negative numbers.

static signed Div10(signed n)
{
    n = (n >> 1) + (n >> 2);
    n += n < 0 ? 9 : 2;
    n = n + (n >> 4);
    n = n + (n >> 8);
    n = n + (n >> 16);
    n = n >> 3;
    return n;
}

Since this method is for 32-bit integer precision, you can optimize away most of these shifts if you're working in an 8-bit or 16-bit environment.
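As an illustration of dropping shifts for narrower types, here is a sketch of a 16-bit variant. Note the assumptions: it is based on the unsigned divu10 routine rather than the signed Div10, and the name divu10_16 is invented for this example. For 16-bit inputs the intermediate q never reaches 2^16, so the q >> 16 step would always add zero and can be removed.

```cpp
#include <cstdint>

// 16-bit variant of the unsigned shift-and-add division by 10.
std::uint16_t divu10_16(std::uint16_t n) {
    unsigned q = (n >> 1) + (n >> 2);  // about 0.75 * n
    q = q + (q >> 4);
    q = q + (q >> 8);                  // about 0.8 * n, still below 2^16
    // The 32-bit step "q = q + (q >> 16)" is omitted: it would add 0 here.
    q = q >> 3;                        // estimate of n / 10
    unsigned r = n - q * 10;           // remainder implied by the estimate
    return static_cast<std::uint16_t>(q + ((r + 6) >> 4));
}
```

The 16-bit input space is small enough to compare every value against n / 10 in a test loop.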

On a different note, it might make more sense to just write a proper version of Div#n# in assembler. Compilers can't always predict the end result as efficiently (though, in most cases, they do it rather well). So if you're running in a low-level microchip environment, consider a hand-written asm routine.

#define BitWise_Div10(result, n) {      \
    /*;n = (n >> 1) + (n >> 2);*/           \
    __asm   mov     eax, dword ptr[n]       \
    __asm   mov     ecx,eax                 \
    __asm   sar     eax,1                   \
    __asm   sar     ecx,2                   \
    __asm   add     ecx,eax                 \
    /*;n += n < 0 ? 9 : 2;*/                \
    __asm   mov     eax,ecx                 \
    __asm   sar     eax,31                  \
    __asm   and     eax,7                   \
    __asm   add     eax,2                   \
    __asm   add     ecx,eax                 \
    /*;n = n + (n >> 4);*/                  \
    __asm   mov     eax,ecx                 \
    __asm   sar     eax,4                   \
    __asm   add     ecx,eax                 \
    /*;n = n + (n >> 8);*/                  \
    __asm   mov     eax,ecx                 \
    __asm   sar     eax,8                   \
    __asm   add     ecx,eax                 \
    /*;n = n + (n >> 16);*/                 \
    __asm   mov     eax,ecx                 \
    __asm   sar     eax,10h                 \
    __asm   add     eax,ecx                 \
    /*;return n >> 3;}*/                    \
    __asm   sar     eax,3                   \
    __asm   mov     dword ptr[result], eax  \
}

Usage:

int x = 12399;
int r;
BitWise_Div10(r, x); // r = x / 10
// r == 1239

Again, just a note: this is better used on chips that really do have bad division. On modern processors and with modern compilers, divisions are often optimized out in very clever ways.

You can also take a look at the libdivide project. It is designed to speed up integer division in the general case.

Not unless your architecture supports binary-coded decimal, and even then only with a lot of messy assembly.

Short Answer: THAT DEPENDS.

Long Answer:

Yes, it is very possible IF you can use things that the compiler cannot automatically deduce. However, in my experience this is quite rare; most compilers are pretty good at vectorizing nowadays. Much depends on how you model your data and how willing you are to create incredibly complex code. For most users, I wouldn't recommend going through the trouble in the first place.

To give you an example, here's the implementation of x / 10 where x is a signed integer (this is actually what the compiler will generate):

int eax = value * 0x66666667;
int edx = ([overflow from multiplication] >> 2); // NOTE: use arithmetic shift here!
int result = edx + ((unsigned)edx >> 31); // NOTE: logical shift here, adds 1 when edx is negative

If you disassemble your compiled C++ code, and you used a constant for the '10', it will show assembly code reflecting the above. If you didn't use a constant, it'll generate an idiv, which is much slower.
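For readers who want to run the idea rather than read registers, here is a minimal C++ sketch of the same multiply-and-shift sequence. The function name div10_magic is made up for this example, and it assumes that right-shifting a negative signed value is an arithmetic shift, which holds on mainstream compilers (and is guaranteed from C++20).

```cpp
#include <cstdint>

// Signed division by 10 via multiplication with the magic constant 0x66666667.
std::int32_t div10_magic(std::int32_t value) {
    // Full 64-bit product; its upper half is the "overflow" that lands in edx.
    const std::int64_t product = static_cast<std::int64_t>(value) * 0x66666667LL;
    const std::int32_t high = static_cast<std::int32_t>(product >> 32);

    // Arithmetic shift: keeps the sign of the intermediate result.
    const std::int32_t shifted = high >> 2;

    // Logical shift of the sign bit: adds 1 for negative values so the
    // quotient rounds toward zero, matching the built-in / operator.
    return shifted + static_cast<std::int32_t>(
                         static_cast<std::uint32_t>(shifted) >> 31);
}
```

For example, div10_magic(-19) yields -1 and div10_magic(12345) yields 1234, the same as the built-in operator.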

Knowing that your memory is aligned, or that your code can be vectorized, can be very beneficial. Do note that this requires you to store your data in a way that makes this possible.

For example, if you want to calculate the sum of x / 10 over all integers, you can do something like this:

    __m256i ctr = _mm256_set_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    ctr = _mm256_add_epi32(_mm256_set1_epi32(INT32_MIN), ctr);

    __m256i sumdiv = _mm256_set1_epi32(0);
    const __m256i magic = _mm256_set1_epi32(0x66666667);
    const int shift = 2;

    // Show that this is correct:
    for (long long int i = INT32_MIN; i <= INT32_MAX; i += 8)
    {
        // Compute the overflow values
        __m256i ovf1 = _mm256_srli_epi64(_mm256_mul_epi32(ctr, magic), 32);
        __m256i ovf2 = _mm256_mul_epi32(_mm256_srli_epi64(ctr, 32), magic);

        // blend the overflows together again
        __m256i rem = _mm256_srai_epi32(_mm256_blend_epi32(ovf1, ovf2, 0xAA), shift);

        // calculate the div value
        __m256i div = _mm256_add_epi32(rem, _mm256_srli_epi32(rem, 31));

        // do something with the result; increment the counter
        sumdiv = _mm256_add_epi32(sumdiv, div);
        ctr = _mm256_add_epi32(ctr, _mm256_set1_epi32(8));
    }

    int sum = 0;
    for (int i = 0; i < 8; ++i) { sum += sumdiv.m256i_i32[i]; }
    std::cout << sum << std::endl;

If you benchmark both implementations, you will find that on an Intel Haswell processor, you'll get these results:

  • idiv: 1.4 GB/s
  • compiler optimized: 4 GB/s
  • AVX2 instructions: 16 GB/s

For other powers of 10 and unsigned division, I recommend reading the paper.

In fact you don't need to do anything. The compiler is smart enough to optimize multiplications/divisions with constants. You can find many examples here

You can even do a fast divide by 5, then shift right by 1.
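A minimal sketch of that idea for 32-bit unsigned values follows. The helper names are invented for illustration, and the fast divide by 5 is written as a multiply by 0xCCCCCCCD (ceil(2^34 / 5)) followed by a shift of 34, which is one common way to do it and not necessarily what the poster had in mind.

```cpp
#include <cstdint>

// Fast unsigned divide by 5: multiply by ceil(2^34 / 5), shift right by 34.
constexpr std::uint32_t div5_fast(std::uint32_t n) {
    return static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(n) * 0xCCCCCCCDu) >> 34);
}

// Divide by 10 as "divide by 5, then halve".
constexpr std::uint32_t div10_via_div5(std::uint32_t n) {
    return div5_fast(n) >> 1;
}
```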

If the divisor is an explicit compile-time constant (i.e. if the x in 10^x is a compile-time constant), there's absolutely no point in using anything other than the language-provided / and % operators. If there is a meaningful way to speed them up for explicit powers of 10, any self-respecting compiler will know how to do that and will do it for you.

The only situation in which you might think about a "custom" implementation (aside from a dumb compiler) is when x is a run-time value. In that case you'd need some kind of decimal-shift and decimal-and analogue. On a binary machine, a speedup is probably possible, but I doubt that you'll be able to achieve anything practically meaningful. (If the numbers were stored in binary-decimal format, then it would be easy, but in "normal" cases, no.)
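One way to approach the run-time case, sketched under stated assumptions: dispatch on x so that every branch divides by a literal constant, which lets the compiler apply its constant-division trick per case. The function name div_pow10 is hypothetical, and whether this beats a plain division by a table entry depends on the target and on branch prediction, so measure before adopting it.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: n / 10^x for a run-time x in [0, 19].
std::uint64_t div_pow10(std::uint64_t n, unsigned x) {
    assert(x <= 19 && "10^x must fit in 64 bits");
    switch (x) {
        case 0:  return n;
        case 1:  return n / 10ULL;
        case 2:  return n / 100ULL;
        case 3:  return n / 1000ULL;
        case 4:  return n / 10000ULL;
        case 5:  return n / 100000ULL;
        case 6:  return n / 1000000ULL;
        case 7:  return n / 10000000ULL;
        case 8:  return n / 100000000ULL;
        case 9:  return n / 1000000000ULL;
        case 10: return n / 10000000000ULL;
        case 11: return n / 100000000000ULL;
        case 12: return n / 1000000000000ULL;
        case 13: return n / 10000000000000ULL;
        case 14: return n / 100000000000000ULL;
        case 15: return n / 1000000000000000ULL;
        case 16: return n / 10000000000000000ULL;
        case 17: return n / 100000000000000000ULL;
        case 18: return n / 1000000000000000000ULL;
        case 19: return n / 10000000000000000000ULL;
    }
    return 0;  // unreachable for valid x
}
```

Usage: div_pow10(12345, 2) gives 123. A matching mod helper would follow the same pattern with % in place of /.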

If your runtime is genuinely dominated by 10^x-related operations, you could just use a base 10 integer representation in the first place.

In most situations, I'd expect the slowdown of all other integer operations (and the reduced precision or potentially extra memory use) to count for more than the faster 10^x operations.

