What algorithm should I use for high-performance large integer division?

I am encoding large integers into an array of size_t. I already have the other operations working (add, subtract, multiply), as well as division by a single digit. But I would like to match the time complexity of my multiplication algorithms if possible (currently Toom-Cook).

I gather there are linear time algorithms for taking various notions of multiplicative inverse of my divisor. This means I could theoretically achieve division in the same time complexity as my multiplication, because the linear-time operation is "insignificant" by comparison anyway.

My question is, how do I actually do that? What type of multiplicative inverse is best in practice? Modulo 64^digitcount? When I multiply the multiplicative inverse by my dividend, can I skip computing the part of the data that would be thrown away due to integer truncation? Can anyone provide C or C++ pseudocode or give a precise explanation of how this should be done?

Or is there a dedicated division algorithm that is even better than the inverse-based approach?

Edit: I dug up where I was getting the "inverse" approach mentioned above. On page 312 of "Art of Computer Programming, Volume 2: Seminumerical Algorithms", Knuth provides "Algorithm R", which is a high-precision reciprocal. He says its time complexity is less than that of multiplication. It is, however, nontrivial to convert it to C and test it out, and it is unclear how much overhead memory, etc., will be consumed until I code this up, which would take a while. I'll post it if no one beats me to it.

The GMP library is usually a good reference for good algorithms. Their documented algorithms for division mainly depend on choosing a very large base, so that you're dividing a 4-digit number by a 2-digit number, and then proceeding via long division.

Long division will require computing 2-digit by 1-digit quotients; this can be done either recursively, or by precomputing an inverse and estimating the quotient as you would with Barrett reduction.

When dividing a 2n-bit number by an n-bit number, the recursive version costs O(M(n) log(n)), where M(n) is the cost of multiplying n-bit numbers.

The version using Barrett reduction will cost O(M(n)) if you use Newton's algorithm to compute the inverse, but according to GMP's documentation, the hidden constant is a lot larger, so this method is only preferable for very large divisions.
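To make the Newton step concrete, here is a minimal word-sized sketch, not GMP's implementation. It computes m = floor(2^32 / y) for a 32-bit y using only multiplications, shifts and subtractions, followed by a small fix-up. The function name newton_reciprocal and the choice k = 32 are assumptions made for the illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: returns floor(2^32 / y) for 0 < y < 2^32, using the
// integer Newton iteration m <- m + m*(2^32 - y*m) / 2^32.
std::uint64_t newton_reciprocal(std::uint32_t y) {
    assert(y != 0);
    const std::uint64_t one = std::uint64_t{1} << 32;  // 2^k with k = 32

    // Bit length b of y, so that 2^(b-1) <= y < 2^b.
    int b = 0;
    for (std::uint32_t t = y; t != 0; t >>= 1) ++b;

    // The starting guess 2^(k-b) is below 2^k / y and within a factor of two.
    std::uint64_t m = one >> b;

    // Each step roughly squares the relative error. Starting from below, the
    // truncated iterates never exceed 2^k / y, so the residual stays >= 0.
    for (;;) {
        const std::uint64_t residual = one - y * m;
        const std::uint64_t delta = (m * residual) >> 32;
        if (delta == 0) break;
        m += delta;
    }

    // Truncation can leave m slightly short of the exact floor; fix that up.
    while ((m + 1) * y <= one) ++m;
    return m;
}
```

In a real bignum version the working precision would be doubled at each step, so that the final full-size multiplications dominate the cost; that is where the O(M(n)) bound comes from.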


In more detail, the core algorithm behind most division algorithms is an "estimated quotient with reduction" calculation, computing (q,r) so that

x = qy + r

but without the restriction that 0 <= r < y. The typical loop is

  • Estimate the quotient q of x/y
  • Compute the corresponding reduction r = x - qy
  • Optionally adjust the quotient so that the reduction r is in some desired interval
  • If r is too big, then repeat with r in place of x.

The quotient of x/y will be the sum of all the qs produced, and the final value of r will be the true remainder.

Schoolbook long division, for example, is of this form. For example, step 3 covers those cases where the digit you guessed was too big or too small, and you adjust it to get the right value.

The divide and conquer approach estimates the quotient of x/y by computing x'/y', where x' and y' are the leading digits of x and y. There is a lot of room for optimization by adjusting their sizes, but IIRC you get best results if x' has twice as many digits as y'.
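As a rough illustration of the estimate / reduce / repeat loop with a quotient estimated from leading digits, here is a sketch on 64-bit machine words. The name divide_by_estimates, the 16-bit window and the choice to always underestimate (so r never goes negative) are assumptions of the sketch, not something prescribed above.

```cpp
#include <cassert>
#include <cstdint>

struct QuotRem64 { std::uint64_t q, r; };

// Hypothetical illustration: x = q*y + r with 0 <= r < y, obtained by summing
// quotient estimates that only look at the leading bits of y.
QuotRem64 divide_by_estimates(std::uint64_t x, std::uint64_t y) {
    assert(y != 0);

    // Keep only the leading 16 bits of y (an arbitrary size for this sketch).
    int sh = 0;
    while ((y >> sh) >= (std::uint64_t{1} << 16)) ++sh;
    const std::uint64_t y_lead = y >> sh;

    std::uint64_t q = 0, r = x;
    while (r >= y) {
        // Dividing the leading part of r by y_lead + 1 rounds the estimate
        // down, so it never exceeds the true quotient of r / y.
        std::uint64_t q_est = (r >> sh) / (y_lead + 1);
        if (q_est == 0) q_est = 1;  // r >= y, so one copy of y always fits
        q += q_est;                 // the final quotient is the sum of estimates
        r -= q_est * y;             // reduction step; r stays non-negative
    }
    return {q, r};
}
```

For x = 1000 and y = 7 the estimates are 125, 15 and 2, which sum to the quotient 142 and leave the remainder 6.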

The multiply-by-inverse approach is, IMO, the simplest if you stick to integer arithmetic. The basic method is

  • Estimate the inverse of y with m = floor(2^k / y)
  • Estimate x/y with q = 2^(i+j-k) floor(floor(x / 2^i) m / 2^j)

In fact, practical implementations can tolerate additional error in m if it means you can use a faster reciprocal implementation.

The error is a pain to analyze, but if I recall the way to do it, you want to choose i and j so that x ~ 2^(i+j) due to how errors accumulate, and you want to choose x / 2^i ~ m^2 to minimize the overall work.

The ensuing reduction will have r ~ max(x/m, y), so that gives a rule of thumb for choosing k: you want the size of m to be about the number of bits of quotient you compute per iteration, or equivalently the number of bits you want to remove from x per iteration.
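A word-sized sketch of the multiply-by-inverse method, assuming the simplest parameter choice i = 0 and j = k = 32 (so the scale factor 2^(i+j-k) is 1). The names reciprocal_2_32 and divide_with_reciprocal are made up for the illustration; the plain division used to obtain m stands in for whatever reciprocal routine you actually use.

```cpp
#include <cassert>
#include <cstdint>

struct QuotRem32 { std::uint32_t q, r; };

// Hypothetical helper: m = floor(2^32 / y), computed once per divisor.
std::uint64_t reciprocal_2_32(std::uint32_t y) {
    assert(y != 0);
    return (std::uint64_t{1} << 32) / y;
}

// Estimates q = floor(x * m / 2^32), then performs a single correction step.
QuotRem32 divide_with_reciprocal(std::uint32_t x, std::uint32_t y, std::uint64_t m) {
    // Since m <= 2^32 / y and x < 2^32, the estimate is either the true
    // quotient or one less than it, and x * m fits in 64 bits.
    std::uint64_t q = (x * m) >> 32;
    std::uint64_t r = x - q * y;        // reduction, 0 <= r < 2*y
    if (r >= y) { ++q; r -= y; }        // adjust the quotient if it was one short
    return {static_cast<std::uint32_t>(q), static_cast<std::uint32_t>(r)};
}
```

The reciprocal only pays off when it is reused, for example across the iterations of a long division by the same y, or when it can be computed cheaply with Newton's method.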

I do not know the multiplicative inverse algorithm, but it sounds like a modification of Montgomery reduction or Barrett reduction.

I do bigint divisions a bit differently.

See bignum division. Especially take a look at the approximation divider and the 2 links there. One is my fixed point divider and the others are fast multiplication algorithms (like Karatsuba and Schönhage-Strassen on NTT) with measurements, and a link to my very fast NTT implementation for a 32-bit base.

I'm not sure if multiplying by the inverse is the way to go.

It is mostly used for modulo operations where the divisor is constant. I'm afraid that for arbitrary divisions the time and operations needed to acquire the bigint inverse can be bigger than the standard division itself, but as I am not familiar with it I could be wrong.

The most common divider I have seen in implementations is Newton-Raphson division, which is very similar to the approximation divider in the link above.

Approximation/iterative dividers usually use multiplication, which determines their speed.

For small enough numbers, long binary division and 32/64-bit digit base division are usually fast enough, if not the fastest, because they usually have small overhead. Let n be the max value processed (not the number of digits!).

Binary division example:

It is O(log32(n).log2(n)) = O(log^2(n)).
It loops through all significant bits. In each iteration you need to compare, sub, add, bitshift. Each of those operations can be done in log32(n) (which here stands for the number of 32-bit words, N below), and log2(n) is the number of bits.

Here is an example of binary division from one of my bigint templates (C++):

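template <DWORD N> void uint<N>::div(uint &c,uint &d,uint a,uint b)
    {
    // c = a / b (quotient), d = a % b (remainder); no check for b == 0
    int j,sh;
    c=DWORD(0); d=1;                        // c accumulates the quotient, d marks the current quotient bit
    sh=a.bits()-b.bits();                   // shift needed to align the top bit of b with the top bit of a
    if (sh<0) sh=0; else { b<<=sh; d<<=sh; }
    for (;;)
        {
        j=geq(a,b);                         // 0: a<b, 1: a>b, 2: a==b
        if (j)
            {
            c+=d;                           // set this quotient bit
            sub(a,a,b);                     // a -= b
            if (j==2) break;                // remainder is zero, nothing left to do
            }
        if (!sh) break;                     // b is back at its original position
        b>>=1; d>>=1; sh--;                 // move on to the next lower bit
        }
    d=a;                                    // what is left of a is the remainder
    }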

N is the number of 32-bit DWORDs used to store a bigint number.

  • c = a / b
  • d = a % b
  • geq(a,b) is a comparison: a >= b, greater or equal (done in log32(n)=N)
    It returns 0 for a < b, 1 for a > b, 2 for a == b
  • sub(c,a,b) is c = a - b

The speed boost comes from the fact that this does not use multiplication (if you do not count the bit shift).
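For anyone who wants to step through the same shift-and-subtract scheme without the bigint template, here is a self-contained sketch on plain 64-bit words. The names binary_divide, bit_length and DivResult are placeholders for the illustration; the built-in >= and -= take the place of geq and sub.

```cpp
#include <cassert>
#include <cstdint>

struct DivResult { std::uint64_t quot, rem; };

// Number of significant bits in v (0 for v == 0).
static int bit_length(std::uint64_t v) {
    int n = 0;
    while (v != 0) { ++n; v >>= 1; }
    return n;
}

// Binary long division: one compare and at most one subtract per quotient bit.
DivResult binary_divide(std::uint64_t a, std::uint64_t b) {
    assert(b != 0);
    std::uint64_t c = 0;   // quotient accumulator
    std::uint64_t d = 1;   // value of the quotient bit currently being tested
    int sh = bit_length(a) - bit_length(b);
    if (sh < 0) sh = 0;
    else { b <<= sh; d <<= sh; }   // align the top bit of b with the top bit of a
    for (;;) {
        if (a >= b) { c += d; a -= b; }
        if (sh == 0) break;
        b >>= 1; d >>= 1; --sh;
    }
    return {c, a};         // what is left of a is the remainder
}
```

For example, binary_divide(100, 7) walks through the shifted divisors 112, 56, 28, 14, 7 and returns quotient 14 with remainder 2.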

If you use digits with a big base like 2^32 (ALU blocks), then you can rewrite the whole thing in a polynomial-like style using the 32-bit built-in ALU operations.
This is usually even faster than binary long division. The idea is to process each DWORD as a single digit, or to recursively halve the bit width of the arithmetic until it matches the CPU capabilities.
See division by half-bitwidth arithmetics.
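As a small illustration of treating each DWORD as one digit, here is a sketch of the simplest case, dividing a base-2^32 number by a single digit with a 64-bit intermediate. The little-endian std::vector layout and the name divide_by_digit are assumptions of the sketch; the multi-digit divisor case needs a quotient-digit estimate on top of this.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Divides a little-endian base-2^32 number in place by one 32-bit digit and
// returns the remainder. Each step is a single 64-by-32-bit hardware division.
std::uint32_t divide_by_digit(std::vector<std::uint32_t>& limbs, std::uint32_t divisor) {
    assert(divisor != 0);
    std::uint64_t rem = 0;
    for (std::size_t i = limbs.size(); i-- > 0;) {      // most significant digit first
        const std::uint64_t cur = (rem << 32) | limbs[i];
        limbs[i] = static_cast<std::uint32_t>(cur / divisor);  // fits: rem < divisor
        rem = cur % divisor;
    }
    return static_cast<std::uint32_t>(rem);
}
```

Dividing the two-digit number {0, 1} (that is, 2^32) by 3 leaves {1431655765, 0} in the vector and returns the remainder 1.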

On top of all that, while computing with bignums:

If you have optimized basic operations, then the complexity can drop even further, because sub-results get smaller with each iteration (which changes the cost of the basic operations). NTT-based multiplications are a nice example of that.

The overhead can mess things up.

Because of this, the runtime sometimes does not follow the big-O complexity, so you should always measure the thresholds and use the faster approach for the bit count in use to get the max performance, and optimize what you can.
