Speeding up modulo operations in CPython

Question

This is a Park-Miller pseudo-random number generator:

def gen1(a=783):
    while True:
        a = (a * 48271) % 0x7fffffff
        yield a

The 783 is just an arbitrary seed. The 48271 is the coefficient recommended by Park and Miller in the original paper (PDF: Park, Stephen K.; Miller, Keith W. (1988). "Random Number Generators: Good Ones Are Hard To Find").

I would like to improve the performance of this LCG. The literature describes a way to avoid the division using bitwise tricks (source):

A prime modulus requires the computation of a double-width product and an explicit reduction step. If a modulus just less than a power of 2 is used (the Mersenne primes 2³¹ − 1 and 2⁶¹ − 1 are popular, as are 2³² − 5 and 2⁶⁴ − 59), reduction modulo m = 2^e − d can be implemented more cheaply than a general double-width division using the identity 2^e ≡ d (mod m).

Noting that the modulus 0x7fffffff is actually the Mersenne prime 2**31 - 1, here is the idea implemented in Python:

def gen2(a=783):
    while True:
        a *= 48271
        a = (a & 0x7fffffff) + (a >> 31)
        a = (a & 0x7fffffff) + (a >> 31)
        yield a
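
The two folding steps rely on 2**31 ≡ 1 (mod 2**31 - 1): writing the product as hi * 2**31 + lo, it is congruent to hi + lo. The following is a minimal sketch that checks, for a limited number of terms, that the folded version reproduces the modulo version. The name `M` and the choice of 100,000 terms are illustrative assumptions, not part of the original code.

```python
from itertools import islice

M = 0x7fffffff  # 2**31 - 1 (illustrative name for the modulus)

def gen1(a=783):
    while True:
        a = (a * 48271) % M
        yield a

def gen2(a=783):
    while True:
        a *= 48271
        a = (a & M) + (a >> 31)  # first fold: the sum may still need 32 bits
        a = (a & M) + (a >> 31)  # second fold: the sum now fits in 31 bits
        yield a

# Compare the first 100,000 terms of both generators pairwise.
same = all(x == y for x, y in islice(zip(gen1(), gen2()), 100_000))
print(same)
```

Two folds are needed because the first sum can exceed 31 bits by a small carry, which the second fold absorbs.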

Basic benchmark script:

import time, sys

g1 = gen1()
g2 = gen2()

for g in g1, g2:
    t0 = time.perf_counter()
    for i in range(int(sys.argv[1])): next(g)
    print(g.__name__, time.perf_counter() - t0)

The performance is improved in PyPy (7.3.0 @ 3.6.9), for example when generating 100 M terms:

$ pypy lcg.py 100000000
gen1 0.4366550260456279
gen2 0.3180829349439591

Unfortunately, the performance is actually degraded in CPython (3.9.0 / Linux):

$ python3 lcg.py 100000000
gen1 20.650125587941147
gen2 26.844335232977755

My questions:

Why is the bitwise arithmetic, usually touted as an optimization, actually even slower than a modulo operation in CPython?
Can you improve the performance of this PRNG under CPython some other way, perhaps using numpy or ctypes?

Note that arbitrary precision integers are not necessarily required here, because this generator will never yield a number with more bits than:

>>> 0x7fffffff.bit_length()
31
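
The yielded values fit in 31 bits, but the intermediate product `a * 48271` is wider. As a small sketch (the names `M`, `A`, `largest_state` and `largest_product` are illustrative), the bound on that product can be checked directly. It suggests that any fixed-width approach (numpy, ctypes, Cython) would need a 64-bit intermediate rather than a 32-bit one.

```python
M = 0x7fffffff  # modulus, 2**31 - 1
A = 48271       # multiplier

largest_state = M - 1                 # largest value the generator can yield
largest_product = largest_state * A   # widest intermediate value before reduction

print(largest_product.bit_length())

assert largest_product >= 2**31  # too wide for a signed 32-bit integer
assert largest_product < 2**63   # fits in a signed 64-bit integer
```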

Answer 1

My guess is that in the CPython version the lion's share of the time is spent on overhead (interpreter, dynamic dispatch) rather than on the actual arithmetic operations. So adding more steps (i.e. more overhead) doesn't help much.

PyPy's running times look more like what is needed for 10^8 modulo operations on C integers, so it is probably able to use its JIT, which has little overhead and thus lets the speed-up of the arithmetic operations show.
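
One rough way to see the extra interpreter work is to count the bytecode instructions of both generator functions with the standard `dis` module. This is only a sketch: the exact counts depend on the CPython version, and instruction count is a proxy for dispatch overhead, not a timing.

```python
import dis

def gen1(a=783):
    while True:
        a = (a * 48271) % 0x7fffffff
        yield a

def gen2(a=783):
    while True:
        a *= 48271
        a = (a & 0x7fffffff) + (a >> 31)
        a = (a & 0x7fffffff) + (a >> 31)
        yield a

# Count the bytecode instructions the interpreter has to dispatch.
n1 = sum(1 for _ in dis.get_instructions(gen1))
n2 = sum(1 for _ in dis.get_instructions(gen2))
print("gen1 bytecode instructions:", n1)
print("gen2 bytecode instructions:", n2)
```

Every one of those instructions goes through the interpreter loop and operates on Python int objects, so replacing one modulo with several bitwise operations adds dispatches instead of removing work.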

A possible way to reduce the overhead is to use Cython (here is an investigation of mine into how Cython can help to reduce interpreter and dispatch overhead), which works out of the box for generators:

%%cython
def gen_cy1(int a=783):
    while True:
        a = (a * 48271) % 0x7fffffff
        yield a
        
def gen_cy2(int a=783):
    while True:
        a *= 48271
        a = (a & 0x7fffffff) + (a >> 31)
        a = (a & 0x7fffffff) + (a >> 31)
        yield a

I use the following function for testing:

def run(gen,N):
    for i in range(N): next(gen)

and the tests show:

N=10**6
%timeit run(gen1(),N)   #  246 ms
%timeit run(gen2(),N)   #  387 ms
%timeit run(gen_cy1(),N)   # 114 ms
%timeit run(gen_cy2(),N)   # 107 ms
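
The `%timeit` lines above are IPython magics. For readers without IPython, a comparable measurement of the two pure-Python generators can be sketched with the standard `timeit` module. The repeat count and the smaller `N` are arbitrary choices for illustration, and the Cython versions are left out because they need a compilation step.

```python
import timeit

def gen1(a=783):
    while True:
        a = (a * 48271) % 0x7fffffff
        yield a

def gen2(a=783):
    while True:
        a *= 48271
        a = (a & 0x7fffffff) + (a >> 31)
        a = (a & 0x7fffffff) + (a >> 31)
        yield a

def run(gen, n):
    for _ in range(n):
        next(gen)

N = 10**5
results = {}
for make_gen in (gen1, gen2):
    # Best of 3 repeats; each repeat consumes N terms from a fresh generator.
    best = min(timeit.repeat(lambda: run(make_gen(), N), number=1, repeat=3))
    results[make_gen.__name__] = best
    print(f"{make_gen.__name__}: {best * 1e3:.1f} ms for {N} terms")
```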

Both Cython versions are about equally fast (and roughly twice as fast as the original), because the extra operations add hardly any overhead: the arithmetic is done with C ints and no longer with Python ints.

However, if one is really serious about getting the best performance, using a generator is a killer, as it means a lot of overhead (see for example this SO post).

Just to give a feeling for what is possible if Python generators aren't used, functions that generate all the numbers (but don't convert them to Python objects, and thus avoid that overhead):

%%cython
def gen_last_cy1(int n, int a=783):
    cdef int i
    for i in range(n):
        a = (a * 48271) % 0x7fffffff
    return a

def gen_last_cy2(int n, int a=783):
    cdef int i
    for i in range(n):
        a *= 48271
        a = (a & 0x7fffffff) + (a >> 31)
        a = (a & 0x7fffffff) + (a >> 31)
    return a

lead to the following timings:

N=10**6
%timeit gen_last_cy1(N)  # 7.21 ms
%timeit gen_last_cy2(N)  # 2.59 ms

That means more than 90% of the running time could be saved if generators aren't used!
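
The same idea of leaving the generator protocol out can be sketched in plain Python. The helper names `last_term` and `batch` are hypothetical, and no timing is claimed: the arithmetic still runs on Python ints, so any gain should be expected to be far smaller than with the Cython functions above.

```python
from itertools import islice

M = 0x7fffffff
A = 48271

def gen1(a=783):
    while True:
        a = (a * A) % M
        yield a

def last_term(n, a=783):
    """Return the n-th term without creating a generator."""
    for _ in range(n):
        a = (a * A) % M
    return a

def batch(n, a=783):
    """Return the first n terms as a list, built in a single call."""
    out = []
    append = out.append  # avoid repeated attribute lookup inside the loop
    for _ in range(n):
        a = (a * A) % M
        append(a)
    return out

terms = batch(1000)
print(terms[:3], last_term(1000) == terms[-1])
```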

I was slightly surprised that the tweaked second version outperformed the original first one. Normally, C compilers won't perform the modulo operation directly but use bit tricks themselves, if possible. But here the C compiler's tricks are inferior, at least on my machine.

The assembler (live on godbolt.org) generated by gcc (-O2) for the original version:

        imull   $48271, %edi, %edi
        movslq  %edi, %rdx
        movq    %rdx, %rax
        salq    $30, %rax
        addq    %rdx, %rax
        movl    %edi, %edx
        sarl    $31, %edx
        sarq    $61, %rax
        subl    %edx, %eax
        movl    %eax, %edx
        sall    $31, %edx
        subl    %eax, %edx
        movl    %edi, %eax
        subl    %edx, %eax

As one can see, there is no div.
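
To make the listing above easier to follow, here is a sketch that mirrors its steps in Python for a signed 32-bit input `x` (the value left in `%edi` after the multiplication). The function names are made up for illustration, and the mapping of lines to instructions is my reading of the listing: the quotient is obtained by multiplying with 2**30 + 1 and shifting right by 61 bits instead of dividing.

```python
M = 0x7fffffff  # 2**31 - 1

def c_rem(x, m=M):
    """C-style remainder: truncated division, result has the sign of x."""
    r = abs(x) % m
    return -r if x < 0 else r

def rem_without_div(x):
    """Mirror of the listing for a signed 32-bit x; uses no division."""
    q = (x * ((1 << 30) + 1)) >> 61  # salq $30 / addq / sarq $61
    q -= x >> 31                     # sarl $31 / subl: adds 1 when x is negative
    return x - ((q << 31) - q)       # sall $31 / subl / subl: x - q * (2**31 - 1)

samples = [0, 1, M - 1, M, -1, -M, -(M - 1), -2**31, 123456789, -987654321]
ok = all(rem_without_div(x) == c_rem(x) for x in samples)
print(ok)
```

The constant works because (2**30 + 1) / 2**61 is a slightly high approximation of 1 / (2**31 - 1) that still rounds down to the correct quotient over the 32-bit range.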

And here is the assembler for the second version (with far fewer operations):

        imull   $48271, %edi, %eax
        movl    %eax, %edx
        sarl    $31, %eax
        andl    $2147483647, %edx
        addl    %edx, %eax
        movl    %eax, %edx
        sarl    $31, %eax
        andl    $2147483647, %edx
        addl    %edx, %eax

Clearly, fewer operations don't always mean faster code, but in this case they seem to.

Solution 1
1 accepted 2021-01-27 05:14:13