Python: Compiling regexes in parallel

I have a program where I need to compile several thousand large regexes, all of which will be used many times. Problem is, it takes too long (according to cProfiler, 113 secs) to re.compile() them. (BTW, actually searching using all of these regexes takes < 1.3 secs once compiled.)

If I don't precompile, it just postpones the problem to when I actually search, since re.search(expr, text) implicitly compiles expr. Actually, it's worse, because re is going to recompile the entire list of regexes every time I use them.
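
To make that concrete, here is a small illustration of the hidden recompilation. re's internal cache only holds re._MAXCACHE entries (a CPython implementation detail: 100 on 2.7, 512 on recent 3.x), so with thousands of distinct patterns it thrashes:

import re

patterns = ["%d+suffix" % i for i in range(1000)]

# Each re.search() compiles its pattern through re's internal cache,
# which holds only re._MAXCACHE entries. With 1000 distinct patterns
# the early entries are evicted before the second pass starts, so
# every pattern is compiled again on every pass.
for _ in range(2):
    for p in patterns:
        re.search(p, "123suffix")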

I tried using multiprocessing, but that actually slows things down. Here's a small test to demonstrate:

## rgxparallel.py ##
import re
import multiprocessing as mp

def serial_compile(strings):
    return [re.compile(s) for s in strings]

def parallel_compile(strings):
    print("Using {} processors.".format(mp.cpu_count()))
    pool = mp.Pool()
    # Each compiled pattern is pickled in a worker and sent back
    # to the parent process.
    result = pool.map(re.compile, strings)
    pool.close()
    return result

# 100,000 trivial test patterns (Python 2: map returns a list)
l = map(str, xrange(100000))

And my test script:

#!/bin/sh
python -m timeit -n 1 -s "import rgxparallel as r" "r.serial_compile(r.l)"
python -m timeit -n 1 -s "import rgxparallel as r" "r.parallel_compile(r.l)"
# Output:
#   1 loops, best of 3: 6.49 sec per loop
#   Using 4 processors.
#   Using 4 processors.
#   Using 4 processors.
#   1 loops, best of 3: 9.81 sec per loop

I'm guessing that the parallel version is:

  1. In parallel: compiling and pickling the regexes, ~2 secs
  2. In serial: un-pickling, and therefore recompiling them all, ~6.5 secs

Together with the overhead of starting and stopping the processes, multiprocessing on 4 processors is more than 25% slower than serial.
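
That un-pickling step really is a recompilation: a compiled pattern pickles as just its (pattern, flags) pair, and pickle.loads() rebuilds it by compiling the pattern string again. A minimal demonstration:

import pickle
import re

p = re.compile(r"\d+")
data = pickle.dumps(p)

# The pickle payload contains only the pattern string and its flags;
# loads() reconstructs the object by compiling that string again,
# which is the serial recompilation cost in the parent process.
q = pickle.loads(data)
assert q.pattern == p.pattern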

I also tried divvying up the list of regexes into 4 sub-lists, and pool.map-ing the sublists, rather than the individual expressions. This gave a small performance boost, but I still couldn't get better than ~25% slower than serial.
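
A sketch of that chunked variant, building on rgxparallel.py above:

## additions to rgxparallel.py ##
def chunked_parallel_compile(strings, n_chunks=4):
    # Compile whole sub-lists per worker so that only n_chunks pickled
    # results cross the process boundary instead of one per pattern.
    chunks = [strings[i::n_chunks] for i in range(n_chunks)]
    pool = mp.Pool(n_chunks)
    try:
        results = pool.map(serial_compile, chunks)
    finally:
        pool.close()
        pool.join()
    # Each compiled pattern is still pickled as its (pattern, flags)
    # pair, so the parent recompiles all of them during un-pickling,
    # which is why the speedup stays small.
    return [p for chunk in results for p in chunk]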

Is there any way to compile faster than serial?

EDIT: Corrected the running time of the regex compilation.

I also tried using threading, but due to the GIL, only one processor was used. It was slightly better than multiprocessing (130 secs vs. 136 secs), but still slower than serial (113 secs).

EDIT 2: I realized that some regexes were likely to be duplicated, so I added a dict for caching them. This shaved off ~30 secs. I'm still interested in parallelizing, though. The target machine has 8 processors, which would reduce compilation time to ~15 secs.
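
A minimal sketch of such a cache, keyed on the (pattern, flags) pair:

import re

_compiled = {}

def compile_cached(pattern, flags=0):
    # Dedup cache (sketch): each distinct (pattern, flags) pair is
    # compiled exactly once and the result is reused afterwards.
    key = (pattern, flags)
    try:
        return _compiled[key]
    except KeyError:
        _compiled[key] = re.compile(pattern, flags)
        return _compiled[key]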

There are lighter solutions than multiprocessing for getting asynchronous task execution, like threads and coroutines. Though python2 is limited in its ability to run things simultaneously, python3 largely uses such asynchronous implementations within its fundamental types. Just run your code with python3 and you will see the difference:

$ python2 --version
Python 2.7.17
$ python2 -m timeit -n 1 -s "import rgxparallel as r" "r.serial_compile(r.l)"
1 loops, best of 3: 3.65 sec per loop
$ python -m timeit -n 1 -s "import multire as r" "r.parallel_compile(r.l)"
1 loops, best of 3: 3.95 sec per loop

$ python3 --version
Python 3.6.9
$ python3 -m timeit -n 1 -s "import multire as r" "r.serial_compile(r.l)"
1 loops, best of 3: 0.72 usec per loop
$ python3 -m timeit -n 1 -s "import multire as r" "r.parallel_compile(r.l)"
...
1 loops, best of 3: 4.51 msec per loop

Do not forget to change xrange to range for the Python 3 version.
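
One porting caveat: in Python 3, map() returns a lazy iterator, so the module-level test data needs to be materialized with list(), otherwise it is exhausted after the first timing run and later runs see an empty sequence:

# Python 3 version of the test data in rgxparallel.py: wrap the lazy
# map() iterator in list() so it can be iterated more than once.
l = list(map(str, range(100000)))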

As much as I love python, I think the solution is: do it in perl (see this speed comparison, for example), or C, etc.

If you want to keep the main program in python, you could use subprocess to call a perl script (just make sure to pass as many values as possible in as few subprocess calls as possible, to avoid overhead).
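
A minimal sketch of that approach, assuming a hypothetical matcher.pl that reads one pattern per line on stdin and prints one result line per pattern:

import subprocess

def match_with_perl(patterns, text):
    # Hypothetical helper: 'matcher.pl' is assumed to read one pattern
    # per line on stdin, apply each to its command-line argument, and
    # print one result line per pattern. Batching every pattern into a
    # single call keeps the fork/exec overhead to one subprocess.
    proc = subprocess.run(
        ["perl", "matcher.pl", text],
        input="\n".join(patterns),
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout.splitlines()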
