Why is regex search slower with capturing groups in Python?

I have application code that generates regexes dynamically from a config for some parsing. When timing two variations, the variant in which each alternative of an OR regex is captured is noticeably slower than the plain regex. I assume the reason is the overhead of certain operations inside the re module.

>>> import timeit
>>> setup = '''
... import re
... '''   

# no capture groups
>>> print(timeit.timeit("re.search(r'hello|bye|ola|cheers','some say hello,some say bye, or ola or cheers!')", setup=setup))
0.922958850861

# with capture groups
>>> print(timeit.timeit("re.search(r'(hello)|(bye)|(ola)|(cheers)','some say hello,some say bye, or ola or cheers!')", setup=setup))
1.44321084023

# no capture groups (second run)
>>> print(timeit.timeit("re.search(r'hello|bye|ola|cheers','some say hello,some say bye, or ola or cheers!')", setup=setup))
0.913202047348

# with capture groups (second run)
>>> print(timeit.timeit("re.search(r'(hello)|(bye)|(ola)|(cheers)','some say hello,some say bye, or ola or cheers!')", setup=setup))
1.41544604301

Question: What causes this considerable drop in performance when using capture groups?

The reason is pretty simple: a capturing group tells the engine to save the matched content in memory, while a non-capturing group tells it not to save anything. In other words, with capturing groups you are asking the engine to perform more operations.

For instance, a regex like (hello|bye|ola|cheers) or (hello)|(bye)|(ola)|(cheers) costs considerably more than a non-capturing group such as (?:hello|bye|ola|cheers), or an atomic group where the engine supports one.

When writing a regex you know whether you need to capture content, as in the case above. If you want to capture any of those words, you pay for it in performance; if you don't need the captured content, you can recover that performance by using non-capturing groups.
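A minimal sketch for checking this on your own machine, reusing the text and alternation from the question. The names TEXT, PATTERNS and time_patterns are made up for illustration, and the patterns are precompiled here (unlike the question's timings) so that only the search itself is measured. Absolute numbers will vary with Python version and hardware, so no expected timings are shown.

```python
import re
import timeit

TEXT = "some say hello,some say bye, or ola or cheers!"

# Three variants of the same alternation; the labels are illustrative only.
PATTERNS = {
    "no group": re.compile(r"hello|bye|ola|cheers"),
    "non-capturing": re.compile(r"(?:hello|bye|ola|cheers)"),
    "capturing": re.compile(r"(hello)|(bye)|(ola)|(cheers)"),
}


def time_patterns(number=200_000):
    """Return {label: seconds} for `number` searches with each pattern."""
    return {
        label: timeit.timeit(lambda p=pattern: p.search(TEXT), number=number)
        for label, pattern in PATTERNS.items()
    }


if __name__ == "__main__":
    for label, seconds in time_patterns().items():
        print(f"{label:>14}: {seconds:.3f}s")
```

All three patterns find the same text ("hello"); they differ only in what the engine has to record about the match.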

I know you tagged python, but I have prepared an online benchmark for javascript to show how capturing and non-capturing groups affect performance in the js regex engine.

https://jsperf.com/capturing-groups-vs-non-capturing-groups

Your patterns differ only in the capturing groups. A successful re.search returns a MatchObject instance either way, but when the pattern defines capturing groups, each match object contains as many groups as there are capturing groups in the pattern, even if they are empty. That is the overhead for the re internals: adding the (list of) groups (memory allocation, etc.). Mind that groups also carry details such as the starting and ending index of the text they match, and more (refer to the MatchObject reference).
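To make the extra bookkeeping visible, here is a small sketch that inspects the match objects for the two patterns from the question. It only shows what the match object exposes, not how re allocates it internally; the variable names are illustrative.

```python
import re

text = "some say hello,some say bye, or ola or cheers!"

plain = re.compile(r"hello|bye|ola|cheers")
grouped = re.compile(r"(hello)|(bye)|(ola)|(cheers)")

# Number of capturing groups defined by each pattern.
print(plain.groups)    # 0
print(grouped.groups)  # 4

m_plain = plain.search(text)
m_grouped = grouped.search(text)

# The grouped match has one slot per capturing group,
# including the groups that did not take part in the match.
print(m_plain.groups())    # ()
print(m_grouped.groups())  # ('hello', None, None, None)

# Each group also records its start and end index.
print(m_grouped.span(1))   # (9, 14)
print(m_grouped.span(2))   # (-1, -1) for a group that did not match
```

Both searches stop at the same "hello", but the grouped pattern has to track four group slots and their positions for it.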
