
Compiling Regular Expressions in Python

I'm working through Doug Hellmann's "The Python Standard Library by Example" and came across this:

"1.3.2 Compiling Expressions: re includes module-level functions for working with regular expressions as text strings, but it is more efficient to compile the expressions a program uses frequently."

I couldn't follow his explanation for why this is the case. He says that the "module-level functions maintain a cache of compiled expressions" and that, since the "size of the cache" is limited, "using compiled expressions directly avoids the cache lookup overhead."

I'd greatly appreciate it if someone could explain, or point me to an explanation of, why it is more efficient to compile the regular expressions a program uses frequently, and how this process actually works.

Hm. This is strange. My knowledge so far (gained, among other sources, from this question) suggested my initial answer:


First answer

Python caches the last 100 regexes you used, so even if you don't compile them explicitly, they don't have to be recompiled on every use.

However, there are two drawbacks. First, when the limit of 100 regexes is reached, the entire cache is cleared, so if you use 101 different regexes in a row, each one will be recompiled every time. That's rather unlikely, but still.

Second, to find out whether a regex has already been compiled, the interpreter needs to look it up in the cache on every call, which does take a little extra time (though not much, since dictionary lookups are very fast).

So, if you explicitly compile your regexes, you avoid this extra lookup step.
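The caching behavior described above can be sketched roughly as follows. This is a simplified model, not the actual implementation: CPython's `re` module uses a private `re._compile` function and its cache size and eviction details vary by version.

```python
import re

# Simplified sketch of how re's module-level functions cache compiled
# patterns. The real implementation (re._compile) differs in detail and
# varies by Python version; this only models the behavior described above.
_MAXCACHE = 100
_cache = {}

def cached_compile(pattern, flags=0):
    key = (type(pattern), pattern, flags)
    try:
        return _cache[key]  # cache hit: a dict lookup instead of recompiling
    except KeyError:
        compiled = re.compile(pattern, flags)
        if len(_cache) >= _MAXCACHE:
            _cache.clear()  # when the cache is full, it is cleared wholesale
        _cache[key] = compiled
        return compiled

# Calling with the same pattern twice returns the same compiled object:
p1 = cached_compile(r"\w+")
p2 = cached_compile(r"\w+")
assert p1 is p2
```

Explicitly calling `re.compile()` yourself and keeping the result skips the `cached_compile`-style lookup entirely, which is the saving the book refers to.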


Update

I just did some testing (Python 3.3):

>>> import timeit
>>> timeit.timeit(setup="import re", stmt='''r=re.compile(r"\w+")\nfor i in range(10):\n r.search("  jkdhf  ")''')
18.547793477671938
>>> timeit.timeit(setup="import re", stmt='''for i in range(10):\n re.search(r"\w+","  jkdhf  ")''')
106.47892003890324

So it would appear that no caching is being done. Perhaps that's a quirk of the special conditions under which timeit.timeit() runs?

On the other hand, in Python 2.7, the difference is not as noticeable:

>>> import timeit
>>> timeit.timeit(setup="import re", stmt='''r=re.compile(r"\w+")\nfor i in range(10):\n r.search("  jkdhf  ")''')
7.248294908492429
>>> timeit.timeit(setup="import re", stmt='''for i in range(10):\n re.search(r"\w+","  jkdhf  ")''')
18.26713670282241
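For reference, the same comparison can be made with one search per timed statement using timeit's `number` argument. The absolute numbers above depend heavily on machine and Python version, so a re-run might look like this; only the relative difference between the two timings is meaningful.

```python
import timeit

# Time a single search per statement; absolute values are machine- and
# version-dependent, so only the ratio between the two is of interest.
compiled = timeit.timeit(
    setup='import re\nr = re.compile(r"\\w+")',
    stmt='r.search("  jkdhf  ")',
    number=100_000,
)
module_level = timeit.timeit(
    setup="import re",
    stmt='re.search(r"\\w+", "  jkdhf  ")',
    number=100_000,
)
print(compiled, module_level)
```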

I believe what he is trying to say is that you shouldn't compile your regex inside your loop, but outside it. You can then just run the already-compiled pattern inside the loop.

Instead of:

while True: 
    result = re.match('A', s)

you should write:

regex = re.compile('A')
while True:
    result = regex.match(s)

Basically, re.match(pattern, str) combines the compilation and matching steps. Compiling the same pattern inside the loop is inefficient, so the compilation should be hoisted out of the loop.
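A minimal self-contained illustration of that hoisting (the pattern and input string here are hypothetical, chosen just to show that the two call styles produce the same result):

```python
import re

text = "ABC"

# Module-level call: compiles (or looks up in the cache) on every call.
m1 = re.match('A', text)

# Precompiled: the pattern object is built once and reused directly.
regex = re.compile('A')
m2 = regex.match(text)

assert m1.group() == m2.group() == 'A'
```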

See Tim's answer for the correct reasoning.

It sounds to me like the author is simply saying that it's more efficient to compile a regex and save it yourself than to count on a previously compiled version still being held in the module's limited-size internal cache. This is probably because the effort it takes to compile the pattern, plus the cache lookup that must happen first, is greater than the cost of the client simply storing the compiled object itself.


 