Why are uncompiled, repeatedly used regexes so much slower in Python 3?

When answering this question (and having read this answer to a similar question), I thought that I knew how Python caches regexes.

But then I thought I'd test it, comparing two scenarios:

  1. a single compilation of a simple regex, then 10 applications of that compiled regex.
  2. 10 applications of an uncompiled regex (where I would have expected slightly worse performance because the regex would have to be compiled once, then cached, and then looked up in the cache 9 times).

However, the results were staggering (in Python 3.3):

>>> import timeit
>>> timeit.timeit(setup="import re", 
... stmt='r=re.compile(r"\w+")\nfor i in range(10):\n r.search("  jkdhf  ")')
18.547793477671938
>>> timeit.timeit(setup="import re", 
... stmt='for i in range(10):\n re.search(r"\w+","  jkdhf  ")')
106.47892003890324

That's over 5.7 times slower! In Python 2.7, there is still an increase by a factor of 2.5, which is also more than I would have expected.

Has caching of regexes changed between Python 2 and 3? The docs don't seem to suggest that.

The code has changed.

In Python 2.7, the cache is a simple dictionary; if more than _MAXCACHE items are stored in it, the whole cache is cleared before storing a new item. A cache lookup only requires building a simple key and testing the dictionary; see the 2.7 implementation of _compile().
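To make that concrete, here is a minimal sketch of the 2.7-style scheme (illustrative only, not the actual CPython source; the function name compile_cached and the _MAXCACHE value are mine, and re.compile() merely stands in for the real compilation step):

import re

_MAXCACHE = 100
_cache = {}

def compile_cached(pattern, flags=0):
    key = (type(pattern), pattern, flags)
    try:
        return _cache[key]            # hit: a single dict lookup, nothing else
    except KeyError:
        pass
    compiled = re.compile(pattern, flags)   # stand-in for the actual compile step
    if len(_cache) >= _MAXCACHE:
        _cache.clear()                # evict everything at once, no LRU bookkeeping
    _cache[key] = compiled
    return compiled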

In Python 3.x, the cache has been replaced by the @functools.lru_cache(maxsize=500, typed=True) decorator. This decorator does much more work per lookup: it takes a thread lock, adjusts the LRU queue and maintains the cache statistics (accessible via re._compile.cache_info()). See the 3.3.0 implementation of _compile() and of functools.lru_cache().
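Roughly speaking, the 3.3.0 arrangement amounts to wrapping the compile step in lru_cache. The sketch below (with an invented compile_cached helper, not the stdlib's _compile) shows the extra per-call machinery and how the statistics become visible:

import functools
import re

@functools.lru_cache(maxsize=500, typed=True)
def compile_cached(pattern, flags=0):
    # every call goes through lru_cache's wrapper: key building,
    # locking and LRU reordering happen even on a cache hit
    return re.compile(pattern, flags)

compile_cached(r"\w+").search("  jkdhf  ")   # miss: compiles and caches
compile_cached(r"\w+").search("  jkdhf  ")   # hit: served from the LRU cache
print(compile_cached.cache_info())           # e.g. CacheInfo(hits=1, misses=1, maxsize=500, currsize=1)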

Others have noticed the same slowdown and filed issue 16389 in the Python bug tracker. I'd expect 3.4 to be a lot faster again; either the lru_cache implementation will be improved or the re module will move back to a custom cache.

Update: With revision 4b4dddd670d0 (hg) / 0f606a6 (git), the cache change has been reverted to the simple version found in 3.1. Python versions 3.2.4 and 3.3.1 include that revision.

Since then, in Python 3.7 the pattern cache was updated to a custom FIFO cache implementation based on a regular dict (which relies on insertion order and, unlike an LRU, does not take into account how recently items already in the cache were used when deciding what to evict).
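The FIFO idea can be sketched like this (again an illustration with an invented compile_fifo helper, not the actual 3.7 code): a regular dict remembers insertion order, so eviction just drops the oldest entry and no reordering is needed on a hit.

import re

_MAXCACHE = 512
_cache = {}

def compile_fifo(pattern, flags=0):
    key = (type(pattern), pattern, flags)
    try:
        return _cache[key]            # hit: plain dict lookup, no LRU reordering
    except KeyError:
        pass
    compiled = re.compile(pattern, flags)   # stand-in for the actual compile step
    if len(_cache) >= _MAXCACHE:
        # drop the oldest insertion, regardless of how recently it was used
        del _cache[next(iter(_cache))]
    _cache[key] = compiled
    return compiled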
