How does Python's regex pattern caching work?

From the Python docs for re.compile():

Note: The compiled versions of the most recent patterns passed to re.match(), re.search() or re.compile() are cached, so programs that use only a few regular expressions at a time needn't worry about compiling regular expressions.

However, in my testing, this assertion doesn't seem to hold up. When timing the following snippets, which use the same pattern repeatedly, the compiled version is still substantially faster than the uncompiled one (which should supposedly be cached).

Is there something I am missing here that explains the time difference?

import timeit

setup = """
import re
pattern = "p.a.t.t.e.r.n"
target = "p1a2t3t4e5r6n"
r = re.compile(pattern)
"""

# Report the best of 3 runs, each making 5,000,000 calls.
print("compiled:",
      min(timeit.Timer("r.search(target)", setup).repeat(3, 5000000)))
print("uncompiled:",
      min(timeit.Timer("re.search(pattern, target)", setup).repeat(3, 5000000)))

Results:

compiled: 2.26673030059
uncompiled: 6.15612802627

Here's the (CPython) implementation of re.search:

def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)

and here is re.compile:

def compile(pattern, flags=0):
    "Compile a regular expression pattern, returning a pattern object."
    return _compile(pattern, flags)

which relies on re._compile:

def _compile(*key):
    # internal: compile pattern
    cachekey = (type(key[0]),) + key
    p = _cache.get(cachekey)            #_cache is a dict.   
    if p is not None:
        return p
    pattern, flags = key
    if isinstance(pattern, _pattern_type):
        if flags:
            raise ValueError('Cannot process flags argument with a compiled pattern')
        return pattern 
    if not sre_compile.isstring(pattern):
        raise TypeError, "first argument must be string or compiled pattern"
    try:
        p = sre_compile.compile(pattern, flags)
    except error, v:
        raise error, v # invalid expression
    if len(_cache) >= _MAXCACHE:
        _cache.clear()
    _cache[cachekey] = p
    return p

So you can see that as long as the regex is already in the dictionary, the only extra work involved is the dictionary lookup (which involves creating a few temporary tuples, a few extra function calls ...). With a pattern this cheap to match, that per-call overhead, rather than any recompilation, accounts for the gap in the timings.
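As a rough check of this point, here is a minimal sketch for Python 3, assuming CPython's re cache is in play. It shows that a repeated re.compile() call with the same pattern hands back the very same object, and it times the two call styles side by side. The iteration count and the `names` dict are illustrative choices, not part of the original benchmark.

```python
import re
import timeit

pattern = "p.a.t.t.e.r.n"
target = "p1a2t3t4e5r6n"

# The first call compiles the pattern and stores it in re's internal cache.
r = re.compile(pattern)

# A second call with the same pattern and flags is a cache hit, so CPython
# returns the identical object instead of compiling again.
same_object = re.compile(pattern) is r

# Time both call styles. 100000 iterations is an arbitrary value chosen
# to keep the run short; it is not the count used in the question.
names = {"re": re, "r": r, "pattern": pattern, "target": target}
compiled_time = min(
    timeit.repeat("r.search(target)", globals=names, repeat=3, number=100000)
)
module_time = min(
    timeit.repeat("re.search(pattern, target)", globals=names, repeat=3, number=100000)
)

print("cache hit returns same object:", same_object)
print("compiled:  ", compiled_time)
print("uncompiled:", module_time)
```

The absolute numbers depend on the machine and the Python version. Any difference between the two timings comes from the wrapper call and the cache lookup, since no compilation happens inside the timed statements.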

Update: In the good ole' days (the code copied above), the cache was completely invalidated when it got too big. These days, the cache cycles, dropping the oldest items first. This implementation relies on the ordering of Python dictionaries (which was an implementation detail until Python 3.7). In CPython before Python 3.6, this would have dropped an arbitrary value out of the cache (which is arguably still better than invalidating the whole cache).
