What is the fastest way to compare a text with a large number of regexp with python?

I have a list of regular expressions and I would like to match them against tweets as they arrive, so I can associate each tweet with a specific account. With a small number of rules, as below, it goes really fast, but as soon as you increase the number of rules, it becomes slower and slower.

import string, re2, datetime, time, array

rules = [
    [[1],["(?!.*ipiranga).*((?=.*posto)(?=.*petrobras).*|(?=.*petrobras)).*"]],
    [[2],["(?!.*brasil).*((?=.*posto)(?=.*petrobras).*|(?=.*petrobras)).*"]],
]

#cache compile
compilled_rules = []
for rule in rules:
    compilled_rules.append([[rule[0][0]], [re2.compile(rule[1][0])]])

def get_rules(text):
    new_tweet = string.lower(text)
    for rule in compilled_rules:
        ok = 1
        if not re2.search(rule[1][0], new_tweet): ok=0
        print ok

def test():
    t0=datetime.datetime.now()
    i=0
    time.sleep(1)
    while i<1000000:
        get_rules("Acabei de ir no posto petrobras. Moro pertinho do posto brasil")
        i+=1
        t1=datetime.datetime.now()-t0
        print "test"
        print i
        print t1
        print i/t1.seconds

When I tested with 550 rules, I couldn't do more than 50 reqs/s. Is there a better way of doing this? I need at least 200 reqs/s.

EDIT: after tips from Jonathan, I was able to improve speed about 5 times just by nesting my rules a bit. See the code below:

import re, string, datetime, time, cProfile

scope_rules = {
    "1": {
        "termo 1" : "^(?!.*brasil)(?=.*petrobras).*",
        "termo 2" : "^(?!.*petrobras)(?=.*ipiranga).*",
        "termo 3" : "^(?!.*petrobras)(?=.*ipiranga).*",
        "termo 4" : "^(?!.*petrobras)(?=.*ipiranga).*",
        },
    "2": {
        "termo 1" : "^(?!.*ipiranga)(?=.*petrobras).*",
        "termo 2" : "^(?!.*petrobras)(?=.*ipiranga).*",
        "termo 3" : "^(?!.*brasil)(?=.*ipiranga).*",
        "termo 4" : "^(?!.*petrobras)(?=.*ipiranga).*",
        }
    }
compilled_rules = {}
for scope,rules in scope_rules.iteritems():
    compilled_rules[scope]={}
    for term,rule in rules.iteritems():
        compilled_rules[scope][term] = re.compile(rule)


def get_rules(text):
    new_tweet = string.lower(text)
    for scope,rules in compilled_rules.iteritems():
        ok = 1
        for term,rule in rules.iteritems():
            if ok==1:
                if re.search(rule, new_tweet):
                    ok=0
                    print "found in scope" + scope + " term:"+ term


def test():
    t0=datetime.datetime.now()
    i=0
    time.sleep(1)
    while i<1000000:
        get_rules("Acabei de ir no posto petrobras. Moro pertinho do posto ipiranga da lagoa")
        i+=1
        t1=datetime.datetime.now()-t0
        print "test"
        print i
        print t1
        print i/t1.seconds

cProfile.run('test()', 'testproof')

Your rules appear to be the culprits here: because of the two .*, separated by lookaheads, a very high number of permutations has to be checked for a successful match (or to exclude a match). This is further compounded by your using re.search() without anchors. Also, the alternation including the posto part is superfluous - the regex matches whether or not there's any posto in your string, so you might as well drop that completely.

For example, your first rule can be rewritten as

^(?!.*ipiranga)(?=.*petrobras)

without any change in results. You can further optimize it with word boundaries, if you're looking for exact words:

^(?!.*\bipiranga\b)(?=.*\bpetrobras\b)

Some measurements (using RegexBuddy):

Your first regex, applied to the string Acabei de ir no posto petrobras. Moro pertinho do posto brasil, takes the regex engine about 4700 steps to figure out a match. If I take out the s in petrobras, it takes over 100,000 steps to determine a non-match.

Mine matches in 230 steps (and fails in 260), so you get a 20-400 times speed-up just from constructing the regex correctly.
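A minimal sketch to reproduce this kind of comparison in Python itself (added for illustration; the patterns are the question's original first rule and the rewrite above, and wall-clock timings will differ from RegexBuddy's step counts):

import re, timeit

old = re.compile("(?!.*ipiranga).*((?=.*posto)(?=.*petrobras).*|(?=.*petrobras)).*")
new = re.compile("^(?!.*ipiranga)(?=.*petrobras)")

text = "acabei de ir no posto petrobras. moro pertinho do posto brasil"

# time 100,000 searches with each pattern; the rewrite should be far faster
for name, rx in (("old", old), ("new", new)):
    print name, timeit.timeit(lambda: rx.search(text), number=100000)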

Going even further, I created a Cython extension to evaluate the rules, and now it's blazing fast. I can do about 70 requests per second with about 3000 regex rules.

regex.pyx

import re2
import string as pystring

cpdef list match_rules(char *pytext, dict compilled_rules):
    cdef int ok, scope, term
    cdef list response = []
    text = pystring.lower(pytext)
    for scope, rules in compilled_rules.iteritems():
        ok = 1
        for term,rule in rules.iteritems():
            if ok==1:
                if re2.search(rule, text):
                    ok=0
                    response.append([scope,term])
    return response
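
(Note added for completeness: a .pyx file must be compiled before it can be imported. Below is a minimal, hypothetical build script, assuming the file is saved as mregex.pyx so that the mregex.match_rules call in the Python code below resolves.)

# setup.py - hypothetical build script; assumes Cython is installed
from distutils.core import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("mregex.pyx"))

It can then be built in place with python setup.py build_ext --inplace.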

python code

import re2 as re
import datetime, time, cProfile
import mregex  # the compiled Cython extension (built from the .pyx above)

scope_rules = {
    1: {
        1: "^(?!.*brasil)(?=.*petrobras).*",
        2: "^(?!.*petrobras)(?=.*ipiranga).*",
        3: "^(?!.*petrobras)(?=.*ipiranga).*",
        4: "^(?!.*petrobras)(?=.*ipiranga).*",
    },
    2: {
        1: "^(?!.*brasil)(?=.*petrobras).*",
        2: "^(?!.*petrobras)(?=.*ipiranga).*",
        3: "^(?!.*petrobras)(?=.*ipiranga).*",
        4: "^(?!.*petrobras)(?=.*ipiranga).*",
    },
}

compilled_rules = {}
for scope,rules in scope_rules.iteritems():
    compilled_rules[scope]={}
    for term,rule in rules.iteritems():
        compilled_rules[scope][term] = re.compile(rule)

def test():
    t0=datetime.datetime.now()
    i=0
    time.sleep(1)
    while i<1000000:
        mregex.match_rules("Acabei de ir no posto petrobras. Moro pertinho do posto brasil",compilled_rules)
        i+=1
        t1=datetime.datetime.now()-t0
        print t1
        print i/t1.seconds

cProfile.run('test()', 'testproof')

Besides optimizing your regex patterns themselves (which will make a huge difference), you can try Google's RE2 - it's supposed to be faster than Python's standard regular expressions module.

It's done in C++, but there's PyRE2, a Python wrapper for RE2 by Facebook :)
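As a rough illustration (not from the original answer), pyre2 mirrors the re module's interface, so assuming the package is installed it can simply be aliased in. Note that RE2 itself does not support lookaround assertions, so a wrapper may fall back to the standard engine for patterns like the ones in this question:

import re2 as re  # assumes the pyre2 package is installed

# a plain pattern RE2 handles natively (no lookarounds)
pattern = re.compile(r"\bpetrobras\b")
if pattern.search("acabei de ir no posto petrobras"):
    print "matched"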

PS Thanks to your question, I found a great read on regex matching!

in addition to @Tim Pietzcker's recommendation:

if you have a lot of rules, it might make sense to try a tiered approach built off of smaller rules grouped by commonalities.

for example, both of your rules above match 'posto' and 'petrobras'. if you grouped those regexes into a list together, and then qualified a dispatch to that list, you could avoid running lots of rules that would never apply.

in rough python....

import re

# a dict of qualifier patterns, each gating a list of sub-rules
# (the original sketch marked these "+ petrobras", "- ipiranga", "- brasil",
#  "? posto"; the optional "? posto" term doesn't gate anything, so it is
#  left out here)
rules = {
    r"petrobras": [          # qualifier: the group only runs if this matches
        r"^(?!.*ipiranga)",  # sub-rule: must not contain "ipiranga"
        r"^(?!.*brasil)",    # sub-rule: must not contain "brasil"
    ],
}

text = "acabei de ir no posto petrobras"

for rule_qualifier, sub_rules in rules.items():
    if re.search(rule_qualifier, text):
        for rule in sub_rules:
            if re.search(rule, text):
                print "matched in", rule_qualifier, "term:", rule
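
to illustrate (an added walk-through, not part of the original answer): with the sample tweet above, the cheap petrobras qualifier fires once, and only then do the two negative-lookahead sub-rules run; a tweet that never mentions petrobras skips the whole group after a single inexpensive check, which is where the savings come from once you have many rules.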
