Python: regex pattern best practice for a large number of items?

Question

I'm looking for a pro tip here. I have in a database a list of strings, things like "MD", "PHD", "MR" etc., various salutations. It's several hundred rows and I receive it in a specific order (MD is more important than MR). I also have a series of people objects that I'll be iterating over and need a very efficient way of matching. I've tried two approaches, and maybe there isn't another method.

My first try is, when I receive the list, to re.compile each one and put them into a list. Then...

import re

theregexlist = ["MR", "DR", "MRS", "MISS", "PHD"]  # several hundred in practice, in priority order
personname = "MR JOEY SMITH"  # other examples are similar, like "BOBBY DR MD JOE"
for theregex in theregexlist:
    if re.search(theregex, personname):
        ...  # do stuffs
        break  # since my list is ordered, I only want the first match

Which does indeed work. I also tried looping over the regex list and building one huge matching regex with capturing parentheses, re.compile it, and then:

hugeregex = re.compile("(?:(MR)|(MR)|(PHD)| ...  |(DR)|(MD))")  # "..." stands for the elided alternatives
personname = "FRED DR FLINTSTONE"
maybematch = hugeregex.search(personname)
if maybematch:
    print(maybematch.group(0))

Is there some kind of map, leverage of keys, or iteration function that I'm just not thinking of that would be more efficient? Any and all ideas are appreciated! Even if it's "Yup, it's just gonna be slow, try to use timeit to see which is faster", then I can stop searching :) Thank you!

Answer 1

The "big" RegEx with all the "particles" (like "MR", "MS", etc.) should be more efficient because it is compiled only once and each name is scanned with a single search call instead of one call per particle. Fewer function calls is an optimisation in itself.

If you have special characters inside a particle, you may need to escape them with re.escape.

You can compile the RegEx and keep a reference to its search method.

Here is an example:
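
To illustrate why escaping matters, here is a small sketch using a hypothetical particle "M.D." (it is not in the original list). Unescaped, each dot matches any character, which produces a false positive:

```python
import re

particule = "M.D."  # hypothetical particle containing regex metacharacters

unescaped = re.compile(particule)
escaped = re.compile(re.escape(particule))

# Unescaped, each "." matches any character, so "MADE" is a false positive.
print(bool(unescaped.search("MADE IN FRANCE")))  # True
print(bool(escaped.search("MADE IN FRANCE")))    # False
print(bool(escaped.search("JOE SMITH M.D.")))    # True
```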

import re

particules = ["MR", "DR", "MRS", "MISS", "PHD"]

regex = r"\b(?:" + "|".join(map(re.escape, particules)) + r")\b"
search_any_particule = re.compile(regex, flags=re.IGNORECASE).search

personname = "FRED DR FLINTSTONE"

mo = search_any_particule(personname)
if mo:
    print(mo.group())

You get: 'DR'.

EDIT

The best way to make sure your implementation is efficient is to profile it. For that, you can use the cProfile module.

For instance:
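
One caveat, given that the list in the question is ordered by importance: an alternation returns the leftmost match in the string, not the particle that comes first in the list. For "BOBBY DR MD JOE" a combined pattern finds "DR" even if "MD" ranks higher. Below is a minimal sketch of one way to keep the priority while still scanning each name once. The helper name `find_best_particule` and the list order are assumptions made for illustration:

```python
import re

# Ordered by priority, most important first (illustrative order).
particules = ["MD", "PHD", "DR", "MRS", "MR", "MISS"]

# Map each particle to its position in the list: the lowest rank wins.
rank = {p: i for i, p in enumerate(particules)}

pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, particules)) + r")\b")


def find_best_particule(personname):
    """Return the highest-priority particle found in personname, or None."""
    found = pattern.findall(personname)
    if not found:
        return None
    return min(found, key=rank.__getitem__)


print(find_best_particule("BOBBY DR MD JOE"))  # MD
print(find_best_particule("MRS JANE SMITH"))   # MRS
print(find_best_particule("JOHN SMITH"))       # None
```

This sketch is case-sensitive. If the pattern is compiled with re.IGNORECASE, the matched text would need to be upper-cased before the rank lookup.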

def find_particule(personname):
    mo = search_any_particule(personname)
    if mo:
        return mo.group()
    return None

import cProfile

cProfile.runctx('for i in range(1000000): find_particule("FRED DR FLINTSTONE")', globals(), locals())

The profiler will give you something like this:

         3000003 function calls in 2.110 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.353    0.353    2.110    2.110 <string>:1(<module>)
  1000000    0.495    0.000    1.757    0.000 python:10(find_particule)
        1    0.000    0.000    2.110    2.110 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  1000000    0.185    0.000    0.185    0.000 {method 'group' of '_sre.SRE_Match' objects}
  1000000    1.078    0.000    1.078    0.000 {method 'search' of '_sre.SRE_Pattern' objects}
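
The question also mentions timeit, which is lighter than a full profile when the goal is only to compare two approaches. Here is a rough sketch, with hypothetical function names `loop_search` and `single_search` and a five-item list standing in for the real one. Timings depend on the machine and on the size of the list, so none are shown:

```python
import re
import timeit

particules = ["MR", "DR", "MRS", "MISS", "PHD"]
personname = "FRED DR FLINTSTONE"

# Approach 1: one compiled pattern per particle, tried in list order.
compiled_list = [re.compile(re.escape(p)) for p in particules]


def loop_search(name):
    for rx in compiled_list:
        mo = rx.search(name)
        if mo:
            return mo.group()
    return None


# Approach 2: a single alternation, compiled once.
single = re.compile("|".join(map(re.escape, particules)))


def single_search(name):
    mo = single.search(name)
    return mo.group() if mo else None


# Time each approach on the same input and report the total seconds.
for func in (loop_search, single_search):
    seconds = timeit.timeit(lambda: func(personname), number=100_000)
    print(f"{func.__name__}: {seconds:.3f} s")
```

Running this against the real list of several hundred particles and a sample of real names should give a more representative comparison than the toy data here.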
