How to match a string against multiple regexes?

At the moment I am using the filter below to increment the elements of arr, given a list of strings as an argument. Is there a more efficient way to do this in Python? I have millions of such lists to validate.

import re

def countbycat(tempfilter):
    arr = [0, 0, 0, 0, 0]
    apattern, dpattern, mpattern, upattern, tpattern = re.compile("^[a]--*"), re.compile("^[d]--*"), re.compile("^[m]--*"), re.compile("^[u]--*"), re.compile("^[t]--*")
    for each in tempfilter:
        if upattern.match(each):
            arr[0] += 1
        elif mpattern.match(each):
            arr[1] += 1
        elif dpattern.match(each):
            arr[2] += 1  # was "arr[2]=1", which reset the count instead of incrementing it
        elif apattern.match(each):
            arr[3] += 1
        elif tpattern.match(each):
            arr[4] += 1
    return arr

For the regular expressions given in the question, you can use the following single regular expression, which uses a character class:

[admut]-
  • [admut] matches any one of a, d, m, u, t.
  • ^ can be omitted because re.match only matches at the beginning of the string.
  • -* was removed because it is pointless; a single - is enough to check that a - appears after the a/d/m/u/t.
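
The points above can be checked with a short sketch: it compares the five patterns from the question with the single character-class pattern on a handful of strings. The sample strings and the helper name `matches_any_original` are made up for illustration.

```python
import re

# The five patterns from the question and the single combined replacement.
originals = [re.compile(p) for p in ("^[a]--*", "^[d]--*", "^[m]--*", "^[u]--*", "^[t]--*")]
combined = re.compile("[admut]-")

# Hypothetical sample inputs covering matching and non-matching cases.
samples = ["a-1", "d--x", "m-", "u---", "t-foo", "x-1", "a", "", "-a", "ab-"]

def matches_any_original(s):
    """True if any of the question's patterns matches at the start of s."""
    return any(p.match(s) for p in originals)

# Both approaches should agree on every sample.
for s in samples:
    assert bool(combined.match(s)) == matches_any_original(s)
```

This only shows agreement on the listed samples; it is a sanity check, not a proof.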

And instead of a list, you can use a dictionary, so there is no need to remember indexes:

import re

def countbycat(tempfilter):
    count = dict.fromkeys('admut', 0)
    pattern = re.compile("[admut]-")
    for each in tempfilter:
        if pattern.match(each):
            count[each[0]] += 1
    return count

Instead of dict.fromkeys, you can use collections.Counter.
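
A minimal sketch of the collections.Counter variant, assuming the same `[admut]-` pattern; the module-level name `PATTERN` and the sample list are illustrative only.

```python
import re
from collections import Counter

PATTERN = re.compile("[admut]-")

def countbycat(tempfilter):
    # Count the first character of every string that matches the pattern.
    return Counter(each[0] for each in tempfilter if PATTERN.match(each))

counts = countbycat(["a-1", "a-2", "u-x", "zzz", "t--"])
print(counts["a"], counts["u"], counts["d"])  # 2 1 0
```

Note one difference from dict.fromkeys: a Counter only stores keys it has actually seen, although looking up a missing key still returns 0.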

Don't use a regex for this. You are checking for a very specific, fixed condition, namely each[1] == '-' and each[0] in 'admut'. Both of these checks are much faster than a regex. The latter can also be done with a mapping from character to index.

def countbycat(tempfilter):
  arr = [0, 0, 0, 0, 0]
  char_idx = {  # map a/d/m/u/t to indices
    'u': 0,
    'm': 1,
    'd': 2,
    'a': 3,
    't': 4,
    }
  for each in tempfilter:
    # check for '-' in second position; slicing avoids IndexError on short strings
    if each[1:2] == '-':
      try:
        arr[char_idx[each[0]]] += 1  # increment the slot for this first character
      except KeyError:  # each[0] is not one of a/d/m/u/t
        pass
  return arr
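
Whether the plain character checks are actually faster depends on the data and the Python version, so it is worth measuring. The following is a rough timing sketch, not a benchmark; the function names `by_regex` and `by_index` and the test data are made up for illustration.

```python
import re
import timeit

PATTERN = re.compile("[admut]-")

def by_regex(items):
    # Count strings that match the combined pattern.
    return sum(1 for s in items if PATTERN.match(s))

def by_index(items):
    # s[1:2] is '' for strings shorter than two characters, so s[0] is
    # only evaluated when the string is long enough.
    return sum(1 for s in items if s[1:2] == "-" and s[0] in "admut")

# Made-up data: a mix of matching and non-matching strings.
items = ["a-1", "d-2", "m-3", "u-4", "t-5", "x-6", "plain", ""] * 1000

# Both approaches must agree before their timings are worth comparing.
assert by_regex(items) == by_index(items)
print("regex:", timeit.timeit(lambda: by_regex(items), number=20))
print("index:", timeit.timeit(lambda: by_index(items), number=20))
```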

In your simple case, go for falsetru's answer.

In the general case, you can combine your patterns into one regex (provided that your regexes don't contain capturing groups) and check which part of the regex matched:

import re
from collections import Counter

patterns = ["^[a]-+", "^[d]-+", "^[m]-+", "^[u]-+", "^[t]-+"]
# Wrap each pattern in its own capturing group so we can tell which one matched.
complex_pattern = re.compile('|'.join('(%s)' % i for i in patterns))

# imperative way
def countbycat(tempfilter):
    arr = [0, 0, 0, 0, 0]
    for each in tempfilter:
        match = complex_pattern.match(each)
        if match:
            # lastindex is the 1-based number of the group that matched
            arr[match.lastindex - 1] += 1
    return arr

# functional way
def countbycat_counter(tempfilter):
    matches_or_none = (complex_pattern.match(each) for each in tempfilter)
    return Counter(match.lastindex - 1 for match in matches_or_none if match is not None)
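
If you prefer names to numeric positions, one option is to give each alternative a named group: `match.lastgroup` then returns the name of the group that matched (it is None when the matching group is unnamed). This is a sketch under the same assumption that the individual patterns contain no capturing groups of their own; the category names and the `named_patterns` dictionary are hypothetical.

```python
import re
from collections import Counter

# Hypothetical category names mapped to the individual patterns.
named_patterns = {
    "a": "^[a]-+",
    "d": "^[d]-+",
    "m": "^[m]-+",
    "u": "^[u]-+",
    "t": "^[t]-+",
}

# Build (?P<a>^[a]-+)|(?P<d>^[d]-+)|... so each alternative carries its name.
combined = re.compile(
    "|".join("(?P<%s>%s)" % (name, pat) for name, pat in named_patterns.items())
)

def countbycat(tempfilter):
    matches = (combined.match(each) for each in tempfilter)
    # lastgroup is the name of the alternative that matched.
    return Counter(m.lastgroup for m in matches if m is not None)

counts = countbycat(["a-1", "u--", "u-", "zz"])
print(counts["u"], counts["a"])  # 2 1
```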
