Python: defining a union of regular expressions

I have a list of patterns like:

list_patterns = [': error:', ': warning:', 'cc1plus:', 'undefine reference to']

What I want to do is produce a union of all of them, yielding a regular expression that matches every element in list_patterns [but presumably does not match any re not in list_patterns -- msw]:

re.compile(list_patterns)

Is this possible?

There are a couple of ways of doing this. The simplest is:

import re

list_patterns = [': error:', ': warning:', 'cc1plus:', 'undefine reference to']
string = 'there is an : error: and a cc1plus: in this string'
print(re.findall('|'.join(list_patterns), string))

Output:

[': error:', 'cc1plus:']

which is fine as long as joining your search patterns doesn't break the regex (e.g. if one of them contains a regex special character such as an opening parenthesis). You can handle that this way:

import re

list_patterns = [': error:', ': warning:', 'cc1plus:', 'undefine reference to']
string = 'there is an : error: and a cc1plus: in this string'
# Escape each string so it is matched literally, then join with '|'.
pattern = "|".join(re.escape(p) for p in list_patterns)
print(re.findall(pattern, string))

The output is the same. What this does is pass each pattern through re.escape() to escape any regex special characters.
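To see why the escaping matters, here is a minimal sketch using hypothetical strings (not from the question) that contain regex special characters. Joined unescaped, they do not even compile; escaped, they are matched literally.

```python
import re

# Hypothetical literal strings containing regex metacharacters.
special_patterns = ['operator(', 'a+b']
text = 'call operator( and then a+b'

# Unescaped: the lone '(' makes the joined pattern invalid.
try:
    re.compile('|'.join(special_patterns))
    compiled_unescaped = True
except re.error:
    compiled_unescaped = False

# Escaped: every string is matched character for character.
escaped = '|'.join(re.escape(p) for p in special_patterns)
found = re.findall(escaped, text)

print(compiled_unescaped)
print(found)
```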

Which one you use depends on your list of patterns. Are they regular expressions, and can thus be assumed to be valid? If so, the first is probably appropriate. If they are literal strings, use the second method.

The first gets more complicated, however, because by concatenating several regular expressions you may change the grouping and cause other unintended side effects.
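One way such a side effect can show up, sketched here with hypothetical sub-patterns: | has the lowest precedence, so anything placed around the joined alternatives binds only to the first or the last of them. Wrapping the union in a non-capturing group (?:...) is a common way around this; whether you need it depends on how the combined pattern is used.

```python
import re

words = ['cat', 'dog']  # hypothetical sub-patterns

# Intended meaning: match either word, but only as a whole word.
naive = re.compile(r'\b' + '|'.join(words) + r'\b')        # \bcat|dog\b
grouped = re.compile(r'\b(?:' + '|'.join(words) + r')\b')  # \b(?:cat|dog)\b

# In the naive pattern the leading \b applies only to 'cat' and the
# trailing \b only to 'dog', so 'dog' is found inside 'hotdog'.
naive_hit = naive.search('hotdog')
grouped_hit = grouped.search('hotdog')

print(naive_hit is not None, grouped_hit is not None)
```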

list_regexs = [re.compile(x) for x in list_patterns]

You want a pattern that matches any item in the list? Wouldn't that just be:

': error:|: warning:|cc1plus:|undefine reference to'?

Or, in Python code:

re.compile("|".join(list_patterns))

How about:

ptrn = re.compile('|'.join(re.escape(e) for e in list_patterns))

Note the use of re.escape() to avoid unintended consequences from characters like ()[]|.+* etc. in some of the strings. That assumes you want the strings matched literally; otherwise skip the escape().

It also depends on how you intend to 'consume' that expression: is it only for searching for a match, or would you like to collect the matching groups as well?

Cletus gave a very good answer. If, however, one of the strings to match could be a substring of another, then you would want to reverse-sort the strings first so that shorter matches do not obscure longer ones.
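A small sketch of the problem, using hypothetical strings where one is a prefix of the other. It sorts by length, longest first, which is one variation on the reverse sort described here and is an assumption of this example rather than the only option.

```python
import re

overlapping = ['error', 'error: fatal']  # hypothetical strings
text = 'error: fatal'

# Alternatives are tried left to right, so the shorter one wins here.
unsorted_pattern = '|'.join(re.escape(p) for p in overlapping)
short_hits = re.findall(unsorted_pattern, text)

# Putting longer strings first lets the longer alternative match.
longest_first = sorted(overlapping, key=len, reverse=True)
sorted_pattern = '|'.join(re.escape(p) for p in longest_first)
long_hits = re.findall(sorted_pattern, text)

print(short_hits)
print(long_hits)
```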

If, as Alex has noted, the original poster wanted what he actually asked for, then a more tractable solution than using permutations might be to:

  • Remove any duplicates in list_patterns. (It may be better to start with a set and then turn it into a reverse-sorted list without duplicates.)
  • re.escape() the items of the list.
  • Surround each item individually with a group (...).
  • '|'.join() all the groups.
  • Find the set of indices of all groups that matched, and compare its length with len(list_patterns).

If there is at least one match for every entry in your original list of strings, then the length of the set should match.

The code would be something like:

import re

def usedgroupindex(indexabledata):
    for i,datum in enumerate(indexabledata):
        if datum: return i
    # return None

def findallstrings(list_patterns, string):
    lp = sorted(set(list_patterns), reverse=True)
    pattern = "|".join("(%s)" % re.escape(p) for p in lp)
    # for m in re.findall(pattern, string): print (m, usedgroupindex(m))
    return ( len(set(usedgroupindex(m) for m in re.findall(pattern, string)))
             == len(lp) )

list_patterns = [': error:', ': warning:', 'cc1plus:', 'undefine reference to']
string = ' XZX '.join(list_patterns)

print ( findallstrings(list_patterns, string) )

a regular expression that matches every element in the list

I see you've already got several answers based on the assumption that by "matches every element in the list" you actually meant "matches any element in the list" (the answers in question are based on the | "or" operator of regular expressions).

If you actually do want a RE to match every element of the list (as opposed to any single such element), then you might want to match them either in the same order as the list gives them (easy), or in any order whatsoever (hard).

For in-order matching, '.*?'.join(list_patterns) should serve you well (if the items are indeed to be taken as RE patterns -- if they're to be taken as literal strings instead, '.*?'.join(re.escape(p) for p in list_patterns) ).
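A minimal sketch of the in-order variant, treating the items as literal strings. The two-item list and the sample strings are made up for illustration, and the re.DOTALL flag is an assumption added so that the gaps between items may span line breaks.

```python
import re

list_patterns = [': error:', 'cc1plus:']  # shortened, hypothetical list

# Each item must appear after the previous one, with anything in between.
in_order = re.compile(
    '.*?'.join(re.escape(p) for p in list_patterns),
    re.DOTALL,
)

ok = in_order.search('an : error: and later a cc1plus: here')
wrong_order = in_order.search('a cc1plus: and later an : error: here')

print(ok is not None, wrong_order is not None)
```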

For any-order matching, regular expressions per se offer no direct support. You could take all permutations of the list (e.g. with itertools.permutations), join up each of them with '.*?', and join the whole with | -- but that can produce a terribly long RE pattern as a result, because the number of permutations of N items is N! ("N factorial" -- for example, for N equal to 4, the number of permutations is 4 * 3 * 2 * 1 == 24). Performance may therefore easily suffer unless the number of items in the list is known to be very, very small.
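The permutation approach could be sketched as follows. The helper name any_order_pattern and the three-item list are hypothetical, the items are treated as literal strings, and re.DOTALL is again an assumption. Note that the items must appear without overlapping each other for this pattern to match.

```python
import itertools
import re

def any_order_pattern(literals):
    """Build one regex that finds all literals, in any order."""
    alternatives = (
        '.*?'.join(re.escape(item) for item in perm)
        for perm in itertools.permutations(literals)
    )
    return re.compile('|'.join(alternatives), re.DOTALL)

items = ['aa', 'bb', 'cc']  # hypothetical literals
rx = any_order_pattern(items)

# One alternative per permutation: 3! == 6 for three items.
n_alternatives = rx.pattern.count('|') + 1
print(n_alternatives)
print(rx.search('cc x aa y bb') is not None)
```

With only a handful of items this stays manageable, but the pattern grows factorially, as described above.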

For a more general solution to the "match every item in arbitrary order" problem (if that's what you need), one with a performance and memory footprint that's still acceptable for decently long lists, you need to give up on the goal of making it all work with a single RE object, and add some logic to the mix -- for example, make a list of RE objects with relist=[re.compile(p) for p in list_patterns] , and check "they all match string s , in any order" with all(r.search(s) for r in relist) or the like.
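A minimal sketch of that check as a plain function. The name matches_all is hypothetical, and re.escape() is applied on the assumption that the items are literal strings; drop it if they are real RE patterns.

```python
import re

list_patterns = [': error:', ': warning:', 'cc1plus:', 'undefine reference to']

# One compiled RE per item (items assumed to be literal strings).
relist = [re.compile(re.escape(p)) for p in list_patterns]

def matches_all(s):
    """Return True if every pattern is found somewhere in s, in any order."""
    return all(r.search(s) for r in relist)

shuffled = ' ... '.join(reversed(list_patterns))
print(matches_all(shuffled))
print(matches_all('only an : error: here'))
```

The cost grows linearly with the number of items, since each pattern is searched for once.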

Of course, if you need to make this last approach work in a "duck-typing compatible way" with actual RE objects, that's not hard, e.g., if all you need is a search method which returns a boolean result (since returning a "match object" would make no sense):

import re

class relike(object):
    def __init__(self, list_patterns):
        self.relist = [re.compile(p) for p in list_patterns]
    def search(self, s):
        # True only if every pattern is found somewhere in s
        return all(r.search(s) for r in self.relist)
