Extract named group regex pattern from a compiled regex in Python

I have a regex in Python that contains several named groups. However, text that would match one group can be missed if an earlier group has already matched there, because overlapping matches don't seem to be allowed. As an example:

import re
myText = 'sgasgAAAaoasgosaegnsBBBausgisego'
myRegex = re.compile('(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))')

x = re.findall(myRegex,myText)
print(x)

Produces the output:

[('AAA', '')]

The 'long' group does not find a match because 'AAA' was used up in finding a match for the preceding 'short' group.
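A minimal sketch of what is happening here, using the same text and pattern. The variable names `my_text`, `my_regex` and `swapped` are illustrative only. It uses `finditer` and `lastgroup` to show which alternative matched, and shows that swapping the alternatives merely moves the problem to the other group:

```python
import re

my_text = 'sgasgAAAaoasgosaegnsBBBausgisego'
my_regex = re.compile('(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))')

# Alternatives are tried left to right at each position, so 'short'
# matches first and the scan resumes after the consumed 'AAA'.
matches = [(m.lastgroup, m.span(), m.group()) for m in my_regex.finditer(my_text)]
print(matches)

# Swapping the order lets 'long' match instead, but then 'short' is lost.
swapped = re.compile('(?P<long>(?:AAA.*BBB))|(?P<short>(?:AAA))')
swapped_matches = [(m.lastgroup, m.group()) for m in swapped.finditer(my_text)]
print(swapped_matches)
```

Either way only one of the two groups is reported, which is why running each group separately looks attractive.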

I've tried to find a method that allows overlapping matches, but failed. As an alternative, I've been looking for a way to run each named group separately. Something like the following:

for g in myRegex.groupindex.keys():
    match = re.findall(***regex_for_named_group_g***,myText)

Is it possible to extract the regex for each named group?

Ultimately, I'd like to produce a dictionary output (or similar) like:

{'short':'AAA',
 'long':'AAAaoasgosaegnsBBB'}

Any and all suggestions would be gratefully received.

There really doesn't appear to be a nicer way to do this, but here's another approach, along the lines of the other answer but somewhat simpler. It will work provided that a) your patterns are always formed as a series of named groups separated by pipes, and b) the named group patterns never contain named groups themselves.

The following would be my approach if you're interested in all matches of each pattern. The argument to re.split looks for a literal pipe followed by (?P< , the beginning of a named group. Because that check is a lookahead, only the pipe is consumed and each subpattern stays intact. The function compiles each subpattern and uses the groupindex attribute to extract the name.

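import re

def nameToMatches(pattern, string):
    result = dict()
    # Split on each pipe that is directly followed by "(?P<"
    for subpattern in re.split(r'\|(?=\(\?P<)', pattern):
        rx = re.compile(subpattern)
        # Each subpattern contains exactly one named group
        name = list(rx.groupindex)[0]
        result[name] = rx.findall(string)
    return result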

With your given text and pattern, this returns {'long': ['AAAaoasgosaegnsBBB'], 'short': ['AAA']}. Patterns that don't match at all will have an empty list for their value.
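To see what the split step produces on its own, here is a small sketch using the pattern from the question. The `nested` pattern is a hypothetical example, not from the question, added to illustrate why condition b) matters: a nested named group that follows a pipe gets split off and leaves an unbalanced fragment.

```python
import re

pattern = '(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))'

# The lookahead (?=...) checks for "(?P<" without consuming it,
# so only the pipe itself is removed by the split.
subpatterns = re.split(r'\|(?=\(\?P<)', pattern)
print(subpatterns)

# groupindex maps each group name to its group number.
names = [list(re.compile(sub).groupindex)[0] for sub in subpatterns]
print(names)

# Hypothetical pattern that violates condition b).
nested = '(?P<outer>x|(?P<inner>y))'
fragments = re.split(r'\|(?=\(\?P<)', nested)
try:
    re.compile(fragments[0])
    compiled_ok = True
except re.error:
    # The first fragment is missing its closing parenthesis.
    compiled_ok = False
print(fragments, compiled_ok)
```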

If you only want one match per pattern, you can make it a bit simpler still:

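def nameToMatch(pattern, string):
    result = dict()
    # Split on each pipe that is directly followed by "(?P<"
    for subpattern in re.split(r'\|(?=\(\?P<)', pattern):
        match = re.search(subpattern, string)
        if match:
            # groupdict() maps the group name to the matched text
            result.update(match.groupdict())
    return result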

This gives {'long': 'AAAaoasgosaegnsBBB', 'short': 'AAA'} for your givens. If one of the named groups doesn't match at all, it will be absent from the dict.
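Since the question starts from an already compiled regex, here is a sketch of a variant that accepts the compiled object directly. It relies on the `pattern`, `flags` and `groupindex` attributes of compiled patterns. The function name `name_to_match_compiled` and the extra `missing` group are made up for illustration, and the same conditions a) and b) are assumed. Names that do not match are kept with a value of None rather than dropped:

```python
import re

def name_to_match_compiled(compiled, string):
    # Start with every group name mapped to None.
    result = dict.fromkeys(compiled.groupindex)
    # compiled.pattern is the source string the regex was built from.
    for subpattern in re.split(r'\|(?=\(\?P<)', compiled.pattern):
        # Reuse the original flags so options such as re.IGNORECASE carry over.
        match = re.compile(subpattern, compiled.flags).search(string)
        if match:
            result.update(match.groupdict())
    return result

my_text = 'sgasgAAAaoasgosaegnsBBBausgisego'
my_regex = re.compile('(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))|(?P<missing>(?:ZZZ))')
found = name_to_match_compiled(my_regex, my_text)
print(found)
```

Keeping the None entries is a design choice. Drop the `dict.fromkeys` line and start from an empty dict if absent keys are preferred.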

There didn't seem to be an obvious answer, so here's a hack. It needs a bit of finessing, but basically it splits the original regex into its component parts and runs each group regex separately on the original text.

import re

myTextStr = 'sgasgAAAaoasgosaegnsBBBausgisego'
myRegexStr = '(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))'
myRegex = re.compile(myRegexStr)   # This is actually no longer needed

print("Full regex with multiple groups")
print(myRegexStr)

# Use a regex to split the original regex into separate regexes
# based on group names
mySplitGroupsRegexStr = r'\(\?P<(\w+)>(\([\w\W]+?\))\)(?:\||\Z)'
mySplitGroupsRegex = re.compile(mySplitGroupsRegexStr)
mySepRegexesList = re.findall(mySplitGroupsRegex,myRegexStr)

print("\nList of separate regexes")
print(mySepRegexesList)

# Convert separate regexes to a dict with group name as key
# and regex as value
mySepRegexDict = {reg[0]:reg[1] for reg in mySepRegexesList}
print("\nDictionary of separate regexes with group names as keys")
print(mySepRegexDict)

# Step through each key and run the group regex on the original text.
# Results are stored in a dictionary with group name as key and
# the first extracted text as value; groups with no match are skipped.
myGroupRegexOutput = {}
for g,r in mySepRegexDict.items():
    m = re.findall(re.compile(r),myTextStr)
    if m:
        myGroupRegexOutput[g] = m[0]

print("\nOutput of overlapping named group regexes")
print(myGroupRegexOutput)

The resulting output is:

Full regex with multiple groups
(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))

List of separate regexes
[('short', '(?:AAA)'), ('long', '(?:AAA.*BBB)')]

Dictionary of separate regexes with group names as keys
{'short': '(?:AAA)', 'long': '(?:AAA.*BBB)'}

Output of overlapping named group regexes
{'short': 'AAA', 'long': 'AAAaoasgosaegnsBBB'}
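The splitting regex is dense, so here is the same expression written with re.VERBOSE and one comment per part. This is only an annotated restatement; the name `split_groups` is illustrative. Note that it assumes each named group's body is wrapped in its own parentheses, as in the example pattern.

```python
import re

split_groups = re.compile(r"""
    \(\?P<(\w+)>      # start of a named group; capture the name
    (\([\w\W]+?\))    # the parenthesised body, matched lazily
    \)                # closing parenthesis of the named group
    (?:\||\Z)         # then a pipe, or the end of the pattern string
""", re.VERBOSE)

regex_str = '(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))'
parts = split_groups.findall(regex_str)
print(parts)
```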

This might be useful to someone somewhere.
