Extracting the regex pattern of each named group from a compiled regex in Python

Question

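I have a Python regex that contains several named groups. However, a pattern that would match one group can be missed if an earlier group has already matched, because overlaps do not seem to be allowed. For example: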

import re
myText = 'sgasgAAAaoasgosaegnsBBBausgisego'
myRegex = re.compile('(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))')

x = re.findall(myRegex,myText)
print(x)

This produces the output:

[('AAA', '')]

The 'long' group finds no match because 'AAA' was already consumed while matching the preceding 'short' group.
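To see this concretely, here is a minimal sketch (not part of the original code) that uses re.finditer to print which group matched and where. The alternation is tried left to right at each position and matches never overlap, so once 'short' succeeds at the first 'AAA', the 'long' alternative is never attempted there.

```python
import re

myText = 'sgasgAAAaoasgosaegnsBBBausgisego'
myRegex = re.compile(r'(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))')

# Collect every non-overlapping match of the combined pattern.
matches = list(myRegex.finditer(myText))

for m in matches:
    # lastgroup is the name of the group that matched; span() is where it matched.
    print(m.lastgroup, m.span(), m.groupdict())
```

Only one match is reported, and its 'long' entry is None because that alternative never ran.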

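I tried to find a way to allow overlaps but failed. As an alternative, I have been looking for a way to run each named group separately. Something like the following: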

for g in myRegex.groupindex.keys():
    match = re.findall(***regex_for_named_group_g***,myText)

Is it possible to extract the regex for each named group?

Ultimately, I would like to produce a dictionary output (or similar) such as:

{'short':'AAA',
 'long':'AAAaoasgosaegnsBBB'}

Any and all suggestions would be greatly appreciated.

Answer 1

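There really does not seem to be a nicer way to do this, but here is another approach, along the lines of the other answer but somewhat simpler. It will work provided that a) your pattern is always formed as a series of named groups separated by pipes, and b) the named group patterns never contain named groups themselves.

If you are interested in all matches of each pattern, the following would be my approach. The argument to re.split looks for a literal pipe followed by (?P<, the beginning of a named group. It compiles each subpattern and uses the groupindex attribute to extract the name.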

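import re

def nameToMatches(pattern, string):
    # Map each named group to the list of all its matches in string.
    result = dict()
    # Split on a literal pipe that is followed by "(?P<" (start of a named group).
    for subpattern in re.split(r'\|(?=\(\?P<)', pattern):
        rx = re.compile(subpattern)
        name = list(rx.groupindex)[0]  # the only named group in this subpattern
        result[name] = rx.findall(string)
    return result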

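With the given text and pattern, this returns {'long': ['AAAaoasgosaegnsBBB'], 'short': ['AAA']}. A pattern that does not match at all gets an empty list as its value.

If you only want one match per pattern, you can make it a bit simpler: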
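Since the question starts from a compiled regex, a minimal usage sketch may help. It assumes the pattern's source string is read back from the compiled object's pattern attribute, and it repeats the function so the snippet runs on its own. The variable names my_text, my_regex, pieces and matches are chosen only for illustration.

```python
import re

def nameToMatches(pattern, string):
    result = dict()
    for subpattern in re.split(r'\|(?=\(\?P<)', pattern):
        rx = re.compile(subpattern)
        name = list(rx.groupindex)[0]
        result[name] = rx.findall(string)
    return result

my_text = 'sgasgAAAaoasgosaegnsBBBausgisego'
my_regex = re.compile(r'(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))')

# A compiled pattern keeps its source string in the .pattern attribute.
pieces = re.split(r'\|(?=\(\?P<)', my_regex.pattern)
print(pieces)   # the individual named-group subpatterns

matches = nameToMatches(my_regex.pattern, my_text)
print(matches)  # group name -> list of all matches
```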

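def nameToMatch(pattern, string):
    # Map each named group to its first match in string.
    result = dict()
    for subpattern in re.split(r'\|(?=\(\?P<)', pattern):
        match = re.search(subpattern, string)
        if match:
            # groupdict() holds exactly one entry: the subpattern's named group
            result.update(match.groupdict())
    return result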

This gives you {'long': 'AAAaoasgosaegnsBBB', 'short': 'AAA'}. If one of the named groups does not match at all, it will be absent from the dictionary.
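As a small illustration of the missing-key behaviour, the sketch below adds a hypothetical third group named 'missing' that cannot match the sample text. It also shows one possible way (an assumption, not part of the answer) to fill in None for every declared group by iterating over the compiled pattern's groupindex.

```python
import re

def nameToMatch(pattern, string):
    result = dict()
    for subpattern in re.split(r'\|(?=\(\?P<)', pattern):
        match = re.search(subpattern, string)
        if match:
            result.update(match.groupdict())
    return result

myText = 'sgasgAAAaoasgosaegnsBBBausgisego'
# 'missing' is a hypothetical group added only for this illustration.
myPattern = r'(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))|(?P<missing>(?:ZZZ))'

found = nameToMatch(myPattern, myText)
print(found)     # 'missing' is absent because it never matched

# Optionally give every declared group a key, using None for non-matches.
complete = {name: found.get(name) for name in re.compile(myPattern).groupindex}
print(complete)
```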

Answer 2

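There does not seem to be an obvious answer, so here is a hack. It takes some trickery, but essentially it splits the original regex into its component parts and runs each group's regex separately against the original text.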

import re

myTextStr = 'sgasgAAAaoasgosaegnsBBBausgisego'
myRegexStr = '(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))'
myRegex = re.compile(myRegexStr)   # This is actually no longer needed

print("Full regex with multiple groups")
print(myRegexStr)

# Use a regex to split the original regex into separate regexes
# based on group names
mySplitGroupsRegexStr = r'\(\?P<(\w+)>(\([\w\W]+?\))\)(?:\||\Z)'
mySplitGroupsRegex = re.compile(mySplitGroupsRegexStr)
mySepRegexesList = re.findall(mySplitGroupsRegex,myRegexStr)

print("\nList of separate regexes")
print(mySepRegexesList)

# Convert separate regexes to a dict with group name as key
# and regex as value
mySepRegexDict = {reg[0]:reg[1] for reg in mySepRegexesList}
print("\nDictionary of separate regexes with group names as keys")
print(mySepRegexDict)

# Step through each key and run the group regex on the original text.
# Results are stored in a dictionary with group name as key and
# extracted text as value.
myGroupRegexOutput = {}
for g,r in mySepRegexDict.items():
    m = re.findall(r, myTextStr)
    # Skip groups with no match instead of raising IndexError on m[0]
    if m:
        myGroupRegexOutput[g] = m[0]

print("\nOutput of overlapping named group regexes")
print(myGroupRegexOutput)

The resulting output is:

Full regex with multiple groups
(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))

List of separate regexes
[('short', '(?:AAA)'), ('long', '(?:AAA.*BBB)')]

Dictionary of separate regexes with group names as keys
{'short': '(?:AAA)', 'long': '(?:AAA.*BBB)'}

Output of overlapping named group regexes
{'short': 'AAA', 'long': 'AAAaoasgosaegnsBBB'}
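The steps above can also be folded into two small helpers. This is only a sketch under the same assumptions as the script: each named group wraps a single parenthesised body such as (?:...), and the groups are separated by pipes. The names SPLIT_GROUPS, group_patterns and first_matches are hypothetical.

```python
import re

# Same splitting regex as in the script above, written as a raw string.
SPLIT_GROUPS = re.compile(r'\(\?P<(\w+)>(\([\w\W]+?\))\)(?:\||\Z)')

def group_patterns(pattern):
    """Return {group name: inner regex string} for each top-level named group."""
    # findall yields (name, inner_pattern) tuples, which dict() accepts directly.
    return dict(SPLIT_GROUPS.findall(pattern))

def first_matches(pattern, text):
    """Run each group's regex on its own and keep the first match, if any."""
    out = {}
    for name, sub in group_patterns(pattern).items():
        m = re.search(sub, text)
        if m:
            out[name] = m.group(0)
    return out

myTextStr = 'sgasgAAAaoasgosaegnsBBBausgisego'
myRegexStr = r'(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))'

patterns = group_patterns(myRegexStr)
results = first_matches(myRegexStr, myTextStr)
print(patterns)
print(results)
```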

This may be useful to someone somewhere.

Solution 1
3, accepted, 2018-02-19 21:05:21

Solution 2
1, 2018-02-19 02:28:28
