Extracting named-group regex patterns from a compiled regex in Python

Question

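I have a Python regex that contains multiple named groups. However, a pattern that would match one group can be missed if an earlier group has already matched, because overlapping matches do not appear to be allowed. For example: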

import re
myText = 'sgasgAAAaoasgosaegnsBBBausgisego'
myRegex = re.compile('(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))')

x = re.findall(myRegex,myText)
print(x)

This produces the output:

[('AAA', '')]

The "long" group finds no match because "AAA" was already consumed while matching the preceding "short" group.
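To make the behaviour above concrete, here is a small sketch that uses `finditer` to report which alternative produced each match and where. The variable names are illustrative; the text and pattern are the ones from the question.

```python
import re

my_text = 'sgasgAAAaoasgosaegnsBBBausgisego'
my_regex = re.compile('(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))')

# For every match, record the name of the group that matched,
# the span it covers, and the matched text.
matches = [(m.lastgroup, m.span(), m.group()) for m in my_regex.finditer(my_text)]
print(matches)
```

Alternatives are tried left to right at each position, and the scan resumes after the end of the previous match, so only the "short" alternative is ever reported here.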

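I tried to find a way to allow overlapping matches, but failed. As an alternative, I have been looking for a way to run each named group separately, something like the following: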

for g in myRegex.groupindex.keys():
    match = re.findall(***regex_for_named_group_g***,myText)

Is it possible to extract the regex for each named group?

Ultimately, I would like to produce a dictionary (or similar) as output, such as:

{'short':'AAA',
 'long':'AAAaoasgosaegnsBBB'}

Any and all suggestions would be appreciated.

Answer 1

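There really doesn't seem to be a nicer way to do this, but here is another approach, along the lines of the other answer but somewhat simpler. It will work provided that a) your pattern is always formed as a series of named groups separated by pipes, and b) the named group patterns never contain named groups themselves.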

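If you are interested in all matches of each pattern, the following would be my approach. The argument to re.split looks for a literal pipe followed by (?P<, the beginning of a named group. The function compiles each subpattern and uses the groupindex attribute to extract the name.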

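import re

def nameToMatches(pattern, string):
    result = dict()
    # Split on each literal pipe that is immediately followed by "(?P<"
    for subpattern in re.split(r'\|(?=\(\?P<)', pattern):
        rx = re.compile(subpattern)
        # Each subpattern holds exactly one named group; take its name
        name = list(rx.groupindex)[0]
        result[name] = rx.findall(string)
    return result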

With the given text and pattern, this returns {'long': ['AAAaoasgosaegnsBBB'], 'short': ['AAA']}. A pattern that doesn't match at all gets an empty list as its value.
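As a rough illustration of the two points above, the sketch below prints what the split step yields and then calls the function on a pattern with a third alternative that cannot match. The function is copied from above (with a raw string for the split pattern) so the snippet runs on its own; the `missing` group and its `ZZZ` pattern are made up for this example.

```python
import re

def nameToMatches(pattern, string):
    result = dict()
    for subpattern in re.split(r'\|(?=\(\?P<)', pattern):
        rx = re.compile(subpattern)
        name = list(rx.groupindex)[0]
        result[name] = rx.findall(string)
    return result

text = 'sgasgAAAaoasgosaegnsBBBausgisego'
# 'missing' is a hypothetical group added only to show the empty-list case.
pattern = '(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))|(?P<missing>(?:ZZZ))'

# The lookahead keeps "(?P<" in the next piece, so every piece stays a valid regex.
subpatterns = re.split(r'\|(?=\(\?P<)', pattern)
print(subpatterns)

found = nameToMatches(pattern, text)
print(found)
```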

If you only want one match per pattern, you can make it a bit simpler still:

def nameToMatch(pattern, string):
    result = dict()
    for subpattern in re.split(r'\|(?=\(\?P<)', pattern):
        # Only the first match of each subpattern is kept
        match = re.search(subpattern, string)
        if match:
            result.update(match.groupdict())
    return result

This gives you {'long': 'AAAaoasgosaegnsBBB', 'short': 'AAA'}. If one of the named groups doesn't match at all, it will be absent from the dictionary.
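Condition b) from the start of this answer matters because the split happens at every pipe that precedes "(?P<", including one inside another named group. A minimal sketch of that failure mode, using a made-up nested pattern (`outer` and `inner` are hypothetical names):

```python
import re

# Hypothetical pattern in which one named group contains another.
nested = '(?P<outer>a|(?P<inner>b))'

# The split cuts through the middle of the outer group.
parts = re.split(r'\|(?=\(\?P<)', nested)
print(parts)

# Collect the pieces that are no longer valid regexes on their own.
broken = []
for part in parts:
    try:
        re.compile(part)
    except re.error:
        broken.append(part)
print(broken)
```

Both pieces end up with unbalanced parentheses, so neither compiles; patterns of this shape need a different splitting strategy.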

Answer 2

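There didn't seem to be an obvious answer, so here is a hack. It needs a bit of finessing, but basically it splits the original regex into its component parts and runs each group's regex separately on the original text.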

import re

myTextStr = 'sgasgAAAaoasgosaegnsBBBausgisego'
myRegexStr = '(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))'
myRegex = re.compile(myRegexStr)   # This is actually no longer needed

print("Full regex with multiple groups")
print(myRegexStr)

# Use a regex to split the original regex into separate regexes
# based on group names
mySplitGroupsRegexStr = r'\(\?P<(\w+)>(\([\w\W]+?\))\)(?:\||\Z)'
mySplitGroupsRegex = re.compile(mySplitGroupsRegexStr)
mySepRegexesList = re.findall(mySplitGroupsRegex,myRegexStr)

print("\nList of separate regexes")
print(mySepRegexesList)

# Convert separate regexes to a dict with group name as key
# and regex as value
mySepRegexDict = {reg[0]:reg[1] for reg in mySepRegexesList}
print("\nDictionary of separate regexes with group names as keys")
print(mySepRegexDict)

# Step through each key and run the group regex on the original text.
# Results are stored in a dictionary with group name as key and
# extracted text as value.
myGroupRegexOutput = {}
for g,r in mySepRegexDict.items():
    m = re.findall(r,myTextStr)
    if m:   # skip groups that do not match, instead of raising IndexError
        myGroupRegexOutput[g] = m[0]

print("\nOutput of overlapping named group regexes")
print(myGroupRegexOutput)

The resulting output is:

Full regex with multiple groups
(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))

List of separate regexes
[('short', '(?:AAA)'), ('long', '(?:AAA.*BBB)')]

Dictionary of separate regexes with group names as keys
{'short': '(?:AAA)', 'long': '(?:AAA.*BBB)'}

Output of overlapping named group regexes
{'short': 'AAA', 'long': 'AAAaoasgosaegnsBBB'}
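One thing to be aware of: the splitting regex in this answer expects each named group's body to be wrapped in its own pair of parentheses, as in `(?P<short>(?:AAA))`. The sketch below compares the pattern from the question with an unwrapped variant; the variable names are illustrative.

```python
import re

# The splitting regex from this answer, written as a raw string.
split_groups = re.compile(r'\(\?P<(\w+)>(\([\w\W]+?\))\)(?:\||\Z)')

wrapped = '(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))'
unwrapped = '(?P<short>AAA)|(?P<long>AAA.*BBB)'

# Each result is a list of (group name, group regex) tuples.
wrapped_parts = split_groups.findall(wrapped)
unwrapped_parts = split_groups.findall(unwrapped)

print(wrapped_parts)
print(unwrapped_parts)
```

For the unwrapped pattern nothing is extracted, because the splitting regex requires an opening parenthesis right after `(?P<name>`.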

This may be useful to someone somewhere.
