Parsing strings with repeating patterns using regular expressions in Python?

Question

I read a text file line by line with a Python script. What I get is a list of strings, one string per line. I now need to parse each string into more manageable data (i.e. strings, integers).

The strings look similar to this:

"the description (number)" (e.g. "door (0)")
"the description (number|number|number)" (e.g. "window (1|22|4)")
"the description (number|number|number|number)" (e.g. "toilet (2|6|5|10)")

Now what I want is a list of split/parsed strings for each line of the text file that I can process further, for instance:

"window (1|22|4)" -> [ "window", "1", "22", "4" ]

I guess regular expressions are the best fit to accomplish this, and I already managed to come up with this:

(.+)\s+\((\d+)\) , which perfectly matches, for instance, [ "door", "0" ] for "door (0)".

However, some items have more data to parse:

(.+)\s\((\d+)+\|\) , which matches only [ "window", "1" ] for "window (1|22|4)".

How can I repeat the pattern matching for the part (\d+)+\| (i.e. "1|") up to the closing parenthesis, for an undefined number of repetitions of this pattern? The last item to match would be an integer, which could be caught separately with (\d+)\) .

Also, is there a way to match either the simple or the extended case with a single regular expression?

Thanks! And have a nice weekend, everybody!

Answer 1

Here's the regex: \w+ \((\d+\|)*\d+\) . But IMO you should use a mix of regex and str.split:

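import re

data = []
with open("f.txt") as f:
    for line in f:
        # Capture the description and everything between the parentheses.
        word, numbers = re.search(r"(\w+) \(([^)]+)\)", line).groups()
        # Split the captured "1|22|4" part on the pipe character.
        data.append((word, *numbers.split("|")))

print(data) # [('door', '0'), ('window', '1', '22', '4')]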
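A note on why the split helps: in Python's re module, a repeated capturing group keeps only its last repetition, so a single pattern cannot hand back every number as its own group. The sketch below shows that behaviour and then a variant of the regex-plus-split idea. The helper name parse_line is hypothetical, and it assumes the description may contain spaces and that the numbers are digits separated by "|".

```python
import re

# Pattern from the answer above, with the description and last number captured.
pattern = re.compile(r"(\w+) \((\d+\|)*(\d+)\)")

match = pattern.fullmatch("window (1|22|4)")
# The repeated group (\d+\|)* only remembers its final repetition, so "1|" is lost.
repeated_group_result = match.groups()
print(repeated_group_result)  # ('window', '22|', '4')


def parse_line(line):
    """Hypothetical helper: return [description, number, ...] or None if no match."""
    # Validate the whole line, capturing the "1|22|4" part as one group.
    m = re.fullmatch(r"\s*(.+?)\s*\((\d+(?:\|\d+)*)\)\s*", line)
    if m is None:
        return None
    description, numbers = m.groups()
    # Split the validated number list outside the regex.
    return [description, *numbers.split("|")]


print(parse_line("window (1|22|4)"))  # ['window', '1', '22', '4']
print(parse_line("door (0)\n"))       # ['door', '0']
```

Returning None for lines that do not fit the format avoids the AttributeError that calling .groups() on a failed search would raise.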

Answer 2

import re
a = [r'door (0)',
    r'window (1|22|4)',
    r'toilet (2|6|5|10)'
]
for i in a:
    # \w+ matches each run of letters, digits, or underscores.
    print(re.findall(r'(\w+)', i))

Result:

['door', '0']
['window', '1', '22', '4']
['toilet', '2', '6', '5', '10']
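One caveat with the \w+ approach: it works for the sample lines, but it also splits a description that contains spaces or digits. The sketch below illustrates this and shows one possible workaround, assuming the number list is always the last parenthesised part of the line. The helper name split_with_findall is hypothetical.

```python
import re

# \w+ splits a multi-word description into separate items.
print(re.findall(r"\w+", "room 101 (3|4)"))  # ['room', '101', '3', '4']


def split_with_findall(line):
    """Hypothetical helper: keep the description whole, then find the numbers."""
    # Everything before the last " (" is treated as the description.
    description, _, inside = line.strip().rpartition(" (")
    # Only the text after the opening parenthesis is searched for numbers.
    return [description, *re.findall(r"\d+", inside)]


print(split_with_findall("room 101 (3|4)"))  # ['room 101', '3', '4']
```

This variant does not validate the line, so input without a " (" separator would yield an empty description rather than an error.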

Answer 3

Not a raw regex, but another way to extract and process that data is to use a TTP template:

from ttp import ttp

template = """
<macro>
def process_matches(data):
    data["numbers"] = data["numbers"].split("|")
    return data
</macro>

<group name="{{ thing }}" macro="process_matches">
{{ thing }} ({{ numbers }})
</group>
"""

data = """
door (0)
window (1|22|4)
toilet (2|6|5|10)
"""

parser = ttp(data, template)
parser.parse()
print(parser.result(format="pprint")[0])

The above code would produce:

[   {   'door': {'numbers': ['0']},
        'toilet': {'numbers': ['2', '6', '5', '10']},
        'window': {'numbers': ['1', '22', '4']}}]
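For comparison, if an extra dependency is not wanted, a similar nested structure can be built with the standard library alone. This is only a sketch, not a TTP replacement: it reuses the pattern from Answer 1, assumes single-word descriptions, and the variable names are illustrative.

```python
import re
from pprint import pprint

data = """
door (0)
window (1|22|4)
toilet (2|6|5|10)
"""

# Description, then everything between the parentheses.
line_pattern = re.compile(r"(\w+) \(([^)]+)\)")

result = {}
for line in data.splitlines():
    m = line_pattern.fullmatch(line.strip())
    if m is None:
        continue  # skip blank or non-matching lines
    thing, numbers = m.groups()
    # Mirror the shape of the TTP result: {thing: {"numbers": [...]}}
    result[thing] = {"numbers": numbers.split("|")}

pprint(result)
```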

3 answers

Answer 1: score 1, accepted, 2019-12-07 18:05:57
Answer 2: score 0, 2019-12-07 18:08:42
Answer 3: score 0, 2019-12-23 11:05:50
