Python re: find all matches of a repeating pattern

Question

I'm trying to parse semi-structured text in the following format:

text = "A. xxxxxxx\n\nxxx\n\nxxx\n\n\nB. xxxxxx\n\nxxx\n\nxxx\n\n\nC. xxxxxx\n\nxxx\n\nxxx\n\n\nD. xxxxxx\n\nxxx\n\nxxx"

I'd like to have each of these sections as a separate group. I'm currently trying to parse it with a regular expression that looks for the text between an uppercase letter followed by a period and the next such marker:

re.findall(r"([A-Z]\.[\s\S]*?)(?:\n[A-Z]\.|$)", text)

However, this only parses parts A and C:

['A. xxxxxxx\n\nxxx\n\nxxx\n\n', 'C. xxxxxx\n\nxxx\n\nxxx\n\n']

How can I modify the regular expression so that the last part of a match is not excluded from subsequent matches?

I can't split on newlines, because the number of newlines between the subsections can vary.

Answer 1

Use a lookahead and (optionally) get rid of the capturing group:

>>> print(re.findall(r"[A-Z]\.[\s\S]*?(?=\n[A-Z]\.|$)", text))
['A. xxxxxxx\n\nxxx\n\nxxx\n\n', 'B. xxxxxx\n\nxxx\n\nxxx\n\n', 'C. xxxxxx\n\nxxx\n\nxxx\n\n', 'D. xxxxxx\n\nxxx\n\nxxx']

Note the use of (?=\n[A-Z]\.|$), a zero-width lookahead assertion: it only asserts that the given text is present, without consuming it, so the next section's marker is still available to the following match.
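To make the difference concrete, here is a small sketch that runs the question's pattern and the lookahead pattern side by side on the sample text. The variable names (consuming, lookahead, sections) are only illustrative, and the final strip() step is an optional extra that is not part of the answer.

```python
import re

text = ("A. xxxxxxx\n\nxxx\n\nxxx\n\n\n"
        "B. xxxxxx\n\nxxx\n\nxxx\n\n\n"
        "C. xxxxxx\n\nxxx\n\nxxx\n\n\n"
        "D. xxxxxx\n\nxxx\n\nxxx")

# Pattern from the question: the trailing group consumes "\nB.", so the
# scan resumes after the next marker and that section is never matched.
consuming = re.compile(r"([A-Z]\.[\s\S]*?)(?:\n[A-Z]\.|$)")

# Pattern from the answer: the lookahead only checks for the next marker
# and leaves it in place for the following match.
lookahead = re.compile(r"[A-Z]\.[\s\S]*?(?=\n[A-Z]\.|$)")

# First letter of each section that was found.
consumed_labels = [m.group(1)[0] for m in consuming.finditer(text)]
lookahead_labels = [m.group(0)[0] for m in lookahead.finditer(text)]

# The first consuming match swallows the start of section B.
first_consumed = consuming.search(text).group(0)

# Optional: drop the variable amount of trailing whitespace per section.
sections = [s.strip() for s in lookahead.findall(text)]

print(consumed_labels)
print(lookahead_labels)
```

The consuming version yields only the sections labelled A and C, while the lookahead version yields all four.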

Answer 2

Try this, which works as long as a section body contains no period:

[A-Z]\.[^.]*(?<![A-Z])

This one, however, does not depend on what the body contains:

(?m)^[A-Z]\.(?:(?!^[A-Z]\.)[\S\s])*

https://regex101.com/r/t1R28Q/1
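A quick sketch of how the two patterns differ, reading them with [A-Z] character classes and single backslashes. The sample string and the names no_period and tempered are made up for illustration; the sample deliberately puts a period inside the first section's body.

```python
import re

# Hypothetical input: section A contains a period in its body.
sample = "A. first. more\n\nB. second"

# First pattern: [^.]* stops at the next period, wherever it appears.
no_period = re.compile(r"[A-Z]\.[^.]*(?<![A-Z])")

# Second pattern: consume any character until the next line that
# starts with an uppercase letter followed by a period.
tempered = re.compile(r"(?m)^[A-Z]\.(?:(?!^[A-Z]\.)[\S\s])*")

truncated = no_period.findall(sample)
complete = tempered.findall(sample)

print(truncated)  # ['A. first', 'B. second']
print(complete)   # ['A. first. more\n\n', 'B. second']
```

The first pattern cuts section A short at its internal period, whereas the second keeps the whole section, including the blank lines that follow it. Calling strip() on each result would remove that trailing whitespace if it is unwanted.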


Answer 1: 3 (accepted), 2019-08-07 20:22:11

Answer 2: 1
