
Python re findall matching repeating pattern

I'm trying to parse semi-structured text that's of the following format:

text = "A. xxxxxxx\n\nxxx\n\nxxx\n\n\nB. xxxxxx\n\nxxx\n\nxxx\n\n\nC. xxxxxx\n\nxxx\n\nxxx\n\n\nD. xxxxxx\n\nxxx\n\nxxx"

I'd like to have each of these sections as a different group. I'm currently trying to parse with a regular expression that looks for the text between the uppercase letters followed by a period:

re.findall(r"([A-Z]\.[\s\S]*?)(?:\n[A-Z]\.|$)", text)

However, this only parses parts A and C:

['A. xxxxxxx\n\nxxx\n\nxxx\n\n', 'C. xxxxxx\n\nxxx\n\nxxx\n\n']

How can I modify the regular expression such that the last part of the match is not excluded from future matches?

I can't split on newlines because the number of newlines between the subsections can vary.

Use a lookahead and (optionally) get rid of the capturing group:

>>> print(re.findall(r"[A-Z]\.[\s\S]*?(?=\n[A-Z]\.|$)", text))
['A. xxxxxxx\n\nxxx\n\nxxx\n\n', 'B. xxxxxx\n\nxxx\n\nxxx\n\n', 'C. xxxxxx\n\nxxx\n\nxxx\n\n', 'D. xxxxxx\n\nxxx\n\nxxx']

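Note the use of (?=\n[A-Z]\.|$), a zero-width lookahead assertion, which only asserts the presence of the given text without consuming it. In the original pattern, the non-capturing group (?:\n[A-Z]\.|$) consumed the next section's label, so the search resumed after "B." and could not start a match there.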
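
To make the difference concrete, here is a minimal sketch that runs both patterns over the sample text from the question and compares which section labels each one finds. The variable names (consuming, lookahead, consumed_labels, lookahead_labels) are only illustrative and do not come from the original post.

```python
import re

# Sample text from the question, split across lines for readability.
text = (
    "A. xxxxxxx\n\nxxx\n\nxxx\n\n\n"
    "B. xxxxxx\n\nxxx\n\nxxx\n\n\n"
    "C. xxxxxx\n\nxxx\n\nxxx\n\n\n"
    "D. xxxxxx\n\nxxx\n\nxxx"
)

# Original pattern: the trailing non-capturing group consumes "\nB.",
# so the next search starts after the "B." label and skips that section.
consuming = re.compile(r"([A-Z]\.[\s\S]*?)(?:\n[A-Z]\.|$)")

# Lookahead pattern: the next label is only checked, not consumed,
# so it is still available as the start of the next match.
lookahead = re.compile(r"[A-Z]\.[\s\S]*?(?=\n[A-Z]\.|$)")

# Keep just the first two characters (the label) of each matched section.
consumed_labels = [m.group(1)[:2] for m in consuming.finditer(text)]
lookahead_labels = [m.group(0)[:2] for m in lookahead.finditer(text)]

print(consumed_labels)   # ['A.', 'C.']
print(lookahead_labels)  # ['A.', 'B.', 'C.', 'D.']

# Optional: strip the trailing newlines left on each section.
sections = [s.strip() for s in lookahead.findall(text)]
```

If the trailing blank lines in each section are unwanted, the strip() call at the end is one simple way to drop them; whether that is appropriate depends on how the sections are used afterwards.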

Try this


though this one



will never fail.

