
Python re findall matching repeating pattern

I'm trying to parse semi-structured text that's of the following format:

text = "A. xxxxxxx\n\nxxx\n\nxxx\n\n\nB. xxxxxx\n\nxxx\n\nxxx\n\n\nC. xxxxxx\n\nxxx\n\nxxx\n\n\nD. xxxxxx\n\nxxx\n\nxxx"

I'd like to have each of these sections as a different group. I'm currently trying to parse with a regular expression that looks for the text between the uppercase letters followed by a period:

re.findall(r"([A-Z]\.[\s\S]*?)(?:\n[A-Z]\.|$)", text)

However, this only parses parts A and C:

['A. xxxxxxx\n\nxxx\n\nxxx\n\n', 'C. xxxxxx\n\nxxx\n\nxxx\n\n']

How can I modify the regular expression such that the last part of the match is not excluded from future matches?

I can't split by new lines as the number of new lines between the subsections can vary.

Use a lookahead and (optionally) get rid of the capturing group:

>>> print(re.findall(r"[A-Z]\.[\s\S]*?(?=\n[A-Z]\.|$)", text))
['A. xxxxxxx\n\nxxx\n\nxxx\n\n', 'B. xxxxxx\n\nxxx\n\nxxx\n\n', 'C. xxxxxx\n\nxxx\n\nxxx\n\n', 'D. xxxxxx\n\nxxx\n\nxxx']

Note the use of (?=\n[A-Z]\.|$), a zero-width lookahead assertion, which only asserts that the given text is present without consuming it, so the next match can start at that section header.
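To see why the original pattern skips every other section, it can help to look at what each full match swallows. The following is a minimal sketch assuming Python 3 and the text from the question; the names consuming, lookahead, consumed_tails and sections are only illustrative.

```python
import re

text = ("A. xxxxxxx\n\nxxx\n\nxxx\n\n\nB. xxxxxx\n\nxxx\n\nxxx\n\n\n"
        "C. xxxxxx\n\nxxx\n\nxxx\n\n\nD. xxxxxx\n\nxxx\n\nxxx")

# Original pattern: the non-capturing group still consumes "\nB.",
# so the next search starts after that header and B is never matched.
consuming = re.compile(r"([A-Z]\.[\s\S]*?)(?:\n[A-Z]\.|$)")

# Lookahead version: the next header is only checked, not consumed.
lookahead = re.compile(r"[A-Z]\.[\s\S]*?(?=\n[A-Z]\.|$)")

# Last three characters of each full match made by the original pattern.
consumed_tails = [m.group(0)[-3:] for m in consuming.finditer(text)]

# Sections found by the lookahead version.
sections = lookahead.findall(text)

print(consumed_tails)             # ['\nB.', '\nD.']
print([s[:2] for s in sections])  # ['A.', 'B.', 'C.', 'D.']
```

The first list shows that the headers of B and D were eaten as the tail of the preceding match, which is why only A and C came back.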

Try this

[A-Z]\.[^.]*(?<![A-Z])

though this one

(?m)^[A-Z]\.(?:(?!^[A-Z]\.)[\S\s])*

https://regex101.com/r/t1R28Q/1

will never fail just because a section body contains a period, which the first pattern cannot handle.
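As a rough check of the difference between the two patterns, the sketch below runs both on a hypothetical input (sample is made up for illustration, not taken from the question) in which one section body contains a period. It assumes Python 3.

```python
import re

# Hypothetical input: section A's body contains a period ("v1.2").
sample = "A. uses v1.2 here\n\nxxx\n\n\nB. xxxxxx\n\nxxx"

# First pattern: [^.]* cannot cross a period, so section A is cut short.
no_period = re.findall(r"[A-Z]\.[^.]*(?<![A-Z])", sample)

# Second pattern: a header only counts at the start of a line, and the
# negative lookahead stops the match right before the next header.
line_anchored = re.findall(r"(?m)^[A-Z]\.(?:(?!^[A-Z]\.)[\S\s])*", sample)

print(no_period)      # ['A. uses v1', 'B. xxxxxx\n\nxxx']
print(line_anchored)  # ['A. uses v1.2 here\n\nxxx\n\n\n', 'B. xxxxxx\n\nxxx']
```

Note that the second pattern keeps the blank lines that trail each section; calling rstrip() on each result is one way to drop them if they are not wanted.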
