I'm trying to parse semi-structured text that's of the following format:
text = "A. xxxxxxx\n\nxxx\n\nxxx\n\n\nB. xxxxxx\n\nxxx\n\nxxx\n\n\nC. xxxxxx\n\nxxx\n\nxxx\n\n\nD. xxxxxx\n\nxxx\n\nxxx"
I'd like to have each of these sections as a different group. I'm currently trying to parse with a regular expression that looks for the text between the uppercase letters followed by a period:
re.findall(r"([A-Z]\.[\s\S]*?)(?:\n[A-Z]\.|$)", text)
However, this only parses parts A and C:
['A. xxxxxxx\n\nxxx\n\nxxx\n\n', 'C. xxxxxx\n\nxxx\n\nxxx\n\n']
How can I modify the regular expression such that the last part of the match is not excluded from future matches?
I can't split on newlines, as the number of newlines between the subsections can vary.
Use a lookahead and (optionally) get rid of the capturing group:
>>> print(re.findall(r"[A-Z]\.[\s\S]*?(?=\n[A-Z]\.|$)", text))
['A. xxxxxxx\n\nxxx\n\nxxx\n\n', 'B. xxxxxx\n\nxxx\n\nxxx\n\n', 'C. xxxxxx\n\nxxx\n\nxxx\n\n', 'D. xxxxxx\n\nxxx\n\nxxx']
Note the use of (?=\n[A-Z]\.|$), a zero-width lookahead assertion, which only asserts the presence of the given text without consuming it, so the next match can start at that text.
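To see why the lookahead makes the difference, it may help to compare what each pattern actually consumes. The sketch below reuses the text and the two patterns from this thread; the variable names (consuming, lookahead, first_full_match) are only illustrative.

```python
import re

text = ("A. xxxxxxx\n\nxxx\n\nxxx\n\n\nB. xxxxxx\n\nxxx\n\nxxx\n\n\n"
        "C. xxxxxx\n\nxxx\n\nxxx\n\n\nD. xxxxxx\n\nxxx\n\nxxx")

# Original pattern: the trailing non-capturing group still consumes "\nB.",
# so the scan resumes after the "B." header and section B is skipped.
consuming = re.compile(r"([A-Z]\.[\s\S]*?)(?:\n[A-Z]\.|$)")

# Lookahead version: "\nB." is only checked, not consumed,
# so the next match can begin at that header.
lookahead = re.compile(r"[A-Z]\.[\s\S]*?(?=\n[A-Z]\.|$)")

# The full match (group 0) of the original pattern runs into the next header.
first_full_match = consuming.search(text).group(0)

consumed_sections = consuming.findall(text)
lookahead_sections = lookahead.findall(text)

print(repr(first_full_match))
print(len(consumed_sections), len(lookahead_sections))
```

Printing the full match rather than the group shows the next section's header being swallowed by the original pattern, which is why every second section goes missing.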
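Try this:
[A-Z]\.[^.]*(?<![A-Z])
It stops at the next period, so it only works if the section bodies contain no periods. This one, though,
(?m)^[A-Z]\.(?:(?!^[A-Z]\.)[\S\s])*
https://regex101.com/r/t1R28Q/1
does not depend on that: it matches from a header at the start of a line up to the next such header, so it should not fail on periods inside a section.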
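A small comparison of these two patterns, as a sketch only: the sample string below is made up for illustration (it is not from the question) and deliberately puts a period inside the first section's body.

```python
import re

# Hypothetical sample: section A contains a period in its body.
sample = "A. first. item\n\nxxx\n\n\nB. second\n\nxxx"

# First pattern: [^.]* stops at the next period, wherever it occurs.
simple = re.compile(r"[A-Z]\.[^.]*(?<![A-Z])")

# Second pattern: consume any character as long as a new
# "letter + period" header does not start at the beginning of a line.
tempered = re.compile(r"(?m)^[A-Z]\.(?:(?!^[A-Z]\.)[\S\s])*")

simple_sections = simple.findall(sample)
tempered_sections = tempered.findall(sample)

print(simple_sections)
print(tempered_sections)
```

On this input the first pattern cuts section A short at the period inside its body, while the second keeps each section whole, including the trailing newlines before the next header (call .rstrip() on each item if those are unwanted).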