Python re findall matching repeating pattern

Question

I'm trying to parse semi-structured text that's of the following format:

text = "A. xxxxxxx\n\nxxx\n\nxxx\n\n\nB. xxxxxx\n\nxxx\n\nxxx\n\n\nC. xxxxxx\n\nxxx\n\nxxx\n\n\nD. xxxxxx\n\nxxx\n\nxxx"

I'd like to have each of these sections as a different group. I'm currently trying to parse with a regular expression that looks for the text between the uppercase letters followed by a period:

re.findall(r"([A-Z]\.[\s\S]*?)(?:\n[A-Z]\.|$)", text)

However, this only parses parts A and C:

['A. xxxxxxx\n\nxxx\n\nxxx\n\n', 'C. xxxxxx\n\nxxx\n\nxxx\n\n']

How can I modify the regular expression such that the last part of the match is not excluded from future matches?

I can't split by new lines as the number of new lines between the subsections can vary.

Answer 1

Use a lookahead and (optionally) get rid of the capturing group:

>>> print(re.findall(r"[A-Z]\.[\s\S]*?(?=\n[A-Z]\.|$)", text))
['A. xxxxxxx\n\nxxx\n\nxxx\n\n', 'B. xxxxxx\n\nxxx\n\nxxx\n\n', 'C. xxxxxx\n\nxxx\n\nxxx\n\n', 'D. xxxxxx\n\nxxx\n\nxxx']

Note the use of (?=\n[A-Z]\.|$), a zero-width lookahead assertion, which only asserts the presence of the given text without consuming it, so the next match can start at that heading.
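To make the difference concrete, here is a minimal sketch that runs both patterns on the text from the question. The variable names `consuming` and `lookahead` are only illustrative.

```python
import re

text = (
    "A. xxxxxxx\n\nxxx\n\nxxx\n\n\n"
    "B. xxxxxx\n\nxxx\n\nxxx\n\n\n"
    "C. xxxxxx\n\nxxx\n\nxxx\n\n\n"
    "D. xxxxxx\n\nxxx\n\nxxx"
)

# Original pattern: the trailing group consumes "\nB.", so the scan
# resumes after that heading and section B can no longer be matched.
consuming = re.findall(r"([A-Z]\.[\s\S]*?)(?:\n[A-Z]\.|$)", text)

# Lookahead version: the next heading is checked but not consumed,
# so the following match can start on it.
lookahead = re.findall(r"[A-Z]\.[\s\S]*?(?=\n[A-Z]\.|$)", text)

print([s[:2] for s in consuming])   # ['A.', 'C.']
print([s[:2] for s in lookahead])   # ['A.', 'B.', 'C.', 'D.']
```

Each match from the lookahead version still ends with the blank lines that precede the next heading; calling `.rstrip()` on each item removes them if they are not wanted.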

Answer 2

Try this

[A-Z]\.[^.]*(?<![A-Z])

though this one

(?m)^[A-Z]\.(?:(?!^[A-Z]\.)[\S\s])*

https://regex101.com/r/t1R28Q/1

will not be cut short by a period inside the section text, which the first pattern is.
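As a rough sketch of how the multiline pattern could be used from Python, assuming the same `text` as in the question: the flag `re.MULTILINE` is passed instead of the inline `(?m)`, and the names `section_re` and `sections` are made up for this example. Note that a body line that itself starts with an uppercase letter and a period would be treated as a new heading.

```python
import re

text = (
    "A. xxxxxxx\n\nxxx\n\nxxx\n\n\n"
    "B. xxxxxx\n\nxxx\n\nxxx\n\n\n"
    "C. xxxxxx\n\nxxx\n\nxxx\n\n\n"
    "D. xxxxxx\n\nxxx\n\nxxx"
)

# re.MULTILINE is the flag form of (?m): ^ matches at the start of every line.
section_re = re.compile(r"^[A-Z]\.(?:(?!^[A-Z]\.)[\S\s])*", re.MULTILINE)

# Each raw match runs right up to the next heading, so it keeps the blank
# lines in between; rstrip() drops that trailing whitespace.
sections = [m.rstrip() for m in section_re.findall(text)]

for section in sections:
    print(repr(section))
```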

solution1
3 ACCEPTED 2019-08-07 20:22:11

solution2
1
