Matching groups in a Python regex lookahead

Question

I have a more or less raw download of text data from a WordPress blog, structured as follows:

POST_ID_1 TITLE_1 DATE_1

This is the text from the first post ..

POST_ID_2 TITLE_2 DATE_2

This is the text from the second post ..

I wrote a regex to capture the POST_ID, TITLE, and DATE. My goal is to create a Python dictionary structured as:

posts = {'DATE_1': {'post_id': POST_ID_1,
                    'title': TITLE_1,
                    'text': 'This is the text from the first post ..'
                    }
        }

The regex to capture the headers (POST_ID, TITLE, DATE) is as follows:

header_regex_raw = r"""(\d+)\s(.*(?=January|February|March|April|May|June|July|August|September|October|November|December))(January|February|March|April|May|June|July|August|September|October|November|December)(\s\d+\,\s\d{4}\b)"""

My thought is to do something like re.findall(header_regex_raw + r"(.*(?={}))".format(header_regex_raw), data), but unfortunately this doesn't work as planned.

How do I capture multiple groups in a lookahead? What's a better way to create the above dict?

Answer 1

I found a clean function for this in the Python re module: re.split.

import re
# data: the raw export as one string. Groups: id, title, full date, month, day and year.
header_regex_raw = r"""(\d+)\s(.+?(?=January|February|March|April|May|June|July|August|September|October|November|December))((January|February|March|April|May|June|July|August|September|October|November|December)(\s\d+\,\s\d{4}\b))"""
header_text_header = re.compile(header_regex_raw)
ret = header_text_header.split(data.strip())

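This does exactly what I want. Because the pattern contains capturing groups, re.split returns a flat list in which each header's captured groups are followed by the text between that header and the next one. The first element of the list is whatever precedes the first header (an empty string if the data starts with a header).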
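To get from that flat list to the dict in the question, one option is to walk the list in steps of six (five groups plus one text chunk). This is only a minimal sketch: the MONTHS constant, the build_posts helper, and the sample data are made up for illustration, and it assumes every header sits on one line and that no two posts share a date (a later post would overwrite an earlier one with the same date key).

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Same five groups as the pattern above:
# id, title, full date, month name, day and year.
header_re = re.compile(
    r"(\d+)\s(.+?(?=" + MONTHS + r"))((" + MONTHS + r")(\s\d+\,\s\d{4}\b))"
)


def build_posts(data):
    """Hypothetical helper: turn the flat re.split list into a dict keyed by date."""
    parts = header_re.split(data.strip())
    # parts[0] is the text before the first header, so start at index 1.
    step = header_re.groups + 1  # five captured groups plus the post text
    posts = {}
    for i in range(1, len(parts), step):
        post_id, title, date, _month, _day_year, text = parts[i:i + step]
        posts[date] = {
            "post_id": post_id,       # kept as a string
            "title": title.strip(),   # the lazy title group keeps a trailing space
            "text": text.strip(),
        }
    return posts


# Made-up sample in the layout described in the question.
sample = """101 First post June 14, 2015

This is the text from the first post ..

102 Second post July 2, 2015

This is the text from the second post ..
"""

posts = build_posts(sample)
```

If dates can repeat, keying by post_id instead (and storing the date inside the inner dict) avoids losing posts.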
