
Matching groups in a Python regex lookahead

I have a more or less raw download of text data from a WordPress blog, structured as follows:

POST_ID_1 TITLE_1 DATE_1

This is the text from the first post ..

POST_ID_2 TITLE_2 DATE_2

This is the text from the second post ..

I wrote some regex to capture the POST_ID, TITLE, and DATE. My goal is to create a Python dictionary structured as:

posts = {'DATE_1': {'post_id': POST_ID_1,
                    'title': TITLE_1,
                    'text': 'This is the text from the first post ..'
                    }
        }

The regex to capture the headers (POST_ID, TITLE, DATE) is as follows:

header_regex_raw = r"""(\d+)\s(.*(?=January|February|March|April|May|June|July|August|September|October|November|December))(January|February|March|April|May|June|July|August|September|October|November|December)(\s\d+\,\s\d{4}\b)"""

My thought was to do something like re.findall(header_regex_raw + r"(.*(?={}))".format(header_regex_raw), data), but unfortunately this doesn't work as planned.

How do I capture multiple groups in a lookahead? What's a better way to create the above dict?

I found a clean function for this in the Python re module: re.split.

import re

header_regex_raw = r"""(\d+)\s(.+?(?=January|February|March|April|May|June|July|August|September|October|November|December))((January|February|March|April|May|June|July|August|September|October|November|December)(\s\d+\,\s\d{4}\b))"""
header_text_header = re.compile(header_regex_raw)
# data is the string holding the full text of the download
ret = header_text_header.split(data.strip())

This does exactly what I want. Because the pattern contains capturing groups, re.split keeps them in its result: the returned list holds the header groups for one post, then the text that follows that header, then the next header's groups, and so on.
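To get from that flat list to the dictionary described in the question, the list can be walked in fixed-size chunks. The following is a minimal sketch rather than a definitive implementation. The helper name build_posts, the MONTHS constant, and the sample text are made up for illustration. It assumes the data starts with a header, that every header matches the pattern, and that no two posts share a date (a later post would otherwise overwrite an earlier one, since the dict is keyed by date). Converting post_id to int and stripping whitespace are choices of this sketch, not something the original code does.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Same header pattern as above, with the month alternation factored out.
header_re = re.compile(
    r"(\d+)\s(.+?(?=" + MONTHS + r"))"
    r"((" + MONTHS + r")(\s\d+\,\s\d{4}\b))"
)


def build_posts(data):
    """Build {date: {'post_id', 'title', 'text'}} from the raw download."""
    parts = header_re.split(data.strip())
    posts = {}
    # parts[0] is whatever precedes the first header. After that, each post
    # takes six items: post_id, title, full date, month, day-and-year, text.
    for i in range(1, len(parts), 6):
        post_id, title, date, _month, _day_year, text = parts[i:i + 6]
        posts[date] = {
            "post_id": int(post_id),
            "title": title.strip(),  # the lazy title group keeps a trailing space
            "text": text.strip(),
        }
    return posts


# Hypothetical sample in the layout described in the question.
sample = """101 My First Post January 5, 2015

This is the text from the first post ..

102 Second Post March 12, 2016

This is the text from the second post ..
"""

posts = build_posts(sample)
```

One caveat with this pattern: a title that contains a month name inside another word (for example "Maybe") would cut the title short and the header would then fail to match, so real data may need a stricter header pattern.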

