
Exclude some characters from a regex group

I have a text that contains many articles concatenated into a single string. Each new article starts with a title line such as = Article 1 = , followed by section headings such as = = Article 1 Section 1 = = , = = Article 1 Section 2 = = and so on. I want to split this string into one string per article.

For that I am using a regex split:

import re
pattern = r"=[\s\w\'\(\)]+="
l = re.compile(pattern).split(test_data)

But this isn't giving me the desired result: the text is split on section and subsection headings as well as on article titles. I tried to exclude runs of multiple = signs from the match, but without success, and I'm not sure how to proceed. I have pasted sample data (two articles) here: Robert Boulder and Kiss You ( One Direction song )

This regex should do the job:

^ *\= [^\=]* \= *$

See it working here:

https://regex101.com/r/HJPHFA/1

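It matches a '=' followed by a space, any number of characters that are NOT '=' (the [^\=] part), then another space and another '='. A section heading such as = = Career = = fails to match because its first '=' and space are followed immediately by a second '=', which [^\=] cannot consume. The pattern also allows optional spaces at the start and end of the line, because your sample text has leading and trailing spaces on some lines. Note that ^ and $ need to match at every line boundary, so in Python the pattern has to be compiled with the re.MULTILINE flag.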
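As a minimal sketch of how this could be wired into Python, assuming the data looks like the short made-up sample below (the `sample` string, the `title_re` name and the `split_articles` helper are all illustrative, not taken from the question's real data):

```python
import re

# Hypothetical sample laid out like the data described in the question:
# single "=" for article titles, "= =" for section headings.
sample = (
    " = Article 1 = \n"
    " Intro text of article 1 . \n"
    " = = Article 1 Section 1 = = \n"
    " Section text . \n"
    " = Article 2 = \n"
    " Intro text of article 2 . \n"
)

# re.MULTILINE lets ^ and $ match at the start and end of every line,
# not only at the start and end of the whole string.
title_re = re.compile(r"^ *= [^=]* = *$", re.MULTILINE)

def split_articles(text):
    """Split on article title lines only; drop empty pieces."""
    parts = title_re.split(text)
    return [p.strip() for p in parts if p.strip()]

articles = split_articles(sample)
for article in articles:
    print(repr(article))
```

The title lines themselves are dropped by the split. If the titles are needed too, one option is to wrap the part between the '=' signs in a capturing group, since `re.split` then returns the captured text between the pieces.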
