Exclude some characters from a regex group
I have a text that contains many articles concatenated into a single string. Each new article starts with = Article 1 =, followed by = = Article 1 Section 1 = =, = = Article 1 Section 2 = =, and so on. I want to split this string and create a string for each article.
For that I am using regex split:
import re
pattern = r"=[\s\w\'\(\)]+="  # raw string, so \s and \w reach the regex engine unchanged
l = re.compile(pattern).split(test_data)
But this isn't giving me the desired result: the text is also being split on section and subsection headings. I tried excluding runs of multiple = characters from the match, but had no success and am not sure how to proceed. I have pasted sample data (two articles) here - Robert Boulder and Kiss You ( One Direction song )
This regex should do the job:
^ *\= [^\=]* \= *$
See it working here:
https://regex101.com/r/HJPHFA/1
It matches a '=' followed by a space, any number of characters that are NOT '=' (the [^\=] part), then another space and another '='. Because the text between the two '=' cannot itself contain '=', section headings such as = = Article 1 Section 1 = = do not match. It also allows optional spaces at the start and end of the line, because your sample text has leading and trailing spaces on some lines. Note that ^ and $ have to match at line boundaries, so in Python the pattern must be compiled with re.MULTILINE.
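As a rough illustration of how the pattern above could be applied from Python, here is a minimal sketch. The test_data string is a made-up stand-in for the linked sample (the real data is not reproduced here), and the names title_pattern, articles and titles are only illustrative.

```python
import re

# Hypothetical sample standing in for the real data: two articles,
# each with section and subsection headings.
test_data = (
    " = Robert Boulder = \n"
    " Intro text for the first article . \n"
    " = = Career = = \n"
    " Career text . \n"
    " = = = Television = = = \n"
    " Television text . \n"
    " = Kiss You ( One Direction song ) = \n"
    " Intro text for the second article . \n"
    " = = Background = = \n"
    " Background text . \n"
)

# The pattern from the answer. re.MULTILINE makes ^ and $ match at the
# start and end of every line instead of only at the ends of the string.
title_pattern = re.compile(r"^ *\= [^\=]* \= *$", re.MULTILINE)

# Split on top-level titles only; drop the empty piece before the first title.
articles = [part.strip() for part in title_pattern.split(test_data) if part.strip()]

# The titles themselves, with the surrounding '=' and spaces stripped.
titles = [m.group(0).strip(" =") for m in title_pattern.finditer(test_data)]

for title, body in zip(titles, articles):
    print(title, "->", len(body), "characters")
```

re.split discards the matched title lines, which is why the titles are collected separately with finditer. Section headings such as = = Career = = stay inside the article bodies.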