Exclude some characters from a regex group

I have a text that contains many articles concatenated into a single string. Each new article starts with a title line of the form = Article 1 =, followed by section headings such as = = Article 1 Section 1 = =, = = Article 1 Section 2 = =, and so on. I want to split this string and create one string per article.

For that I am using a regex split:

import re

# test_data holds the concatenated articles as one string
pattern = r"=[\s\w'()]+="
l = re.compile(pattern).split(test_data)

But this isn't giving me the desired result: the text is split on sections and subsections as well. I tried to exclude runs of multiple = from matching, but had no success and am not sure how to proceed. I have pasted sample data (two articles) here: Robert Boulder and Kiss You ( One Direction song ).

This regex should do the job:

^ *\= [^\=]* \= *$

See it working here:

https://regex101.com/r/HJPHFA/1

It matches a '=' followed by a space, any number of characters that are NOT '=' (the [^\=]* part), then another space and another '='. It also allows optional spaces at the start and end of the line, because your sample text has leading and trailing spaces on some lines. Section headings such as = = Section = = are skipped because the pattern allows only a single '=' on each side of the title, and the ^ and $ anchors force the whole line to match (this requires multiline mode).
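For completeness, here is a minimal sketch of how this pattern could be plugged into the Python split from the question. It assumes multiline mode is wanted (so that ^ and $ match at every line boundary, which in Python means passing re.MULTILINE). The sample string, the TITLE name and the split_articles helper are made up for illustration and are not from the original post.

```python
import re

# Made-up sample in the same layout as the question's data:
# "= Title =" starts an article, "= = Section = =" is a section heading.
sample = (
    " = Article 1 = \n"
    " First article body text . \n"
    " = = Article 1 Section 1 = = \n"
    " More text for the first article . \n"
    " = Article 2 ( with parentheses ) = \n"
    " Second article body text . \n"
    " = = Article 2 Section 1 = = \n"
    " More text for the second article . \n"
)

# Same pattern as above, without the redundant backslashes.
# re.MULTILINE makes ^ and $ match at the start and end of each line.
TITLE = re.compile(r"^ *= [^=]* = *$", re.MULTILINE)


def split_articles(text):
    """Split text on article title lines and drop empty pieces."""
    parts = TITLE.split(text)
    return [part.strip() for part in parts if part.strip()]


articles = split_articles(sample)
for article in articles:
    print(repr(article))
```

Note that split() drops the title lines themselves. If the titles are needed too, TITLE.findall(sample) returns them in the same order as the articles.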
