Python regular expression, repeating data

Question

This seems like a simple task, but I have sunk enough time into it to finally ask for help:

I have a long text file in roughly this format:

Start of test xyz:

multiple lines of blah blah blah

Start of test wzy:

multiple lines of blah blah blah

Start of test qqq:

multiple lines of blah blah blah

I want to grab all the stuff after the "Start of test" declaration, and this expression gets me about half of what I need:

re.findall(r'Start of test(.+?)Start of test', curfile, re.S)

The most obvious issue is that I'm consuming the start of what I need to search for next, thus yielding approximately half of the results I wanted. Assuming I could avoid that, I still can't figure out how to get the last chunk, where there is no "Start of test" to end the match on.

I assume I need to be using negative lookahead assertions, but I am not having much luck figuring out the proper way to use them. I've been trying stuff like:

re.findall(r'Start of test(.+?)(?!Start of test)', curfile, re.S)

which gives no useful results.
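Both symptoms described in the question can be reproduced on a small input. The sketch below is only an illustration: the sample string is hypothetical and stands in for the real contents of curfile.

```python
import re

# Hypothetical sample standing in for the real file contents (curfile).
curfile = (
    "Start of test xyz:\nline a\nline b\n"
    "Start of test wzy:\nline c\n"
    "Start of test qqq:\nline d\n"
)

# First attempt: the trailing literal consumes the next marker, so the
# scan resumes after it and the section that marker opened is skipped.
consumed = re.findall(r'Start of test(.+?)Start of test', curfile, re.S)
print(consumed)

# Second attempt: the lazy .+? takes a single character, and the negative
# lookahead succeeds immediately because the next text is not a marker.
negative = re.findall(r'Start of test(.+?)(?!Start of test)', curfile, re.S)
print(negative)
```

The first call captures only the xyz section, and the second captures one space per marker, which matches the "no useful results" observation.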

Answer 1

I think this is the pattern you are looking for:

Start of test(.+?)(?=Start of test|$)

Then your new code should be:

re.findall(r'Start of test(.+?)(?=Start of test|$)', curfile, re.S)
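As a quick check, here is the lookahead pattern with the `|$` alternative applied to a small, made-up input (the sample string is an assumption, not the asker's file):

```python
import re

# Hypothetical sample standing in for the real file contents (curfile).
curfile = (
    "Start of test xyz:\nline a\nline b\n"
    "Start of test wzy:\nline c\n"
    "Start of test qqq:\nline d\n"
)

# The lookahead ends each chunk at the next marker without consuming it;
# the "$" alternative lets the final chunk end at the end of the string.
pattern = r'Start of test(.+?)(?=Start of test|$)'
chunks = re.findall(pattern, curfile, re.S)

for chunk in chunks:
    print(repr(chunk))
```

All three sections are returned. One detail to be aware of: without re.M, `$` also matches just before a trailing newline at the end of the string, so the last chunk here loses its final newline. Using `\Z` instead of `$` would keep it, if that matters.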


Answer 2

You want a lookahead pattern. See https://docs.python.org/2/library/re.html where it describes (?=...):

(?=...)
Matches if ... matches next, but doesn't consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it's followed by 'Asimov'.

So for your case:

re.findall(r'Start of test(.+?)(?=Start of test)', curfile, re.S)

Note that this relies on the non-greedy quantifier (.+?); a greedy .+ would run through to the last "Start of test" in the file.
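A lookahead on the marker alone still cannot match the final section, since nothing follows it. The sketch below shows that on a made-up input, then one possible variation that adds `\Z` (end of string) as an alternative and captures the test name separately. The sample string and the assumption that names are single `\w+` words are illustrative only.

```python
import re

# Hypothetical sample standing in for the real file contents (curfile).
curfile = (
    "Start of test xyz:\nline a\nline b\n"
    "Start of test wzy:\nline c\n"
    "Start of test qqq:\nline d\n"
)

# Lookahead on the marker only: the last section has no following marker,
# so it is not matched.
without_end = re.findall(r'Start of test(.+?)(?=Start of test)', curfile, re.S)
print(len(without_end))

# Variation: allow end-of-string (\Z) as a terminator and split each
# section into (name, body). Assumes names are single words.
named = re.findall(
    r'Start of test (\w+):\n(.*?)(?=Start of test|\Z)', curfile, re.S
)
for name, body in named:
    print(name, repr(body))
```

With two capture groups, re.findall returns a list of (name, body) tuples, which can be passed to dict() if lookup by test name is wanted.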

Answer 3

It might be more useful to use re.finditer to get an iterable of match objects, and then use mo.start(0) on each match object to find out where in the original string the current match is. Then you can recover everything in between matches in the following way. Notice that my pattern only matches a single "Start of test" line:

import re

pattern = r'^Start of test (.*):$'
matches = re.finditer(pattern, curfile, re.M)
i = 0  # where the last match ended
names = []
in_between = []
for mo in matches:
    j = mo.start(0)
    in_between.append(curfile[i:j])  # store what came before this match
    i = mo.end(0)  # store the new "end of match" position
    names.append(mo.group(1))  # store the matched name
in_between.append(curfile[i:])  # store the rest of the file

# in_between[0] is what came before the first test
chunks = in_between[1:]
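The same bookkeeping can also be delegated to re.split, which is a different technique from the finditer loop above: when the pattern contains a capture group, the captured text is returned between the split pieces. A minimal sketch, using a made-up sample in place of the real curfile:

```python
import re

# Hypothetical sample standing in for the real file contents (curfile).
curfile = (
    "Start of test xyz:\nline a\nline b\n"
    "Start of test wzy:\nline c\n"
    "Start of test qqq:\nline d\n"
)

# Splitting on the marker line; because of the capture group, the result
# alternates: [before_first, name1, chunk1, name2, chunk2, ...]
parts = re.split(r'^Start of test (.*):$', curfile, flags=re.M)

names = parts[1::2]   # captured test names
chunks = parts[2::2]  # text following each marker line
by_name = dict(zip(names, chunks))

print(names)
print(repr(by_name['wzy']))
```

Each chunk begins with the newline that ended its marker line, because the pattern stops at `$` before that newline; call .strip() on the chunks if that is unwanted.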

3 answers

Answer 1
1 Accepted 2015-10-18 16:52:06

Answer 2
0 2015-10-18 16:38:02

Answer 3
0 2015-10-18 16:42:50
