Python regex, repeating data

Question

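This seems like a simple task, but I have sunk enough time into it to finally ask for help:

I have a long text file in roughly the following format:

Start of test xyz:

multiple lines of blah blah blah

Start of test wzy:

multiple lines of blah blah blah

Start of test QQ:

multiple lines of blah blah blah

I want to grab everything after each "Start of test" declaration, and this expression gets me about half of what I need: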

re.findall(r'Start of test(.+?)Start of test', curfile, re.S)

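The most obvious problem is that I am consuming the start of the next block I need to search for, so I get about half of the results I want. Even assuming I can avoid that, I still cannot figure out how to get the last block, which has no "Start of test" after it to end the match.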
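The "half of the results" symptom can be reproduced on a small string. The sketch below uses a made-up four-block sample (the names A to D and the variable sample are illustrative, not from the real file) and shows that every second header is swallowed as the terminator of the previous match:

```python
import re

# Hypothetical stand-in for the real file contents.
sample = (
    "Start of test A:\nbody A\n"
    "Start of test B:\nbody B\n"
    "Start of test C:\nbody C\n"
    "Start of test D:\nbody D\n"
)

# The trailing 'Start of test' is consumed by each match, so the scan
# resumes after the B header and only finds the next header at C.
consumed = re.findall(r'Start of test(.+?)Start of test', sample, re.S)

for block in consumed:
    print(repr(block))
# Only the A and C blocks are returned; B is skipped and D has no terminator.
```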

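I assume I need a negative lookahead assertion, but I am not having much luck figuring out the right way to use one. I have been trying things like:

re.findall(r'Start of test(.+?)(?!Start of test)', curfile, re.S)

which gives no useful results.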
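A likely reason the negative lookahead attempt is unhelpful: the lazy .+? stops as soon as the rest of the pattern can succeed, and (?!Start of test) succeeds at almost every position. The following sketch, again on a made-up sample, shows that each match then captures only a single character:

```python
import re

# Hypothetical stand-in for the real file contents.
sample = (
    "Start of test A:\nbody A\n"
    "Start of test B:\nbody B\n"
    "Start of test C:\nbody C\n"
)

# .+? takes one character (the space after the header text), then the
# negative lookahead is satisfied immediately because the following
# text is not 'Start of test'.
fragments = re.findall(r'Start of test(.+?)(?!Start of test)', sample, re.S)

print(fragments)
```

A negative lookahead only says "the delimiter does not start here"; it does not tell the lazy quantifier to keep going until the delimiter appears, which is what a positive lookahead does.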

Answer 1

I think this is the pattern you are looking for:

Start of test(.+?)(?=Start of test|$)

So your new code should be:

re.findall(r'Start of test(.+?)(?=Start of test|$)', curfile, re.S)

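As a quick check of the lookahead pattern with the |$ alternative, here is a minimal sketch on a made-up three-block string (sample and the block names are assumptions for illustration). Every block is returned, including the last one:

```python
import re

# Hypothetical stand-in for the real file contents.
sample = (
    "Start of test A:\nbody A\n"
    "Start of test B:\nbody B\n"
    "Start of test C:\nbody C\n"
)

# (?=...) checks for the next header without consuming it, so the next
# match can start there; $ lets the final block end at the end of the text.
chunks = re.findall(r'Start of test(.+?)(?=Start of test|$)', sample, re.S)

# Strip surrounding whitespace for display and comparison.
cleaned = [chunk.strip() for chunk in chunks]
print(cleaned)
```

Without re.M, $ can also match just before a newline at the very end of the string, so the last chunk may lack its trailing newline; stripping each chunk sidesteps that difference.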

Answer 2

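You want a lookahead pattern. See the description of (?=...) at https://docs.python.org/2/library/re.html:

(?=...)
Matches if ... matches next, but doesn't consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it's followed by 'Asimov'.

So for your case: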

re.findall(r'Start of test(.+?)(?=Start of test)', curfile, re.S)

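Note that this relies on the non-greedy quantifier (.+?); a greedy .+ would run through to the last "Start of test" in the file.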
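To make the "does not consume" behaviour concrete, the sketch below first runs the Isaac/Asimov example from the documentation and then applies the lookahead to a made-up three-block sample. It also shows that, with no end-of-string alternative in the lookahead, the final block is not returned:

```python
import re

# The lookahead is checked but not included in the match.
hit = re.search(r'Isaac (?=Asimov)', 'Isaac Asimov')
miss = re.search(r'Isaac (?=Asimov)', 'Isaac Newton')
print(repr(hit.group(0)), miss)

# Hypothetical stand-in for the real file contents.
sample = (
    "Start of test A:\nbody A\n"
    "Start of test B:\nbody B\n"
    "Start of test C:\nbody C\n"
)

# Each block must be followed by another header, so block C is dropped.
blocks = re.findall(r'Start of test(.+?)(?=Start of test)', sample, re.S)
print(len(blocks))
```

Adding |$ inside the lookahead, as in Answer 1, is one way to pick up the last block as well.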

Answer 3

It may be more useful to use re.finditer to get an iterator of match objects, then call mo.start(0) on each match object to find where the current match sits in the original string. You can then recover everything between the matches as follows. Note that my pattern matches only a single "Start of test" line:

import re

pattern = r'^Start of test (.*):$'
matches = re.finditer(pattern, curfile, re.M)
i = 0  # where the last match ended
names = []
in_between = []
for mo in matches:
    j = mo.start(0)
    in_between.append(curfile[i:j])  # store what came before this match
    i = mo.end(0)  # store the new "end of match" position
    names.append(mo.group(1))  # store the matched name
in_between.append(curfile[i:])  # store the rest of the file

# in_between[0] is what came before the first test
chunks = in_between[1:]
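The same finditer idea can be wrapped in a small helper that pairs each test name with its body. This is only a sketch: split_tests, HEADER and the sample text are hypothetical names introduced here, and the header format is assumed to be exactly "Start of test NAME:" on its own line.

```python
import re

# Assumed header format: 'Start of test NAME:' occupying a whole line.
HEADER = re.compile(r'^Start of test (.*):$', re.M)

def split_tests(text):
    """Return a list of (name, body) tuples, one per header found."""
    matches = list(HEADER.finditer(text))
    results = []
    for idx, mo in enumerate(matches):
        body_start = mo.end(0)
        # The body runs to the next header, or to the end of the text.
        if idx + 1 < len(matches):
            body_end = matches[idx + 1].start(0)
        else:
            body_end = len(text)
        results.append((mo.group(1), text[body_start:body_end]))
    return results

sample = (
    "preamble\n"
    "Start of test xyz:\nline 1\nline 2\n"
    "Start of test QQ:\nline 3\n"
)
pairs = split_tests(sample)
for name, body in pairs:
    print(name, repr(body.strip()))
```

Keeping the names next to the bodies avoids having to line up two parallel lists afterwards, and any text before the first header is simply ignored.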
