How to capture all repetitions of a subpattern in regex
I have a formatted string that can contain repeated sections of arbitrary length. For example, this is a sample of the metadata I want to parse.
File Name: chb03_34.edf
File Start Time: 01:51:23
File End Time: 2:51:23
Number of Seizures in File: 1
Seizure Start Time: 1982 seconds
Seizure End Time: 2029 seconds
File Name: chb23_07.edf
File Start Time: 11:03:16
File End Time: 11:45:56
Number of Seizures in File: 0
File Name: chb23_08.edf
File Start Time: 11:48:05
File End Time: 14:40:27
Number of Seizures in File: 2
Seizure 1 Start Time: 325 seconds
Seizure 1 End Time: 345 seconds
Seizure 2 Start Time: 5104 seconds
Seizure 2 End Time: 5151 seconds
File Name: chb23_09.edf
File Start Time: 14:40:47
File End Time: 18:41:13
Number of Seizures in File: 4
Seizure 1 Start Time: 2589 seconds
Seizure 1 End Time: 2660 seconds
Seizure 2 Start Time: 6885 seconds
Seizure 2 End Time: 6947 seconds
Seizure 3 Start Time: 8505 seconds
Seizure 3 End Time: 8532 seconds
Seizure 4 Start Time: 9580 seconds
Seizure 4 End Time: 9664 seconds
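So far I have written a regular expression that captures the header lines of each block, but for the repeated seizure lines it only keeps the last repetition in the block (if there is one).
import re
summary = "a formatted string read"  # placeholder for the metadata text shown above
# Raw string so that the backslashes reach the regex engine unchanged
pattern = r"File Name\: (.+)\nFile Start Time\: (.+)\nFile End Time\: (.+)\nNumber of Seizures in File\: (.+)(?:\n|\r|)(?:Seizure(?: | \d )Start Time\: (\d+) seconds\nSeizure(?: | \d )End Time\: (\d+) seconds(?:\n|\r|))*"
pattern = re.compile(pattern)
for p in pattern.finditer(summary):
    print(p.groups())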
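However, for the last block, for example, this pattern only captures the start and end times of seizure 4. Is it possible to capture every repetition of the repeated subpattern?
Edit: Using the regex module and the pattern that The fourth bird posted in a comment, I can match the string, but I get a lot of None values in the rows for the repeated lines, as well as rows that are entirely None. How can I get rid of those, or fill in the appropriate values?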
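The behaviour described above can be reproduced in a few lines. This is a minimal sketch with a shortened, made-up input (only the start-time lines), not the full pattern from the question: a capture group inside a repeated group keeps only the text of its last repetition in `re`, while the whole match still contains every repetition and can be searched again.

```python
import re

# Shortened, illustrative input: two repeated "Start Time" lines.
text = (
    "Seizure 1 Start Time: 325 seconds\n"
    "Seizure 2 Start Time: 5104 seconds\n"
)

# The capture group sits inside a repeated non-capturing group.
m = re.match(r"(?:Seizure \d Start Time: (\d+) seconds\n)*", text)

# Only the last repetition survives in the group.
last_only = m.group(1)

# The full match still holds every line, so a second pass recovers all values.
all_values = re.findall(r"Start Time: (\d+) seconds", m.group(0))

print(last_only)   # 5104
print(all_values)  # ['325', '5104']
```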
('chb23_06.edf', '08:57:57', '11:02:43', '1', '3962', '4075')
(None, None, None, None, None, None)
('chb23_07.edf', '11:03:16', '11:45:56', '0', None, None)
(None, None, None, None, None, None)
('chb23_08.edf', '11:48:05', '14:40:27', '2', '325', '345')
(None, None, None, None, '5104', '5151')
(None, None, None, None, None, None)
('chb23_09.edf', '14:40:47', '18:41:13', '4', '2589', '2660')
(None, None, None, None, '6885', '6947')
(None, None, None, None, '8505', '8532')
(None, None, None, None, '9580', '9664')
(None, None, None, None, None, None)
('chb23_10.edf', '18:41:40', '22:41:40', '0', None, None)
(None, None, None, None, None, None)
('chb23_16.edf', '13:46:32', '17:46:32', '0', None, None)
(None, None, None, None, None, None)
('chb23_17.edf', '17:46:42', '21:16:29', '0', None, None)
(None, None, None, None, None, None)
('chb23_19.edf', '02:28:28', '6:28:28', '0', None, None)
(None, None, None, None, None, None)
('chb23_20.edf', '06:28:36', '7:52:05', '0', None, None)
(None, None, None, None, None, None)
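One possible way to tidy up rows like these is to post-process them: skip the rows that are entirely None and attach the rows that only carry seizure values to the preceding file row. The sketch below assumes the six-column tuple layout shown above; `fold_rows` and the dictionary keys are hypothetical names chosen for illustration.

```python
def fold_rows(rows):
    """Merge continuation rows into the preceding file record."""
    records = []
    for name, start, end, count, sz_start, sz_end in rows:
        if name is None and sz_start is None:
            continue  # row is entirely None: an empty match, drop it
        if name is not None:
            # A row with a file name starts a new record.
            records.append({
                "name": name,
                "start": start,
                "end": end,
                "count": int(count),
                "seizures": [],
            })
        if sz_start is not None and records:
            # Seizure values belong to the most recent file record.
            records[-1]["seizures"].append((int(sz_start), int(sz_end)))
    return records


# A few rows copied from the output above.
rows = [
    ('chb23_08.edf', '11:48:05', '14:40:27', '2', '325', '345'),
    (None, None, None, None, '5104', '5151'),
    (None, None, None, None, None, None),
    ('chb23_10.edf', '18:41:40', '22:41:40', '0', None, None),
    (None, None, None, None, None, None),
]

records = fold_rows(rows)
for record in records:
    print(record["name"], record["count"], record["seizures"])
```

A continuation row that appears before any file row is silently ignored here; whether that is the right choice depends on the data.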
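EDIT2: I implemented the solution from the previously accepted answer, but it has some rough edges and does not work on some files. I have uploaded one of the problematic files. You can find a paste of a problematic metadata sample here.
Using re, you can capture the optional repetitions of the Seizure lines in a single group, and then extract the digits for the seconds from that group:
Pattern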
File Name: (.+)\nFile Start Time: (.+)\nFile End Time: (.+)\nNumber of Seizures in File: (.+)((?:\nSeizure (?:\d )?Start Time: \d+ seconds\nSeizure (?:\d )?End Time: \d+ seconds)*)
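The pattern matches:
File Name: (.+)\n
    Group 1: everything after "File Name: ", followed by a newline
File Start Time: (.+)\n
    Group 2: everything after "File Start Time: ", followed by a newline
File End Time: (.+)\n
    Group 3: everything after "File End Time: ", followed by a newline
Number of Seizures in File: (.+)
    Group 4: everything after "Number of Seizures in File: "
(
    Open group 5
(?:
    Non-capturing group, matched as a whole and then optionally repeated
\nSeizure (?:\d )?Start Time: \d+ seconds\n
    A newline, the Seizure Start Time line, and a trailing newline
Seizure (?:\d )?End Time: \d+ seconds
    The Seizure End Time line
)*
    Close the non-capturing group and optionally repeat it
)
    Close group 5
For example: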
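import re
# summary is the metadata text from the question
pattern = re.compile(
    r"File Name: (.+)\nFile Start Time: (.+)\nFile End Time: (.+)\n"
    r"Number of Seizures in File: (.+)"
    r"((?:\nSeizure (?:\d )?Start Time: \d+ seconds\nSeizure (?:\d )?End Time: \d+ seconds)*)"
)
for m in pattern.finditer(summary):
    print(m.group(1))
    print(m.group(2))
    print(m.group(3))
    print(m.group(4))
    print(re.findall(r"(\d+) seconds", m.group(5)))
The output for each match looks like this (the list is empty when there are no Seizure values, which you can also test for):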
chb23_08.edf
11:48:05
14:40:27
2
['325', '345', '5104', '5151']
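As a sketch of how this could be taken one step further, the same pattern can be written with `re.VERBOSE` so that the group-by-group explanation lives next to the pattern, and the flat list of seconds can be paired into (start, end) tuples. The names `PATTERN`, `SAMPLE` and `parse` are chosen here for illustration, and the sample is a shortened copy of the data from the question.

```python
import re

# Same pattern as above in verbose mode; literal spaces must be escaped.
PATTERN = re.compile(
    r"""
    File\ Name:\ (.+)\n                     # group 1: file name
    File\ Start\ Time:\ (.+)\n              # group 2: file start time
    File\ End\ Time:\ (.+)\n                # group 3: file end time
    Number\ of\ Seizures\ in\ File:\ (.+)   # group 4: declared seizure count
    (                                       # group 5: all seizure lines
      (?:
        \nSeizure\ (?:\d\ )?Start\ Time:\ \d+\ seconds
        \nSeizure\ (?:\d\ )?End\ Time:\ \d+\ seconds
      )*
    )
    """,
    re.VERBOSE,
)

SAMPLE = (
    "File Name: chb23_08.edf\n"
    "File Start Time: 11:48:05\n"
    "File End Time: 14:40:27\n"
    "Number of Seizures in File: 2\n"
    "Seizure 1 Start Time: 325 seconds\n"
    "Seizure 1 End Time: 345 seconds\n"
    "Seizure 2 Start Time: 5104 seconds\n"
    "Seizure 2 End Time: 5151 seconds\n"
    "File Name: chb23_07.edf\n"
    "File Start Time: 11:03:16\n"
    "File End Time: 11:45:56\n"
    "Number of Seizures in File: 0\n"
)


def parse(summary):
    """Return (file name, declared count, [(start, end), ...]) per block."""
    records = []
    for m in PATTERN.finditer(summary):
        seconds = [int(s) for s in re.findall(r"(\d+) seconds", m.group(5))]
        # Values alternate start, end, start, end, ... so pair them up.
        pairs = list(zip(seconds[0::2], seconds[1::2]))
        records.append((m.group(1), int(m.group(4)), pairs))
    return records


records = parse(SAMPLE)
for name, count, pairs in records:
    print(name, count, pairs)
```

Comparing the declared count with `len(pairs)` is a cheap sanity check for files where the pattern does not match every seizure line.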
If you are using the regex module, I suggest using repeated captures.
I also added named groups for clarity:
import regex
pattern = regex.compile(
    r"File Name: (?P<name>.+)\n"
    r"File Start Time: (?P<start>.+)\n"
    r"File End Time: (?P<end>.+)\n"
    r"Number of Seizures in File: (?P<count>\d+)\n"
    r"(?:\n|(?:Seizure (?:\d )?Start Time: (?P<seizure_start>\d+) seconds\n"
    r"Seizure (?:\d )?End Time: (?P<seizure_end>\d+) seconds\n)*)"
)
summary = """File Name: chb03_34.edf
File Start Time: 01:51:23
File End Time: 2:51:23
Number of Seizures in File: 1
Seizure Start Time: 1982 seconds
Seizure End Time: 2029 seconds
File Name: chb23_07.edf
File Start Time: 11:03:16
File End Time: 11:45:56
Number of Seizures in File: 0
File Name: chb23_08.edf
File Start Time: 11:48:05
File End Time: 14:40:27
Number of Seizures in File: 2
Seizure 1 Start Time: 325 seconds
Seizure 1 End Time: 345 seconds
Seizure 2 Start Time: 5104 seconds
Seizure 2 End Time: 5151 seconds
File Name: chb23_09.edf
File Start Time: 14:40:47
File End Time: 18:41:13
Number of Seizures in File: 4
Seizure 1 Start Time: 2589 seconds
Seizure 1 End Time: 2660 seconds
Seizure 2 Start Time: 6885 seconds
Seizure 2 End Time: 6947 seconds
Seizure 3 Start Time: 8505 seconds
Seizure 3 End Time: 8532 seconds
Seizure 4 Start Time: 9580 seconds
Seizure 4 End Time: 9664 seconds
"""
for match in pattern.finditer(summary):
    print("Name:", match.group("name"))
    print("Seizure Count", match.group("count"))
    # captures() returns every repetition of a group, not just the last one
    seizures = tuple(
        zip(match.captures("seizure_start"), match.captures("seizure_end")))
    for i, (start, end) in enumerate(seizures, start=1):
        print(f"Seizure #{i}: {start} -> {end}")
Prints:
Name: chb03_34.edf
Seizure Count 1
Seizure #1: 1982 -> 2029
Name: chb23_07.edf
Seizure Count 0
Name: chb23_08.edf
Seizure Count 2
Seizure #1: 325 -> 345
Seizure #2: 5104 -> 5151
Name: chb23_09.edf
Seizure Count 4
Seizure #1: 2589 -> 2660
Seizure #2: 6885 -> 6947
Seizure #3: 8505 -> 8532
Seizure #4: 9580 -> 9664