
How to capture all repetitions of a subpattern in regex

I have a formatted string in which one part can repeat an arbitrary number of times. Here is a sample of the metadata I want to parse:

File Name: chb03_34.edf
File Start Time: 01:51:23
File End Time: 2:51:23
Number of Seizures in File: 1
Seizure Start Time: 1982 seconds
Seizure End Time: 2029 seconds

File Name: chb23_07.edf
File Start Time: 11:03:16
File End Time: 11:45:56
Number of Seizures in File: 0

File Name: chb23_08.edf
File Start Time: 11:48:05
File End Time: 14:40:27
Number of Seizures in File: 2
Seizure 1 Start Time: 325 seconds
Seizure 1 End Time: 345 seconds
Seizure 2 Start Time: 5104 seconds
Seizure 2 End Time: 5151 seconds

File Name: chb23_09.edf
File Start Time: 14:40:47
File End Time: 18:41:13
Number of Seizures in File: 4
Seizure 1 Start Time: 2589 seconds
Seizure 1 End Time: 2660 seconds
Seizure 2 Start Time: 6885 seconds
Seizure 2 End Time: 6947 seconds
Seizure 3 Start Time: 8505 seconds
Seizure 3 End Time: 8532 seconds
Seizure 4 Start Time: 9580 seconds
Seizure 4 End Time: 9664 seconds

So far I've created a regex that captures the header lines of each block, but when a block contains seizures it only captures the last one.

import re

summary = "a formatted string read"


# Raw string, so that backslashes such as \d reach the regex engine unchanged
pattern = r"File Name\: (.+)\nFile Start Time\: (.+)\nFile End Time\: (.+)\nNumber of Seizures in File\: (.+)(?:\n|\r|)(?:Seizure(?: | \d )Start Time\: (\d+) seconds\nSeizure(?: | \d )End Time\: (\d+) seconds(?:\n|\r|))*"
pattern = re.compile(pattern)

for p in pattern.finditer(summary):
    print(p.groups())

But for the last block, for example, this pattern only captures the start and end time of seizure 4. Is it possible to capture every repetition of a repeated subpattern?
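For context, this is how Python's re treats a capture group that sits inside a repetition: the group is overwritten on each iteration, so only the last iteration is kept. A minimal sketch of that behavior, using a toy comma-separated input that is not part of the metadata above:

```python
import re

# A capture group inside a repetition is overwritten on every iteration,
# so only the text from the last iteration survives in the match object.
m = re.fullmatch(r"(?:(\d+),)*", "1,2,3,")
last_only = m.groups()

# Capturing the whole repeated span and scanning it again recovers every value.
m2 = re.fullmatch(r"((?:\d+,)*)", "1,2,3,")
all_values = re.findall(r"\d+", m2.group(1))

print(last_only)   # ('3',)
print(all_values)  # ['1', '2', '3']
```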

EDIT: Using the regex module with the pattern that The fourth bird posted in the comments, I can match the strings, but the repeated rows contain many None values and some rows are entirely None. How can I get rid of those, or fill in the appropriate values?

('chb23_06.edf', '08:57:57', '11:02:43', '1', '3962', '4075')
(None, None, None, None, None, None)
('chb23_07.edf', '11:03:16', '11:45:56', '0', None, None)
(None, None, None, None, None, None)
('chb23_08.edf', '11:48:05', '14:40:27', '2', '325', '345')
(None, None, None, None, '5104', '5151')
(None, None, None, None, None, None)
('chb23_09.edf', '14:40:47', '18:41:13', '4', '2589', '2660')
(None, None, None, None, '6885', '6947')
(None, None, None, None, '8505', '8532')
(None, None, None, None, '9580', '9664')
(None, None, None, None, None, None)
('chb23_10.edf', '18:41:40', '22:41:40', '0', None, None)
(None, None, None, None, None, None)
('chb23_16.edf', '13:46:32', '17:46:32', '0', None, None)
(None, None, None, None, None, None)
('chb23_17.edf', '17:46:42', '21:16:29', '0', None, None)
(None, None, None, None, None, None)
('chb23_19.edf', '02:28:28', '6:28:28', '0', None, None)
(None, None, None, None, None, None)
('chb23_20.edf', '06:28:36', '7:52:05', '0', None, None)
(None, None, None, None, None, None)

EDIT2: I applied the solution from the previously accepted answer, but it has some rough edges and does not work for some files. I've uploaded one of the problematic files. A paste with a sample of the problematic metadata is available here.

Using re, you can capture all the optional repetitions of the Seizure lines in a single group, and then extract the digit values for the seconds from that group:

Pattern

File Name: (.+)\nFile Start Time: (.+)\nFile End Time: (.+)\nNumber of Seizures in File: (.+)((?:\nSeizure (?:\d )?Start Time: \d+ seconds\nSeizure (?:\d )?End Time: \d+ seconds)*)

The pattern matches:

  • File Name: (.+)\n Group 1, match the rest of the line after File Name: followed by a newline
  • File Start Time: (.+)\n Group 2, match the rest of the line after File Start Time: followed by a newline
  • File End Time: (.+)\n Group 3, match the rest of the line after File End Time: followed by a newline
  • Number of Seizures in File: (.+) Group 4, match the rest of the line after Number of Seizures in File:
  • ( Group 5
    • (?: Non-capturing group, matched as a whole and then optionally repeated
      • \nSeizure (?:\d )?Start Time: \d+ seconds\n Match a newline, the Seizure Start Time line (with an optional seizure number), and a newline at the end
      • Seizure (?:\d )?End Time: \d+ seconds Match the Seizure End Time line
    • )* Close the non-capturing group and repeat it zero or more times
  • ) Close group 5


For example

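import re

pattern = re.compile(
    r"File Name: (.+)\nFile Start Time: (.+)\nFile End Time: (.+)\n"
    r"Number of Seizures in File: (.+)"
    r"((?:\nSeizure (?:\d )?Start Time: \d+ seconds\nSeizure (?:\d )?End Time: \d+ seconds)*)"
)

# summary is the metadata text from the question
for m in pattern.finditer(summary):
    print(m.group(1))
    print(m.group(2))
    print(m.group(3))
    print(m.group(4))
    # group 5 holds all Seizure lines; pull every seconds value out of it
    print(re.findall(r"(\d+) seconds", m.group(5)))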

The output per match looks like this (the list is empty when the block has no Seizure lines, which you can test for):

chb23_08.edf
11:48:05
14:40:27
2
['325', '345', '5104', '5151']
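The flat list of seconds can be turned into (start, end) pairs, since the values alternate between start and end. The following is a minimal sketch under that assumption; the helper name parse_summary, the dictionary keys, and the shortened sample input are illustrative choices and not part of the original answer:

```python
import re

# Same pattern as in the answer above, split across lines for readability.
pattern = re.compile(
    r"File Name: (.+)\n"
    r"File Start Time: (.+)\n"
    r"File End Time: (.+)\n"
    r"Number of Seizures in File: (.+)"
    r"((?:\nSeizure (?:\d )?Start Time: \d+ seconds"
    r"\nSeizure (?:\d )?End Time: \d+ seconds)*)"
)

# Two blocks copied from the question, used here as sample input.
summary = (
    "File Name: chb23_07.edf\n"
    "File Start Time: 11:03:16\n"
    "File End Time: 11:45:56\n"
    "Number of Seizures in File: 0\n"
    "\n"
    "File Name: chb23_08.edf\n"
    "File Start Time: 11:48:05\n"
    "File End Time: 14:40:27\n"
    "Number of Seizures in File: 2\n"
    "Seizure 1 Start Time: 325 seconds\n"
    "Seizure 1 End Time: 345 seconds\n"
    "Seizure 2 Start Time: 5104 seconds\n"
    "Seizure 2 End Time: 5151 seconds\n"
)


def parse_summary(text):
    """Hypothetical helper: return one dict per file block."""
    records = []
    for m in pattern.finditer(text):
        # Group 5 is an empty string when the block has no Seizure lines,
        # so findall returns an empty list in that case.
        seconds = [int(s) for s in re.findall(r"(\d+) seconds", m.group(5))]
        # Values alternate start, end, start, end, ... so pair them up.
        seizures = list(zip(seconds[0::2], seconds[1::2]))
        records.append({
            "name": m.group(1),
            "start": m.group(2),
            "end": m.group(3),
            "count": int(m.group(4)),
            "seizures": seizures,
        })
    return records


records = parse_summary(summary)
for record in records:
    print(record["name"], record["seizures"])
```

With this sample input, the first record has an empty seizures list and the second has [(325, 345), (5104, 5151)]. Comparing record["count"] with len(record["seizures"]) is a cheap way to spot blocks the pattern did not fully match.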

If you're using the third-party regex module, I would suggest using repeated captures: match.captures() returns every value a group matched, not only the last one.

I've also added named groups for clarity:

import regex

pattern = regex.compile(
    r"File Name: (?P<name>.+)\n"
    r"File Start Time: (?P<start>.+)\n"
    r"File End Time: (?P<end>.+)\n"
    r"Number of Seizures in File: (?P<count>\d+)\n"
    r"(?:\n|(?:Seizure (?:\d )?Start Time: (?P<seizure_start>\d+) seconds\n"
    r"Seizure (?:\d )?End Time: (?P<seizure_end>\d+) seconds\n)*)"
)

summary = """File Name: chb03_34.edf
File Start Time: 01:51:23
File End Time: 2:51:23
Number of Seizures in File: 1
Seizure Start Time: 1982 seconds
Seizure End Time: 2029 seconds

File Name: chb23_07.edf
File Start Time: 11:03:16
File End Time: 11:45:56
Number of Seizures in File: 0

File Name: chb23_08.edf
File Start Time: 11:48:05
File End Time: 14:40:27
Number of Seizures in File: 2
Seizure 1 Start Time: 325 seconds
Seizure 1 End Time: 345 seconds
Seizure 2 Start Time: 5104 seconds
Seizure 2 End Time: 5151 seconds

File Name: chb23_09.edf
File Start Time: 14:40:47
File End Time: 18:41:13
Number of Seizures in File: 4
Seizure 1 Start Time: 2589 seconds
Seizure 1 End Time: 2660 seconds
Seizure 2 Start Time: 6885 seconds
Seizure 2 End Time: 6947 seconds
Seizure 3 Start Time: 8505 seconds
Seizure 3 End Time: 8532 seconds
Seizure 4 Start Time: 9580 seconds
Seizure 4 End Time: 9664 seconds
"""

for match in pattern.finditer(summary):
    print("Name:", match.group("name"))
    print("Seizure Count", match.group("count"))
    # captures() returns every repetition of the group, in order
    seizures = tuple(
        zip(match.captures("seizure_start"), match.captures("seizure_end")))
    for i, (start, end) in enumerate(seizures, start=1):
        print(f"Seizure #{i}: {start} -> {end}")

Prints:

Name: chb03_34.edf
Seizure Count 1
Seizure #1: 1982 -> 2029
Name: chb23_07.edf
Seizure Count 0
Name: chb23_08.edf
Seizure Count 2
Seizure #1: 325 -> 345
Seizure #2: 5104 -> 5151
Name: chb23_09.edf
Seizure Count 4
Seizure #1: 2589 -> 2660
Seizure #2: 6885 -> 6947
Seizure #3: 8505 -> 8532
Seizure #4: 9580 -> 9664
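If installing regex is not an option, a different technique works with the standard library alone: split the text into blocks on blank lines and scan each block separately. This is a sketch, not what the answer above does. It assumes that blocks are separated by blank lines and that every Start line has a matching End line; the names FIELD, SEIZURE, and parse_blocks are hypothetical.

```python
import re

# Header lines: "<label>: <value>", tolerating spaces around the value.
FIELD = re.compile(
    r"^(File Name|File Start Time|File End Time|Number of Seizures in File)"
    r":[ \t]*(.*?)[ \t]*$",
    re.M,
)
# Seizure lines, with or without a seizure number.
SEIZURE = re.compile(
    r"^Seizure(?: \d+)? (Start|End) Time:[ \t]*(\d+) seconds", re.M
)

# Two blocks copied from the question, used here as sample input.
summary = """File Name: chb03_34.edf
File Start Time: 01:51:23
File End Time: 2:51:23
Number of Seizures in File: 1
Seizure Start Time: 1982 seconds
Seizure End Time: 2029 seconds

File Name: chb23_07.edf
File Start Time: 11:03:16
File End Time: 11:45:56
Number of Seizures in File: 0
"""


def parse_blocks(text):
    """Hypothetical helper: parse each blank-line-separated block on its own."""
    records = []
    for block in re.split(r"\n\s*\n", text.strip()):
        fields = dict(FIELD.findall(block))
        if "File Name" not in fields:
            continue  # skip anything that is not a file block
        times = SEIZURE.findall(block)
        starts = [int(value) for kind, value in times if kind == "Start"]
        ends = [int(value) for kind, value in times if kind == "End"]
        records.append({
            "name": fields["File Name"],
            "seizures": list(zip(starts, ends)),
        })
    return records


records = parse_blocks(summary)
for record in records:
    print(record["name"], record["seizures"])
```

Because each block is parsed independently, a block with an unexpected line only affects its own record and does not stop the remaining blocks from matching.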
