Regex in Python: extract a multiline part from a text with repeating similar sections

Question

Thanks in advance for the help. I am using Python regular expressions to extract a part from a text which has the following layout:

(A lot of information)

time:    150

C-FXY

-- information ---

E-END

(A lot of information)

time:   5000

C-FXY

**--- INFORMATION I WANT TO EXTRACT ---**

E-END

(A lot of information)

time:  13000

C-FXY

-- information ---

E-END

(A lot of information)

I need to extract everything between C-FXY and E-END for the time step 5000. For that I am using the following Python 3.6 statement:

time_step = '5000'
text_part = re.search(r'time.*'+time_step+'.*C-FXY(.*?)E-END', text, re.DOTALL).group(1)

Unfortunately, what I get is the section between C-FXY and E-END from the 13000 time step, not the one I want from time: 5000.

Any help would be much appreciated. :)

Answer 1

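The problem is that your regex contains a greedy .* between the time part and C-FXY. With re.DOTALL it consumes as much as it can and only backtracks as far as the last C-FXY in the text, so you get the last block instead of the first one after the time line.

It should be enough to use the non-greedy version here: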

text_part = re.search(r'time.*'+time_step+'.*?C-FXY(.*?)E-END', text, re.DOTALL).group(1)
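
To see the difference concretely, here is a minimal sketch on a shortened, made-up version of the sample text. It also anchors the time value with `time:\s*` and `\b` and passes it through `re.escape`; those are additions that are not part of the pattern above, made on the assumption that a step such as 5000 should not match inside 15000.

```python
import re

# Shortened stand-in for the question's text (hypothetical contents).
text = """header
time:    150
C-FXY
first block
E-END
time:   5000
C-FXY
wanted block
E-END
time:  13000
C-FXY
last block
E-END
footer"""

time_step = '5000'

# Greedy '.*' before C-FXY runs to the end of the text and backtracks
# only as far as the LAST C-FXY, so the 13000 block is captured.
greedy = re.search(r'time.*' + time_step + r'.*C-FXY(.*?)E-END',
                   text, re.DOTALL).group(1)

# Lazy '.*?' stops at the FIRST C-FXY after the matching time line.
lazy = re.search(r'time:\s*' + re.escape(time_step) + r'\b.*?C-FXY(.*?)E-END',
                 text, re.DOTALL).group(1)

print(repr(greedy))
print(repr(lazy))
```

Note that `re.search` returns `None` when nothing matches, so real code would probably check the result before calling `.group(1)`.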

That said, I would not run a multiline search over the whole file here. I would read the file line by line up to the time: 5000 line, then up to the next C-FXY line, store everything from there up to the E-END line, and stop processing there.
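
The line-by-line approach could be sketched as a small state machine. This is only an illustration: the function name `extract_block`, its parameters, and the sample lines are hypothetical, and it assumes that the markers and the `time:` entries each sit on a line of their own.

```python
def extract_block(lines, time_step, start='C-FXY', end='E-END'):
    """Return the lines between start and end for the given time step.

    `lines` is any iterable of strings, for example an open file object.
    Returns None if the time step or the markers are not found.
    """
    state = 'seek_time'  # seek_time -> seek_start -> collect
    block = []
    for line in lines:
        stripped = line.strip()
        if state == 'seek_time':
            # Compare the value after 'time:' exactly, so 5000 != 15000.
            if stripped.startswith('time:') and stripped[5:].strip() == str(time_step):
                state = 'seek_start'
        elif state == 'seek_start':
            if stripped == start:
                state = 'collect'
        else:  # state == 'collect'
            if stripped == end:
                return block  # stop reading as soon as the block ends
            block.append(line.rstrip('\n'))
    return None


# Made-up sample; with a real file you would pass the open file object.
sample = """time:    150
C-FXY
first block
E-END
time:   5000
C-FXY
wanted block
E-END
""".splitlines(keepends=True)

print(extract_block(sample, 5000))
```

Because the function returns as soon as it reaches E-END, the rest of a large file is never read.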

Answer 2

You can solve it using the following code:

import re

text = """(A lot of information)

time:    150

C-FXY

-- information ---

E-END

(A lot of information)

time:   5000

C-FXY

**--- INFORMATION I WANT TO EXTRACT ---**

E-END

(A lot of information)

time:  13000

C-FXY

-- information ---

E-END

(A lot of information)"""

# re.DOTALL lets '.' match newlines, so each block can span several lines.
pattern = re.compile(r"C-FXY(.*?)E-END", re.DOTALL)

results = pattern.findall(text)

Now, if you print the results:

for i, r in enumerate(results):
    print(f"Result {i}:\n'{r}'")

The output would be:

Result 0:
'

-- information ---

'
Result 1:
'

**--- INFORMATION I WANT TO EXTRACT ---**

'
Result 2:
'

-- information ---

'
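
The list above holds every block, so picking the one for time 5000 still depends on knowing its position. One possible way to select by time step instead is to capture the time value together with the block and build a dictionary. This is a sketch that assumes every `time:` line is followed by exactly one C-FXY ... E-END block; the name `blocks_by_time` and the shortened sample text are hypothetical.

```python
import re

# Shortened stand-in for the full text (hypothetical contents).
sample = """time:    150
C-FXY
first block
E-END
time:   5000
C-FXY
wanted block
E-END
time:  13000
C-FXY
last block
E-END
"""

# Group 1: the time value; group 2: everything between the markers.
pattern = re.compile(r"time:\s*(\d+)\b.*?C-FXY(.*?)E-END", re.DOTALL)

# Map each time step to its block, with surrounding whitespace removed.
blocks_by_time = {int(t): body.strip() for t, body in pattern.findall(sample)}

print(blocks_by_time[5000])
```

If a time step had no block of its own, the lazy `.*?` would run on into the next block, which is why the one-block-per-time-step assumption matters here.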

2 answers

solution1
0 2017-11-10 08:43:16

solution2
0 ACCEPTED 2017-11-23 07:30:00
