
Regex in Python: extract a multiline block from a text with repeating, similar sections

Thanks in advance for the help. I am using Python regular expressions to extract a part from a text which has the following layout:

(A lot of information)

time:    150

C-FXY

-- information ---

E-END

(A lot of information)

time:   5000

C-FXY

**--- INFORMATION I WANT TO EXTRACT ---**

E-END

(A lot of information)

time:  13000

C-FXY

-- information ---

E-END

(A lot of information)

I need to extract everything between C-FXY and E-END for the time step 5000. For that I am using the following Python 3.6 statement:

time_step = '5000'
text_part = re.search(r'time.*'+time_step+'.*C-FXY(.*?)E-END', text, re.DOTALL).group(1)

Unfortunately, the output is the section between C-FXY and E-END from the 13000 time step, not the one I want from time: 5000.

Any help would be much appreciated. :)

The problem is that your regex contains a greedy .* between the time step and C-FXY. With re.DOTALL it consumes as much text as possible, so the match only settles on the last C-FXY ... E-END block in the text.

Using the non-greedy version (.*?) at that point should be enough:

text_part = re.search(r'time.*'+time_step+'.*?C-FXY(.*?)E-END', text, re.DOTALL).group(1)
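One caveat with the line above: the leading time.* is still greedy, so if the digits 5000 show up again later in the text (for example inside 15000), the match can start from that later occurrence. A slightly stricter sketch follows. It assumes each time step sits on its own line as "time:" followed by spaces and the number; the helper name extract_block and the shortened sample text are made up for illustration.

```python
import re

# Shortened stand-in for the real text (assumed layout: one "time: N" line per step).
sample = """time:    150
C-FXY
-- information ---
E-END
time:   5000
C-FXY
**--- INFORMATION I WANT TO EXTRACT ---**
E-END
time:  13000
C-FXY
-- information ---
E-END
"""

def extract_block(text, time_step):
    """Return the text between C-FXY and E-END for one time step, or None."""
    pattern = (
        r'^time:[ \t]*' + re.escape(str(time_step)) + r'[ \t]*$'  # the exact "time: N" line
        + r'.*?C-FXY(.*?)E-END'  # lazily reach the first block after that line
    )
    # MULTILINE makes ^ and $ match at line boundaries; DOTALL lets "." cross newlines.
    match = re.search(pattern, text, re.DOTALL | re.MULTILINE)
    return match.group(1) if match else None

block = extract_block(sample, 5000)
```

Anchoring the number between ^ and $ means 5000 no longer matches inside 15000, and re.escape keeps any special characters in the time step from being read as regex syntax.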

That said, I would not run a multiline search over the whole file here. I would read the file line by line up to the time: 5000 line, then up to the next C-FXY line, store everything from there up to the E-END line, and stop processing at that point.
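The line-by-line idea can be sketched as a small state machine. This is only one possible reading of it: the function name extract_block_by_lines is hypothetical, and it assumes the markers C-FXY and E-END each stand alone on a line.

```python
def extract_block_by_lines(lines, time_step, start='C-FXY', end='E-END'):
    """Collect the lines between `start` and `end` that follow 'time: <time_step>'."""
    state = 'seek_time'  # then 'seek_start', then 'collect'
    collected = []
    for line in lines:
        stripped = line.strip()
        if state == 'seek_time':
            # Compare the whole number so that 5000 does not match 15000.
            if stripped.startswith('time:') and stripped[5:].strip() == str(time_step):
                state = 'seek_start'
        elif state == 'seek_start':
            if stripped == start:
                state = 'collect'
        else:  # state == 'collect'
            if stripped == end:
                return collected  # stop as soon as the block is complete
            collected.append(line.rstrip('\n'))
    return None  # time step, start marker, or end marker not found

# Shortened stand-in for the real text, used here instead of a file.
sample = """time:    150
C-FXY
-- information ---
E-END
time:   5000
C-FXY
**--- INFORMATION I WANT TO EXTRACT ---**
E-END
time:  13000
C-FXY
-- information ---
E-END
"""

block = extract_block_by_lines(sample.splitlines(), 5000)
```

Because the function only iterates over lines, an open file object can be passed in place of sample.splitlines(), and reading stops at the E-END line instead of loading the whole file into memory.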

You can solve it using the following code:

import re

text = """(A lot of information)

time:    150

C-FXY

-- information ---

E-END

(A lot of information)

time:   5000

C-FXY

**--- INFORMATION I WANT TO EXTRACT ---**

E-END

(A lot of information)

time:  13000

C-FXY

-- information ---

E-END

(A lot of information)"""

# re.DOTALL lets "." match newlines, so a block can span several lines.
pattern = re.compile(r"C-FXY(.*?)E-END", re.DOTALL)

results = pattern.findall(text)

Now, if you print the results:

for i, r in enumerate(results):
    print(f"Result {i}:\n'{r}'")

The output would be:

Result 0:
'

-- information ---

'
Result 1:
'

**--- INFORMATION I WANT TO EXTRACT ---**

'
Result 2:
'

-- information ---

'
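Note that findall returns every block, so the one for time: 5000 still has to be picked by position. A possible variation, sketched here under the assumption that every time step is an integer and is followed by exactly one C-FXY ... E-END block, captures the time step as well and builds a dictionary. The names BLOCK_RE and blocks_by_time are made up for this example.

```python
import re

BLOCK_RE = re.compile(
    r'^time:[ \t]*(\d+)[ \t]*$'  # group 1: the time step number on its own line
    r'.*?C-FXY(.*?)E-END',       # group 2: the block that follows it
    re.DOTALL | re.MULTILINE,
)

def blocks_by_time(text):
    """Map each time step (as int) to the text between its C-FXY and E-END."""
    return {int(step): body for step, body in BLOCK_RE.findall(text)}

# Shortened stand-in for the real text.
sample = """time:    150
C-FXY
-- information ---
E-END
time:   5000
C-FXY
**--- INFORMATION I WANT TO EXTRACT ---**
E-END
time:  13000
C-FXY
-- information ---
E-END
"""

blocks = blocks_by_time(sample)
wanted = blocks[5000]
```

If a time step had no block of its own, the lazy .*? would run on to the next block and pair it with the wrong time step, so this only fits files where the layout is regular.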
