Python: how to extract the text between two known words in a string?

How can I extract the text between two known words in a string, with the condition that the text between these words is (i) one character, (ii) one word, (iii) two words, and so on?

Sample Text:

text = ("MNOTES - GEO GEO MNOTES 20 231-0005 GEO GEO GEO GEO GEO MNOTES SOME REVISION MNOTES CASUAL C GEO GEO GEO GEO GEO MNOTES F232322500 MNOTES HELP PAGES GEO GEO GEO GEO MNOTES SHEET 1 OF 3 GEO GEO MNOTES CASUAL E. GEO GEO MNOTES SITPOPE/TIN AY GEO GEO MNOTES R GEO GEO GEO GEO MNOTES 22+0436/T.SKI/11-AUG-1986 GEO GEO GEO GEO MNOTES 231-0045 GEO")

I have a string like the one above, with multiple occurrences of the two known words 'MNOTES' and 'GEO'. The text between them can be anything and any number of words.

Sometimes I want to extract only the text that is a single character between those two known words, sometimes only the text that is two words, sometimes only the text that is six words, and so on. How can I extract the text together with such a condition?

Use re.findall with a non-greedy group, so that each match stops at the first GEO that follows an MNOTES.

import re

re.findall('MNOTES(.*?)GEO', text)

This results in:

[' - ', ' 20 231-0005 ', ' SOME REVISION MNOTES CASUAL C ', ' F232322500 MNOTES HELP PAGES ', ' SHEET 1 OF 3 ', ' CASUAL E. ', ' SITPOPE/TIN AY ', ' R ', ' 22+0436/T.SKI/11-AUG-1986 ', ' 231-0045 ']
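The question asks for a condition on the number of words rather than on the raw match. One way to get there, sketched below, is to keep the pattern above and filter its results afterwards. The helper names `segments_with_word_count` and `single_character_segments` are made up for this illustration, and "word" is assumed to mean a whitespace-separated token.

```python
import re

text = (
    "MNOTES - GEO GEO MNOTES 20 231-0005 GEO GEO GEO GEO GEO "
    "MNOTES SOME REVISION MNOTES CASUAL C GEO GEO GEO GEO GEO "
    "MNOTES F232322500 MNOTES HELP PAGES GEO GEO GEO GEO "
    "MNOTES SHEET 1 OF 3 GEO GEO MNOTES CASUAL E. GEO GEO "
    "MNOTES SITPOPE/TIN AY GEO GEO MNOTES R GEO GEO GEO GEO "
    "MNOTES 22+0436/T.SKI/11-AUG-1986 GEO GEO GEO GEO MNOTES 231-0045 GEO"
)


def segments_with_word_count(text, n_words):
    """Return the stripped MNOTES...GEO segments made of exactly n_words words.

    Hypothetical helper: a word is assumed to be a whitespace-separated token.
    """
    segments = re.findall(r"MNOTES(.*?)GEO", text)
    return [s.strip() for s in segments if len(s.split()) == n_words]


def single_character_segments(text):
    """Return the segments that are exactly one character long once stripped."""
    segments = re.findall(r"MNOTES(.*?)GEO", text)
    return [s.strip() for s in segments if len(s.strip()) == 1]


print(segments_with_word_count(text, 1))
# ['-', 'R', '22+0436/T.SKI/11-AUG-1986', '231-0045']
print(segments_with_word_count(text, 2))
# ['20 231-0005', 'CASUAL E.', 'SITPOPE/TIN AY']
print(single_character_segments(text))
# ['-', 'R']
```

Filtering in Python keeps the regular expression simple and makes the condition (word count, character count, or anything else) easy to change.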

Edit

To get a specific number of characters, the following will work (the pattern is written as a raw string so that \s is not treated as a string escape):

re.findall(r'MNOTES\s?(.{1})\s?GEO', text)

This results in:

['-', 'R']

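And to get only results that are 6 to 8 characters long:

re.findall(r'MNOTES\s?(.{6,8})\s?GEO', text)

Results (the dot also matches the letters of GEO itself, so '- GEO ' and 'R GEO ' are really the one-character segments followed by an extra GEO):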

['- GEO ', 'CASUAL C', 'R GEO ', '231-0045']
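If the captured text should never run through one of the two keywords, a negative lookahead can stop the dot from stepping onto them. The sketch below is one possible variant, not part of the original answer. It assumes that a segment may not contain MNOTES or GEO as whole words, which means that text followed by a second MNOTES (such as SOME REVISION) is skipped. The names `SEGMENT` and `between_keywords` are made up for this illustration.

```python
import re

text = (
    "MNOTES - GEO GEO MNOTES 20 231-0005 GEO GEO GEO GEO GEO "
    "MNOTES SOME REVISION MNOTES CASUAL C GEO GEO GEO GEO GEO "
    "MNOTES F232322500 MNOTES HELP PAGES GEO GEO GEO GEO "
    "MNOTES SHEET 1 OF 3 GEO GEO MNOTES CASUAL E. GEO GEO "
    "MNOTES SITPOPE/TIN AY GEO GEO MNOTES R GEO GEO GEO GEO "
    "MNOTES 22+0436/T.SKI/11-AUG-1986 GEO GEO GEO GEO MNOTES 231-0045 GEO"
)

# Each character of the captured group is only accepted if neither keyword
# starts at that position, so a match can never cross MNOTES or GEO.
SEGMENT = re.compile(r"\bMNOTES\b\s+((?:(?!\bMNOTES\b|\bGEO\b).)+?)\s+\bGEO\b")


def between_keywords(text, min_len=1, max_len=None):
    """Return keyword-free segments whose length lies in [min_len, max_len].

    Hypothetical helper: max_len=None means no upper bound.
    """
    return [
        s
        for s in SEGMENT.findall(text)
        if len(s) >= min_len and (max_len is None or len(s) <= max_len)
    ]


print(between_keywords(text, 1, 1))
# ['-', 'R']
print(between_keywords(text, 6, 8))
# ['CASUAL C', '231-0045']
```

Whether skipping the text in front of a repeated MNOTES is the right behaviour depends on the data, so treat that rule as an assumption to adjust.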
