I'd like to extract the substrings of a text that lie between two delimiter strings, where both delimiters are themselves partly defined by regular expressions.
So, from the following lines:
ALPHA101BETAsomething1GAMMA532DELTA
ALPHA231BETAsomething2GAMMA555DELTA
ALPHA341BETAagainsomethingsomethingGAMMA998DELTA
I'd like to get the following:
something1
something2
againsomethingsomething
My problem is that I cannot define the opening and closing expressions so that each consists of a fixed string, then three digits, then another fixed string.
So far I tried but failed with this:
re.findall("ALPHA(?:\d\.){3}BETA(.*?)GAMMA(?:\d\.){3}DELTA", pagetext)
How could I instruct the parser that a given regex match group is not the desired result but part of the opening/closing strings?
I modified the regex a little and now it works for me. You can use re.compile and re.search, then call group() on the resulting match object, to get the specific substring you are looking for:
import re
REGEX = re.compile(r'ALPHA(\d){3}BETA(.*?)GAMMA(\d){3}DELTA')
# The next part depends on how your pagetext is formatted.
# If you have newlines in the pagetext:
for line in pagetext.split('\n'):
    result = REGEX.search(line)
    if result:  # search() returns None when a line does not match
        your_desired_str = result.group(2)  # group 2 is the text between BETA and GAMMA
# If you just want to read the text line by line from a file:
with open(yourfile) as infile:
    for line in infile:
        result = REGEX.search(line)
        if result:
            your_desired_str = result.group(2)
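As a rough sketch of why the attempt in the question returns nothing: `(?:\d\.){3}` requires every digit to be followed by a literal dot, so it matches `1.0.1.` but not `101`. The non-capturing group `(?:...)` was already the right idea for keeping the delimiters out of the result; here it is simply replaced by `\d{3}`. With exactly one capturing group, re.findall returns only that group's text. The sample `pagetext` is copied from the question, and the names `BROKEN` and `FIXED` are only illustrative.

```python
import re

# Sample input copied from the question.
pagetext = (
    "ALPHA101BETAsomething1GAMMA532DELTA\n"
    "ALPHA231BETAsomething2GAMMA555DELTA\n"
    "ALPHA341BETAagainsomethingsomethingGAMMA998DELTA"
)

# The pattern from the question: "\d\." demands a dot after each digit.
BROKEN = re.compile(r"ALPHA(?:\d\.){3}BETA(.*?)GAMMA(?:\d\.){3}DELTA")

# Three plain digits; the only capturing group is the text we want.
FIXED = re.compile(r"ALPHA\d{3}BETA(.*?)GAMMA\d{3}DELTA")

broken_matches = BROKEN.findall(pagetext)
matches = FIXED.findall(pagetext)

print(broken_matches)  # []
print(matches)         # ['something1', 'something2', 'againsomethingsomething']
```

Because `.` does not match a newline by default, the lazy `(.*?)` cannot run across line boundaries, so the whole text can be searched in one call without splitting it first.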
This will work for you:
import re
text = 'ALPHA101BETAsomething1GAMMA532DELTA\nALPHA231BETAsomething2GAMMA555DELTA\nALPHA341BETAagainsomethingsomethingGAMMA998DELTA'
for line in text.split('\n'):
    # findall returns only the captured group; [0] takes the first match on the line
    print(re.findall(r'ALPHA\d+BETA(.*?)GAMMA\d+DELTA', line)[0])
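Another way to tell the regex engine that the opening and closing parts are context rather than result is to use lookaround assertions, so the match itself contains only the wanted text. This is a minimal sketch, not taken from the answers above. It relies on the opening delimiter having a fixed width (5 letters, 3 digits, 4 letters), since Python's re module only accepts fixed-width lookbehinds. The name `LOOKAROUND` is just an illustration.

```python
import re

# Sample input copied from the question.
pagetext = (
    "ALPHA101BETAsomething1GAMMA532DELTA\n"
    "ALPHA231BETAsomething2GAMMA555DELTA\n"
    "ALPHA341BETAagainsomethingsomethingGAMMA998DELTA"
)

# (?<=...) asserts what precedes the match, (?=...) what follows it.
# Neither assertion consumes characters, so no capturing group is needed.
LOOKAROUND = re.compile(r"(?<=ALPHA\d{3}BETA).*?(?=GAMMA\d{3}DELTA)")

found = LOOKAROUND.findall(pagetext)
print(found)  # ['something1', 'something2', 'againsomethingsomething']
```

If the number of digits were variable (for example `\d+`), the lookbehind would no longer be fixed-width and re.compile would raise an error; in that case a single capturing group is the simpler choice.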