
Python regular expression, repeating data

This seems like a simple task but I have sunk enough time into this to finally ask for help:

I have a long text file in roughly this format:

Start of test xyz:

multiple lines of blah blah blah

Start of test wzy:

multiple lines of blah blah blah

Start of test qqq:

multiple lines of blah blah blah

I want to grab all the stuff after the "Start of test" declaration, and this expression gets me about half of what I need:

re.findall(r'Start of test(.+?)Start of test', curfile, re.S)

The most obvious issue is that I'm consuming the start of what I need to search for next, so I get roughly half of the results I wanted. Even if I could avoid that, I still can't figure out how to get the last chunk, where there is no "Start of test" to end the match on.

I assume I need to use negative lookahead assertions, but I am not having much luck figuring out the proper way to use them. I've been trying things like:

re.findall(r'Start of test(.+?)(?!Start of test)', curfile, re.S)

which gives no useful results.

I think this is the pattern you are looking for

Start of test(.+?)(?=Start of test|$)

Then your new code should be

re.findall(r'Start of test(.+?)(?=Start of test|$)', curfile, re.S)
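
As a rough illustration of how the lookahead with the |$ alternative behaves, here is a minimal sketch on a made-up string. The sample text and the variable names are assumptions for illustration only; in the question the input would be curfile.

```python
import re

# Hypothetical sample text standing in for the contents of curfile.
sample = (
    "Start of test xyz:\n"
    "line a\n"
    "line b\n"
    "Start of test wzy:\n"
    "line c\n"
    "Start of test qqq:\n"
    "line d\n"
)

# (?=...) checks that the next marker follows without consuming it,
# and |$ lets the last chunk end at the end of the string.
pattern = r'Start of test(.+?)(?=Start of test|$)'
chunks = re.findall(pattern, sample, re.S)

for chunk in chunks:
    print(repr(chunk))
```

All three sections come back. Without re.M, $ also matches just before a trailing newline at the very end of the string, so the last chunk here comes back without that final newline.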


You want a lookahead pattern. See https://docs.python.org/2/library/re.html where it describes (?= ... ) :

(?=...)
Matches if ... matches next, but doesn't consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it's followed by 'Asimov' .

So for your case:

re.findall(r'Start of test(.+?)(?=Start of test)', curfile, re.S)

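Note that the quantifier must stay non-greedy (.+?): a greedy .+ would run on to the last "Start of test" and return everything up to it as a single match. As written, this pattern also skips the final chunk, because no "Start of test" follows it for the lookahead to match; adding |$ inside the lookahead covers that case.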
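
To show the difference between the non-greedy and greedy quantifiers with this lookahead, and what happens to the final chunk, here is a small sketch on a made-up string (the sample text and variable names are assumptions for illustration):

```python
import re

# Hypothetical input with three sections.
sample = "Start of test a:\n1\nStart of test b:\n2\nStart of test c:\n3\n"

# Non-greedy: each match stops at the nearest following marker.
lazy = re.findall(r'Start of test(.+?)(?=Start of test)', sample, re.S)

# Greedy: the match extends to the last marker in the string.
greedy = re.findall(r'Start of test(.+)(?=Start of test)', sample, re.S)

print(lazy)    # two chunks; the last section has no marker after it
print(greedy)  # one chunk spanning the first two sections
```

In both cases the third section is missing, because the lookahead requires another "Start of test" after it.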

It might be more useful to use re.finditer to get an iterable of match objects, and then use mo.start(0) on each match object to find out where in the original string the current match is. Then, you can recover everything in between matches in the following way -- notice that my pattern only matches a single "Start of test" line:

import re

pattern = r'^Start of test (.*):$'
matches = re.finditer(pattern, curfile, re.M)
i = 0  # where the last match ended
names = []
in_between = []
for mo in matches:
    j = mo.start(0)
    in_between.append(curfile[i:j])  # store what came before this match
    i = mo.end(0)  # store the new "end of match" position
    names.append(mo.group(1))  # store the matched name
in_between.append(curfile[i:])  # store the rest of the file

# in_between[0] is what came before the first test
chunks = in_between[1:]
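
For convenience, the same finditer approach could be wrapped in a function that pairs each test name with its text. This is only a sketch: the function name split_tests and the sample string are hypothetical, not part of the original code.

```python
import re

def split_tests(text):
    """Return (name, body) pairs, one per 'Start of test <name>:' line."""
    pattern = re.compile(r'^Start of test (.*):$', re.M)
    names = []
    in_between = []
    i = 0  # where the last match ended
    for mo in pattern.finditer(text):
        in_between.append(text[i:mo.start(0)])  # text before this header
        i = mo.end(0)
        names.append(mo.group(1))
    in_between.append(text[i:])  # the rest of the text
    # in_between[0] is whatever preceded the first header, so skip it.
    return list(zip(names, in_between[1:]))

# Hypothetical input; in the question this would be curfile.
sample = "header\nStart of test xyz:\nline a\nStart of test wzy:\nline b\n"
pairs = split_tests(sample)
print(pairs)
```

Each body starts with the newline that ended the header line, since the pattern matches only up to the colon; calling .strip() on the bodies removes it if it is not wanted.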
