How to get the shortest match with Python (complex non-greedy pattern)

I am trying to get the shortest match of the pattern '''.*?''' is a [[.*?]] for sentences such as

'''fermentation starter''' is a preparation to assist the beginning of the [[fermentation (biochemistry)|fermentation]]. A '''starter culture''' is a [[microbiological culture]]

which contains the target string

 '''starter culture''' is a [[microbiological culture]]

The idea is to get the latter string. To do so, I am using the following Python code:

import re

regex = re.compile(r"'''.*?''' is a \[\[.*?\]\]")
re.findall(regex, line)

However, I am getting the full sentence instead of the shortest match. Note that I have added '?' after the quantifier to make the match non-greedy. I can also work around it using

re.findall(regex, line[30:])

to skip the first occurrence of '''.*?''', but I am looking for a more natural solution.

You can use this lookahead-based regex:

>>> print(re.findall(r"'''(?:(?!''').)*''' is a \[\[.*?\]\]", line))
["'''starter culture''' is a [[microbiological culture]]"]

(?:(?!''').)* matches zero or more characters, each one consumed only if ''' does not start at that position. The match therefore cannot run past a ''' delimiter, which gives the shortest span between two ''' markers.
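
As a minimal sketch of the difference, the snippet below runs the lazy pattern from the question and the lookahead pattern on the sample sentence. The variable names (lazy, tempered, lazy_matches, tempered_matches) are only illustrative. The lazy version still starts at the leftmost ''' and lets .*? grow across the closing ''' until the rest of the pattern fits, so it returns the whole line.

```python
import re

line = ("'''fermentation starter''' is a preparation to assist the beginning "
        "of the [[fermentation (biochemistry)|fermentation]]. "
        "A '''starter culture''' is a [[microbiological culture]]")

# Lazy pattern from the question: .*? is allowed to cross a closing '''.
lazy = re.compile(r"'''.*?''' is a \[\[.*?\]\]")

# Lookahead pattern: each character is consumed only if ''' does not start there.
tempered = re.compile(r"'''(?:(?!''').)*''' is a \[\[.*?\]\]")

lazy_matches = lazy.findall(line)
tempered_matches = tempered.findall(line)

print(lazy_matches == [line])  # True: the match starts at the first '''
print(tempered_matches)        # only the '''starter culture''' ... part
```

Non-greedy quantifiers shorten where a match ends, not where it starts, which is why the lookahead is needed here.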

If you are sure that '[' never appears between the ''' markers, a simpler solution is:

regex = re.compile(r"'''[^\[]*?''' is a \[\[.*?\]\]")
regex.findall(line)

Or you could do the same thing, excluding ' instead:

regex = re.compile(r"'''[^']*''' is a \[\[.*?\]\]")
regex.findall(line)
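
The two character-class variants rest on different assumptions, so here is a rough comparison. Both work on the sample sentence, but each has an input it mishandles. The two extra sentences (apostrophe, no_link) are hypothetical, made up only to show those cases, and the names by_bracket and by_quote are illustrative.

```python
import re

by_bracket = re.compile(r"'''[^\[]*?''' is a \[\[.*?\]\]")
by_quote = re.compile(r"'''[^']*''' is a \[\[.*?\]\]")

line = ("'''fermentation starter''' is a preparation to assist the beginning "
        "of the [[fermentation (biochemistry)|fermentation]]. "
        "A '''starter culture''' is a [[microbiological culture]]")
target = "'''starter culture''' is a [[microbiological culture]]"

# Both variants find the target in the sample sentence.
print(by_bracket.findall(line) == [target])  # True
print(by_quote.findall(line) == [target])    # True

# Hypothetical input: an apostrophe inside the bold text defeats [^']*.
apostrophe = "'''baker's yeast''' is a [[fungus]]"
print(by_quote.findall(apostrophe))    # []
print(by_bracket.findall(apostrophe))  # the whole sentence

# Hypothetical input: no '[' between the two bold phrases, so [^\[]*?
# can run from the first ''' all the way to the second phrase.
no_link = "'''yeast''' is a fungus. A " + target
print(by_bracket.findall(no_link) == [no_link])  # True: overlong match
print(by_quote.findall(no_link) == [target])     # True
```

If neither assumption is safe for your data, the lookahead-based pattern is the more general choice.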
