How to get the shortest match with Python (complex non-greedy pattern)

Question

I am trying to get the shortest match for the pattern '''.*?''' is a [[.*?]] in sentences such as

'''fermentation starter''' is a preparation to assist the beginning of the [[fermentation (biochemistry)|fermentation]]. A '''starter culture''' is a [[microbiological culture]]

which contains the target string

 '''starter culture''' is a [[microbiological culture]]

The idea is to get the latter string. To do so, I am using the following Python code:

import re
regex = re.compile(r"'''.*?''' is a \[\[.*?\]\]")  # raw string, so \[ reaches the regex engine unchanged
re.findall(regex, line)

However, I am getting the full sentence instead of the shortest match. Note that I have added '?' after the quantifier to make the match non-greedy. I can also work around it using

re.findall(regex, line[30:])

in order to skip the first occurrence of '''.*?''', but I am looking for a more natural solution.

Answer 1

You can use this lookahead-based regex:

>>> print(re.findall(r"'''(?:(?!''').)*''' is a \[\[.*?\]\]", line))
["'''starter culture''' is a [[microbiological culture]]"]

(?:(?!''').)* matches zero or more characters, each of which is not the start of a ''' sequence. The text between the two ''' delimiters therefore cannot contain another ''', which forces the shortest span between two ''' markers.
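
To see why the lazy quantifier alone is not enough, here is a minimal sketch that runs both patterns on the sentence from the question. The regex engine tries the leftmost starting position first, and .*? only keeps the match as short as possible from that position, so it expands past the first closing ''' until the rest of the pattern fits. The variable names (line, lazy, tempered) are illustrative and not part of the original posts.

```python
import re

line = ("'''fermentation starter''' is a preparation to assist the beginning "
        "of the [[fermentation (biochemistry)|fermentation]]. "
        "A '''starter culture''' is a [[microbiological culture]]")

# Lazy quantifier only: the match starts at the first ''' and grows until
# the rest of the pattern fits, so it spans the whole sentence.
lazy = re.compile(r"'''.*?''' is a \[\[.*?\]\]")

# Tempered dot: each consumed character must not begin a ''' sequence,
# so the text between the delimiters can never contain '''.
tempered = re.compile(r"'''(?:(?!''').)*''' is a \[\[.*?\]\]")

lazy_matches = lazy.findall(line)
tempered_matches = tempered.findall(line)

print(lazy_matches == [line])   # True: the lazy pattern returns the full sentence
print(tempered_matches)
```

The tempered pattern fails at the first ''' (what follows " is a " there is not [[) and cannot cross the next ''', so the engine moves on and only succeeds at '''starter culture'''.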


Answer 2

If you're sure that '[' will never appear between the ''' delimiters, a simple solution is this:

regex = re.compile(r"'''[^[]*?''' is a \[\[.*?\]\]")
regex.findall(line)

Or, if no ' appears between the ''' delimiters, you could exclude that character instead:

regex = re.compile(r"'''[^']*''' is a \[\[.*?\]\]")
regex.findall(line)
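
As a rough check, the sketch below runs both negated-class variants on the sentence from the question, and then on a second input that shows the trade-off between them. The second input (tricky) is a made-up example, not from the original posts, and the names no_bracket and no_quote are illustrative.

```python
import re

line = ("'''fermentation starter''' is a preparation to assist the beginning "
        "of the [[fermentation (biochemistry)|fermentation]]. "
        "A '''starter culture''' is a [[microbiological culture]]")

# Variant 1: the text between the ''' delimiters may not contain '['.
no_bracket = re.compile(r"'''[^[]*?''' is a \[\[.*?\]\]")
# Variant 2: the text between the ''' delimiters may not contain "'".
no_quote = re.compile(r"'''[^']*''' is a \[\[.*?\]\]")

expected = "'''starter culture''' is a [[microbiological culture]]"
bracket_result = no_bracket.findall(line)
quote_result = no_quote.findall(line)
print(bracket_result == [expected])
print(quote_result == [expected])

# Hypothetical input with an apostrophe inside the quoted term.
tricky = "'''baker's yeast''' is a [[fungus]]"
tricky_bracket = no_bracket.findall(tricky)   # still matches
tricky_quote = no_quote.findall(tricky)       # the apostrophe blocks the match
print(tricky_bracket)
print(tricky_quote)
```

Both variants agree on the original sentence. The [^'] version is stricter and drops terms that contain an apostrophe, while the [^[] version relies on no '[' appearing before the closing '''.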

2 answers

solution1
2 ACCEPTED 2016-01-20 17:35:34

solution2
0 2016-01-20 17:51:30
