I am trying to get the shortest match for the pattern '''.*?''' is a [[.*?]]
for sentences such as
'''fermentation starter''' is a preparation to assist the beginning of the [[fermentation (biochemistry)|fermentation]]. A '''starter culture''' is a [[microbiological culture]]
which contains the target string
'''starter culture''' is a [[microbiological culture]]
The idea is to get the latter string. To do so, I am using the following Python code:
import re
regex = re.compile(r"'''.*?''' is a \[\[.*?\]\]")
re.findall(regex, line)
However, I am getting the full sentence instead of the shortest match. Note that I have added '?' after the quantifier to make the match non-greedy. I can also work around it using
re.findall(regex, line[30:])
to skip the first occurrence of '''.*?''', but I am looking for a more natural solution.
You can use this lookahead-based regex:
>>> print(re.findall(r"'''(?:(?!''').)*''' is a \[\[.*?\]\]", line))
["'''starter culture''' is a [[microbiological culture]]"]
(?:(?!''').)* matches zero or more characters, checking before each one that ''' does not start at that position. The text between the two ''' delimiters therefore cannot contain another ''', which forces the shortest match between them.
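To see why the lazy quantifier alone is not enough, here is a minimal sketch that runs both patterns on the sample sentence from the question. The variable names (lazy, tempered, and so on) are only illustrative. The regex engine reports the leftmost match first, so the lazy pattern starts at the first ''' and expands until the rest of the pattern fits, whereas the lookahead version cannot step over a ''' and has to start at the second bold phrase.

```python
import re

# Sample sentence from the question.
line = ("'''fermentation starter''' is a preparation to assist the beginning "
        "of the [[fermentation (biochemistry)|fermentation]]. "
        "A '''starter culture''' is a [[microbiological culture]]")

# Original pattern: lazy, but still anchored at the leftmost '''.
lazy = re.compile(r"'''.*?''' is a \[\[.*?\]\]")

# Lookahead version: the bold text may not contain ''' itself.
tempered = re.compile(r"'''(?:(?!''').)*''' is a \[\[.*?\]\]")

lazy_matches = lazy.findall(line)
tempered_matches = tempered.findall(line)

print(lazy_matches)      # a single match spanning the whole line
print(tempered_matches)  # only the "starter culture" sentence
```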
If you're sure that you will not have '[' between the ''' delimiters, a simple solution is this:
regex = re.compile(r"'''[^[]*?''' is a \[\[.*?\]\]")
regex.findall(line)
Or you could do the same thing but excluding ' instead:
regex = re.compile(r"'''[^']*''' is a \[\[.*?\]\]")
regex.findall(line)
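As a rough check of the two negated-class variants, the sketch below runs them on the sample sentence and on a second, made-up input (the "baker's yeast" string is hypothetical, not from the question). It illustrates the trade-off: excluding ' is simple, but it rejects bold text that contains an apostrophe, while excluding [ still accepts it.

```python
import re

line = ("'''fermentation starter''' is a preparation to assist the beginning "
        "of the [[fermentation (biochemistry)|fermentation]]. "
        "A '''starter culture''' is a [[microbiological culture]]")

no_bracket = re.compile(r"'''[^[]*?''' is a \[\[.*?\]\]")
no_quote = re.compile(r"'''[^']*''' is a \[\[.*?\]\]")

target = "'''starter culture''' is a [[microbiological culture]]"
bracket_result = no_bracket.findall(line)
quote_result = no_quote.findall(line)

# Hypothetical input with an apostrophe inside the bold text.
tricky = "'''baker's yeast''' is a [[fungus]]"
tricky_bracket = no_bracket.findall(tricky)  # still matches
tricky_quote = no_quote.findall(tricky)      # no match: [^'] stops at the apostrophe

print(bracket_result, quote_result)
print(tricky_bracket, tricky_quote)
```

If apostrophes can appear in the bold text, the lookahead-based pattern or the [^[] variant is the safer choice.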