I'm still relatively new to regex. I'm trying to find the shortest string of text that matches a particular pattern, but am having trouble if the shortest pattern is a substring of a larger match. For example:
import re
string = "A|B|A|B|C|D|E|F|G"
my_pattern = 'a.*?b.*?c'
my_regex = re.compile(my_pattern, re.DOTALL|re.IGNORECASE)
matches = my_regex.findall(string)
for match in matches:
print match
prints:
A|B|A|B|C
but I'd want it to return:
A|B|C
Is there a way to do this without having to loop over each match to see if it contains a substring that matches?
Contrary to most other answers here, this can be done in a single regex using a positive lookahead assertion with a capturing group :
>>> my_pattern = '(?=(a.*?b.*?c))'
>>> my_regex = re.compile(my_pattern, re.DOTALL|re.IGNORECASE)
>>> matches = my_regex.findall(string)
>>> print min(matches, key=len)
A|B|C
findall()
will return all possible matches, so you need min()
to get the shortest one.
How this works:
No. Perl returns the longest, leftmost match, while obeying your non-greedy quantifiers. You'll have to loop, I'm afraid.
Edit: Yes, I realize I said Perl above, but I believe it is true for Python.
This might be a useful application of sexegers . Regular-expression matching is biased toward the longest, leftmost choice. Using non-greedy quantifiers such as in .*?
skirts the longest part, and reversing both the input and pattern can get around leftmost-matching semantics.
Consider the following program that outputs A|B|C
as desired:
#! /usr/bin/env python
import re
string = "A|B|A|B|C|D|E|F|G"
my_pattern = 'c.*?b.*?a'
my_regex = re.compile(my_pattern, re.DOTALL|re.IGNORECASE)
matches = my_regex.findall(string[::-1])
for match in matches:
print match[::-1]
Another way is to make a stricter pattern. Say you don't want to allow repetitions of characters already seen:
my_pattern = 'a[^a]*?b[^ab]*?c'
Your example is generic and contrived, but if we had a better idea of the inputs you're working with, we could offer better, more helpful suggestions.
Another regex solution; it finds only the last occurence of .*a.*b.*c:
my_pattern = 'a(?!.*a.*b.*c).*b[^c]*c'
a(?!.*a.*?b.*?c)
ensures that there is no 'a.*?b.*?c'
after first 'A' strings like A|A|B|C or A|B|A|B|C or A|B|C|A|B|C in results are eliminated
b[^c]*c
ensures that after 'B' there is only one 'C' strings like A|B|C|B|C or A|B|C|C in results are eliminated
So you have the smallest matching 'a.*?b.*?c'
The regex engine starts searching from the beginning of the string till it finds a match and then exits. Thus if it finds a match before it even considers the smaller one, there is no way for you to force it to consider later matches in the same run - you will have to rerun the regex on substrings.
Setting the global flag and choosing the shortest matched string won't help as it is evident from your example - the shorter match might be a substring of another match (or partly included in it). I believe you will have to start subsequent searches from (1 + index of previous match) and go on like that.
I do not think that this task can be accomplished by a single regular expression. I have no proof that this is the case, but there are quite a lot of things that can't be done with regexes and I expected this problem to be one of them. Some good examples of the limitations of regexes are given in this blog post .
You might be able to write the regex in such a way that it can't contain smaller matches.
For your regex:
a.*?b.*?c
I think you can write this:
a[^ab]*b[^c]*c
It's tricky to get that correct, and I don't see any more general or more obviously correct way to do it. (Edit—earlier I suggested a negative lookahead assertion, but I don't see a way to make that work.)
A Python loop to look for the shortest match, by brute force testing each substring from left to right, picking the shortest:
shortest = None
for i in range(len(string)):
m = my_regex.match(string[i:])
if m:
mstr = m.group()
if shortest is None or len(mstr) < len(shortest):
shortest = mstr
print shortest
Another loop, this time letting re.findall do the hard work of searching for all possible matches, then brute force testing each match right-to-left looking for a shorter substring:
# find all matches using findall
matches = my_regex.findall(string)
# for each match, try to match right-hand substrings
shortest = None
for m in matches:
for i in range(-1,-len(m),-1):
mstr = m[i:]
if my_regex.match(mstr):
break
else:
mstr = m
if shortest is None or len(mstr) < len(shortest):
shortest = mstr
print shortest
No, there isn't in the Python regular expression engine.
My take for a custom function, though:
import re, itertools
# directly from itertools recipes
def pairwise(iterable):
"s -> (s0,s1), (s1,s2), (s2, s3), ..."
a, b = itertools.tee(iterable)
for elem in b:
break
return itertools.izip(a, b)
def find_matches(rex, text):
"Find all matches, even overlapping ones"
matches= list(rex.finditer(text))
# first produce typical matches
for match in matches:
yield match.group(0)
# next, run it for any patterns included in matches
for match1, match2 in pairwise(matches):
subtext= text[match1.start()+1:match2.end()+1]
for result in find_matches(rex, subtext):
yield result
# also test the last match, if there was at least one
if matches:
subtext= text[matches[-1].start()+1:matches[-1].end()+1]
# perhaps the previous "matches[-1].end()+1" can be omitted
for result in find_matches(rex, subtext):
yield result
def shortest_match(rex, text):
"Find the shortest match"
return min(find_matches(rex, text), key=len)
if __name__ == "__main__":
pattern= re.compile('a.*?b.*?c', re.I)
searched_text= "A|B|A|B|C|D|E|F|G"
print (shortest_match(pattern, searched_text))
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.