How to fix non-greedy regular expression

import re

word = "\W*?[^,\t ]*?\W*?"
quotedSelectedWord = "\W*?\"(.*?)\"\W*?"
leftCurlyBrace = "\W*?\{\W*?"
rightCurlyBrace = "\W*?\}\W*?"
expression = leftCurlyBrace + word + "," + quotedSelectedWord

p = re.compile(expression)

for line in sourceFileList:
    line = line.strip()
    if p.match(line):
        temp1 = p.sub(r"\1", line)
        print("temp1 = " + temp1 + "\n")

If the first line is (no actual single quotes): '{_blah_blah, "blah-blah", "blah blah blah", false, false, {_blah}, ""},'

Why is temp1 = 'blah-blah, "blah blah blah", false, false},'?

I thought it would be equivalent to the first "group" enclosed in parentheses, which I thought would be 'blah-blah'.

The regular expression finds the pattern not once but twice.

The first one it finds is:

{_blah_blah, "blah-blah"

In that match, group(1) (the part you put in parentheses above) is blah-blah, as you determined, and the substitution replaces this first part of the string with it.

But it finds the pattern here too:

, {_blah}, ""

Here group(1), which is the lazy .*? between the two quotes, matches an empty string. So that whole part of the string is replaced with nothing, effectively removing it.
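Both matches can also be listed directly from Python. The following is a minimal sketch in Python 3 syntax, assuming the pattern and the sample line from the question; the snake_case names and the matches list are only illustrative and do not appear in the original code.

```python
import re

# Pattern from the question, rebuilt with raw strings.
word = r"\W*?[^,\t ]*?\W*?"
quoted_selected_word = r'\W*?"(.*?)"\W*?'
left_curly_brace = r"\W*?\{\W*?"
expression = left_curly_brace + word + "," + quoted_selected_word
p = re.compile(expression)

line = '{_blah_blah, "blah-blah", "blah blah blah", false, false, {_blah}, ""},'

# Collect (whole match, group 1) for every non-overlapping match in the line.
matches = [(m.group(0), m.group(1)) for m in p.finditer(line)]
for whole, first_group in matches:
    print(repr(whole), "->", repr(first_group))
```

The second pair has an empty group 1, which is why that stretch of the line disappears after the substitution.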

An online regex tester helped me sort this out; it shows both of these matches being found.

Update

This website is even more helpful in parsing the regex: http://regex101.com/#python

At this site, enter the regular expression. Importantly, also enter the g modifier to the right of it so that all matches are reported, not just the first. Next enter the test string and a substitution of \\1. The page then shows the matches and the substituted result. Finally, click "regex debugger" on the left.

If you expand this section you'll be able to see exactly how it found the two matches.

The Python documentation for re.sub(pattern, repl, string, count=0, flags=0) states:

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.

If we rewrite the for loop a bit:

for line in sourceFileList:
    line = line.strip()
    match = p.match(line)
    if match:
        # group() is the whole match; group(1) is only the parenthesized part
        print("whole match = " + match.group())
        print("first group = " + match.group(1))
        temp1 = p.sub(r"\1", line)
        print("temp1 = " + temp1 + "\n")

we get the output:

whole match = {_blah_blah, "blah-blah"
first group = blah-blah
temp1 = blah-blah, "blah blah blah", false, false},

This means the matched prefix {_blah_blah, "blah-blah" is replaced by blah-blah, while everything after it in your original string is still there: , "blah blah blah", false, false, {_blah}, ""},

If you just want to get the first capture group you can use group(1) as demonstrated above.

Edit:

As twasbrillig's answer points out, there are actually two replacements. When re.sub is called with count=0, or with the count parameter omitted, every occurrence of the pattern is replaced, not only the first one.
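If only the first occurrence should be replaced, one option is to pass count=1; if only the captured word is wanted, match.group(1) avoids substitution altogether. A minimal sketch in Python 3 syntax, assuming the pattern and sample line from the question (the variable names are illustrative):

```python
import re

# Same pattern as in the question, written as a single raw string.
p = re.compile(r'\W*?\{\W*?' + r'\W*?[^,\t ]*?\W*?' + "," + r'\W*?"(.*?)"\W*?')

line = '{_blah_blah, "blah-blah", "blah blah blah", false, false, {_blah}, ""},'

# Default count=0: every non-overlapping match is replaced (two here).
replaced_all = p.sub(r"\1", line)

# count=1: only the leftmost match is replaced.
replaced_first = p.sub(r"\1", line, count=1)

# No substitution at all: read the capture group from the match object.
m = p.match(line)
first_group = m.group(1) if m else None

print(replaced_all)
print(replaced_first)
print(first_group)
```

With count=1 the later {_blah}, "" part of the line survives, and first_group holds just blah-blah.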


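Side note: I recommend using raw strings for your patterns, so that backslashes reach the regex engine unchanged instead of being interpreted as string escapes: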

word = r"\W*?[^,\t ]*?\W*?"
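The difference raw strings make can be checked with a short sketch (Python 3 syntax; the names plain, raw and same_pattern are illustrative only). It compares a hand-escaped pattern with its raw-string form and shows one case, \b, where a non-raw pattern silently means something else:

```python
import re

plain = "\\W*?[^,\\t ]*?\\W*?"  # every backslash doubled by hand
raw = r"\W*?[^,\t ]*?\W*?"       # same characters, no doubling needed
same_pattern = (plain == raw)
print(same_pattern)

# "\t" is one tab character; r"\t" is a backslash followed by the letter t.
tab_len, raw_tab_len = len("\t"), len(r"\t")
print(tab_len, raw_tab_len)

# In a plain string "\b" is a backspace character, so the intended
# word-boundary pattern no longer matches.
plain_hit = re.search("\bcat\b", "a cat")
raw_hit = re.search(r"\bcat\b", "a cat")
print(plain_hit, raw_hit.group())
```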
