How to fix non-greedy regular expression

import re

word = "\W*?[^,\t ]*?\W*?"
quotedSelectedWord = "\W*?\"(.*?)\"\W*?"
leftCurlyBrace = "\W*?\{\W*?"
rightCurlyBrace = "\W*?\}\W*?"
expression = leftCurlyBrace + word + "," + quotedSelectedWord

p = re.compile(expression)

for line in sourceFileList:
    line = line.strip()
    if p.match(line):
        # Replace every match of the pattern with its first capture group
        temp1 = p.sub(r"\1", line)
        print("temp1 = " + temp1 + "\n")

If the first line is (no actual single quotes): '{_blah_blah, "blah-blah", "blah blah blah", false, false, {_blah}, ""},'

Why is temp1 = 'blah-blah, "blah blah blah", false, false},'?

I expected it to be equivalent to the first "group" enclosed in parentheses, which I thought would be 'blah-blah'.

The regular expression finds the pattern not once but twice.

The first one it finds is:

{_blah_blah, "blah-blah"

In this case group(1) (the part you put in parentheses above) is blah-blah, as you determined, and it is used to replace that first part of the string.

But it finds the pattern here too:

, {_blah}, ""

Here group(1), which is still matching .*?, is an empty string. So it replaces that part of the string with nothing, effectively removing it.
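To see both matches directly, a minimal sketch along these lines can help. It assumes Python 3, rebuilds the question's pattern from raw strings (the snake_case names and the `matches` list are only for illustration), and uses `re.finditer` to list each match with its first group:

```python
import re

# Same pattern as in the question, written with raw strings.
word = r"\W*?[^,\t ]*?\W*?"
quoted_selected_word = r'\W*?"(.*?)"\W*?'
left_curly_brace = r"\W*?\{\W*?"
p = re.compile(left_curly_brace + word + "," + quoted_selected_word)

line = '{_blah_blah, "blah-blah", "blah blah blah", false, false, {_blah}, ""},'

# Collect (whole match, group 1) for every non-overlapping match in the line.
matches = [(m.group(0), m.group(1)) for m in p.finditer(line)]
for whole, group1 in matches:
    print(repr(whole), "->", repr(group1))
```

The second pair has an empty group 1, which is why `sub` deletes that stretch of the line.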

This site helped me sort this out.

Here's a site that shows both of these matches being found.

And a link to it with the regex in place.

Update

This website is even more helpful in parsing the regex: http://regex101.com/#python

At this site, enter the regular expression. An important point is to enter the g modifier to the right of it to get all the matches. Next, enter the test string and a substitution of \\1. The site already shows you the matches and substitutions, which is a good start. Now click "regex debugger" on the left.

If you expand this section you'll be able to see exactly how it found the 2 matches.

The Python documentation for re.sub(pattern, repl, string, count=0, flags=0) states:

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.

If we rewrite the for loop a bit:

for line in sourceFileList:
    line = line.strip()
    match = p.match(line)
    if match:
        # group() is the whole matched text; group(1) is the first capture group
        print("whole match = " + match.group())
        print("first group = " + match.group(1))
        temp1 = p.sub(r"\1", line)
        print("temp1 = " + temp1 + "\n")

we get the output:

whole match = {_blah_blah, "blah-blah"
first group = blah-blah
temp1 = blah-blah, "blah blah blah", false, false},

So this means that {_blah_blah, "blah-blah" will be replaced by blah-blah in your original string, which still contains , "blah blah blah", false, false, {_blah}, ""}, at the end.

If you just want to get the first capture group, you can use group(1) as demonstrated above.
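As a rough sketch of that approach (Python 3 assumed; `first_quoted_word` is a hypothetical helper name, not something from the question), the capture can be read straight from the match object instead of going through `sub`:

```python
import re

# The question's pattern, concatenated into a single raw string.
P = re.compile(r'\W*?\{\W*?\W*?[^,\t ]*?\W*?,\W*?"(.*?)"\W*?')


def first_quoted_word(line):
    """Return group 1 of the match at the start of the line, or None if there is no match."""
    m = P.match(line.strip())
    return m.group(1) if m else None


line = '{_blah_blah, "blah-blah", "blah blah blah", false, false, {_blah}, ""},'
print(first_quoted_word(line))
```

Because nothing is substituted here, the rest of the line never ends up in the result.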

Edit:

As twasbrillig's answer points out, there are two replacements. If re.sub is called with count=0, or with the count parameter omitted, all occurrences of the pattern are replaced, not only the first one.
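The effect of the count parameter can be checked with a short sketch like the following (Python 3 assumed; the variable names are illustrative). Passing `count=1` limits `sub` to the leftmost match, so the second occurrence is left in place:

```python
import re

# The question's pattern, concatenated into a single raw string.
p = re.compile(r'\W*?\{\W*?\W*?[^,\t ]*?\W*?,\W*?"(.*?)"\W*?')
line = '{_blah_blah, "blah-blah", "blah blah blah", false, false, {_blah}, ""},'

all_replaced = p.sub(r"\1", line)            # count defaults to 0: replace every match
first_only = p.sub(r"\1", line, count=1)     # replace only the leftmost match

print(all_replaced)
print(first_only)
```

Note that even with `count=1` the result still contains everything after the first match, so this alone does not isolate 'blah-blah'.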


Side note: I recommend using raw strings in your patterns:

word = r"\W*?[^,\t ]*?\W*?"
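A small illustration of why this matters (a sketch assuming Python 3; `\b` is chosen here only as an example and does not appear in the question's pattern): sequences such as `\W` survive in a normal string only because Python does not recognize them as escapes, whereas `\b` is silently turned into a backspace character before the regex engine ever sees it.

```python
import re

# In a raw string the backslash is kept, so the regex engine receives \W.
assert r"\W" == "\\W"

# "\b" in a normal string is a single backspace character, not a word boundary.
assert len("\b") == 1
assert len(r"\b") == 2

raw_hits = re.findall(r"\bblah\b", "blah-blah")   # word-boundary pattern
plain_hits = re.findall("\bblah\b", "blah-blah")  # looks for literal backspaces

print(raw_hits)
print(plain_hits)
```

With raw strings the pattern text is exactly what the `re` module parses, which avoids this class of surprise.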
