Over-matching with regular expression function re.findall() in Python

I am using Python 2 and I have a string as follows.

s = """
f = function(x) sum(is.na(x))

apply(xdat, 2, f)

sum_it = function(xdat) {
    ans = apply(xdat, 2, sum)
    return(ans)
}
"""

I intend to extract the sum_it function, which is enclosed in {}. I used the code below:

import re
print(re.findall(r"\s+[\w._]+ = function\(.+?\)\s*{.+?\n}\s", s, flags=re.DOTALL)[0])

which gives me the wrong result:

f = function(x) sum(is.na(x))

apply(xdat, 2, f)

sum_it = function(xdat) {
    ans = apply(xdat, 2, sum)
    return(ans)
}

I made it explicit that the substring I want should include {}. So why does this regular expression fail to exclude the beginning of the string, which clearly has no {} in it? How can I get this:

sum_it = function(xdat) {
    ans = apply(xdat, 2, sum)
    return(ans)
}

It's the wildcard match in the parentheses. Since you're using "." to match the argument list, and re.DOTALL lets it match newlines too, it can also match parenthesis characters. The engine starts at the first "function(" and keeps extending ".+?" until it reaches a close paren that is followed by optional whitespace and "{", so the first match runs from the first open paren to the close paren of sum_it's argument list. If you change the line to this, it should work (although you get some extra whitespace at the start, and there are better ways to do this match):

print(re.findall(r"\s+[\w._]+ = function\([^)]*\)\s*{.+?\n}\s", s, flags=re.DOTALL)[0])
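One possible "better way", as a minimal sketch rather than a definitive solution: anchor the pattern to line starts with re.MULTILINE so no leading whitespace is captured. It assumes (this is not stated in the question) that every definition begins at the start of a line and that its closing brace sits at the start of a later line. The names pattern and match are only illustrative.

```python
import re

s = """
f = function(x) sum(is.na(x))

apply(xdat, 2, f)

sum_it = function(xdat) {
    ans = apply(xdat, 2, sum)
    return(ans)
}
"""

# Assumption: the definition starts at column 0 and its closing
# brace is the first "}" found at the start of a later line.
pattern = re.compile(
    r"^[\w.]+ = function\([^)]*\)\s*\{.*?^\}",
    flags=re.MULTILINE | re.DOTALL,
)

match = pattern.search(s)
if match:
    print(match.group(0))
```

This still breaks on argument lists that contain ")" (for example default values that call a function) and on nested blocks whose closing brace is at column 0, so it is only suitable for simple input like the string above.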

Update after discussion:

At first glance it would seem that a non-greedy quantifier looks for the matching substring with the fewest characters. The way it really works, though, is that the engine finds the leftmost position where a match can start, and only then does the non-greedy quantifier match as few characters as possible from that position.
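A small illustration of that behaviour, using a made-up string (text, lazy and restricted are hypothetical names, not from the question):

```python
import re

text = "a1 a2b"

# The lazy ".+?" still starts at the first "a" that can lead to a match,
# so the result is not the shortest possible substring.
lazy = re.search(r"a.+?b", text).group(0)

# Excluding "a" from the middle makes the first "a" fail as a start,
# so the match begins at the later "a".
restricted = re.search(r"a[^a]+?b", text).group(0)

print(lazy)        # a1 a2b
print(restricted)  # a2b
```

This is the same idea as replacing ".+?" with "[^)]*" in the answer: restricting what the middle of the pattern may contain is what rules out the earlier starting point.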
