Over-matching with regular expression function re.findall() in Python

I am using Python 2 and I have a string as follows.

s = """
f = function(x) sum(is.na(x))

apply(xdat, 2, f)

sum_it = function(xdat) {
    ans = apply(xdat, 2, sum)
    return(ans)
}
"""

I want to extract the sum_it function, whose body is enclosed in {}. I used the code below:

import re
print(re.findall(r"\s+[\w._]+ = function\(.+?\)\s*{.+?\n}\s", s, flags=re.DOTALL)[0])

which gives me the wrong result:

f = function(x) sum(is.na(x))

apply(xdat, 2, f)

sum_it = function(xdat) {
    ans = apply(xdat, 2, sum)
    return(ans)
}

I thought the pattern made it clear that the substring I want must contain {}. Why does this regular expression fail to exclude the beginning of the string, which clearly has no {} in it? How can I get this:

sum_it = function(xdat) {
    ans = apply(xdat, 2, sum)
    return(ans)
}

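The problem is the wildcard inside the parentheses. Because you use "." to match the argument list, and re.DOTALL lets "." match newlines as well, it can also match parenthesis characters and run across lines. The match starts at the first function( and the lazy .+? keeps expanding until it reaches the first close paren that is followed by {, which is the one in the signature of sum_it. Restricting the argument list to non-paren characters fixes it (although you still get extra whitespace at the start from \s+, and there are better ways to do this match):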

print(re.findall(r"\s+[\w._]+ = function\([^)]*\)\s*{.+?\n}\s", s, flags=re.DOTALL)[0])
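
One possible way to avoid the leading whitespace, offered only as a sketch and not as the answerer's own method: anchor the pattern at the start of a line with re.MULTILINE instead of consuming \s+. The variable names pattern and matches are made up for this illustration, and the [^)]* part assumes the argument list contains no nested parentheses (for example, no default value like c(1, 2)).

```python
import re

s = """
f = function(x) sum(is.na(x))

apply(xdat, 2, f)

sum_it = function(xdat) {
    ans = apply(xdat, 2, sum)
    return(ans)
}
"""

# ^ (with re.MULTILINE) anchors at the start of a line, so no leading
# whitespace is captured. [^)]* keeps the argument list inside a single
# pair of parentheses. re.DOTALL lets .+? span the multi-line body, and
# \n\} ends the match at the first closing brace at the start of a line.
pattern = re.compile(
    r"^[\w.]+ = function\([^)]*\)\s*\{.+?\n\}",
    flags=re.DOTALL | re.MULTILINE,
)

matches = pattern.findall(s)
print(matches[0])
```

With the sample string from the question this leaves a single match that begins at sum_it and ends at the closing brace. A body that itself contains a } at the start of a line would be cut short, so this is only suitable for simple cases like the one shown.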

Update after discussion:

At first glance it would seem that the non-greedy qualifier finds the matching substring with the fewest characters. What really happens is that the regex engine scans from left to right and takes the first starting position at which the pattern can match at all; the non-greedy qualifier then only keeps each quantifier as short as possible from that fixed starting point.
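
The behaviour described above can be seen on a much smaller string. This is a minimal sketch; the sample text and the names lazy and tight are invented for the illustration.

```python
import re

text = "a(1) b(2) {"

# Lazy quantifier: the engine tries start positions from left to right,
# so the match begins at the FIRST "(". From there .+? grows one
# character at a time until ") {" can follow, swallowing "1) b(2".
lazy = re.search(r"\(.+?\) \{", text).group(0)

# Negated class: [^)]* cannot cross a ")", so the attempt starting at
# the first "(" fails and the engine moves on to the second one.
tight = re.search(r"\([^)]*\) \{", text).group(0)

print(repr(lazy))   # '(1) b(2) {'
print(repr(tight))  # '(2) {'
```

The lazy version is not the shortest possible match in the string; it is the shortest match from the leftmost position where any match exists.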
