Over-matching with regular expression function re.findall() in Python

Question

I am using Python 2 and I have a string as follows.

s = """
f = function(x) sum(is.na(x))

apply(xdat, 2, f)

sum_it = function(xdat) {
    ans = apply(xdat, 2, sum)
    return(ans)
}
"""

I want to extract the sum_it function, whose body is enclosed in {}. I used the code below:

import re

print(re.findall(r"\s+[\w._]+ = function\(.+?\)\s*{.+?\n}\s", s, flags=re.DOTALL)[0])

which gives me the wrong result:

f = function(x) sum(is.na(x))

apply(xdat, 2, f)

sum_it = function(xdat) {
    ans = apply(xdat, 2, sum)
    return(ans)
}

The pattern explicitly requires {} to be part of the match. So why does this regular expression fail to exclude the beginning of the string, which clearly contains no {}? How can I get this instead:

sum_it = function(xdat) {
    ans = apply(xdat, 2, sum)
    return(ans)
}

Answer 1

It's the wildcard match in the parentheses. Since you're using "." to match the argument list, and re.DOTALL lets it match newlines as well, it can also match parenthesis characters. The match therefore starts at the first "function(" and runs past every ")" until it reaches the first one that is followed by "{", which is the one belonging to sum_it. If you change the line to this, it should work (although you get extra whitespace at the start, and there are better ways to do this match):

print(re.findall(r"\s+[\w._]+ = function\([^)]*\)\s*{.+?\n}\s", s, flags=re.DOTALL)[0])
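One possible "better way", offered only as a sketch: anchor the pattern to the start of a line with re.MULTILINE instead of consuming leading whitespace, so the result needs no trimming. The names pattern and matches are illustrative, and the sketch assumes the argument list contains no ")" and that the function's closing brace is the first "}" at the start of a line.

```python
import re

s = """
f = function(x) sum(is.na(x))

apply(xdat, 2, f)

sum_it = function(xdat) {
    ans = apply(xdat, 2, sum)
    return(ans)
}
"""

# ^ (with MULTILINE) pins the match to the start of a line, so no leading
# whitespace is captured. [^)]* keeps the argument list from crossing a ")".
# .+? (with DOTALL) then runs to the first "}" that begins a line.
pattern = re.compile(
    r"^[\w.]+ = function\([^)]*\)\s*\{.+?\n\}",
    flags=re.DOTALL | re.MULTILINE,
)

matches = pattern.findall(s)
print(matches[0])
```

This will not cope with default arguments that contain parentheses or with nested definitions whose closing brace also sits at the start of a line; for those cases a real parser is probably the safer choice.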

Update after discussion:

At first glance it would seem that the non-greedy qualifier would look for the matching substring with the fewest characters. The way it really works, though, is that the engine takes the leftmost position at which a match can start, and then matches as few characters as possible from that position.
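A small illustration of that rule, using a made-up string (text, lazy and strict are names chosen for this example only). Both patterns look for a parenthesised group followed by " {", but only the negated character class is prevented from starting at the first "(":

```python
import re

text = "a(1) b(2) {"

# Non-greedy: the match still starts at the first "(", then grows only as
# far as needed for the rest of the pattern to succeed.
lazy = re.search(r"\(.+?\) \{", text).group(0)

# Negated class: the group cannot cross a ")", so the attempt at the first
# "(" fails and the search moves on to the second one.
strict = re.search(r"\([^)]*\) \{", text).group(0)

print(lazy)    # (1) b(2) {
print(strict)  # (2) {
```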

solution1
1 ACCEPTED 2018-04-22 04:10:52
