re.findall gives different results than re.search with the same pattern

I have a str from which I want to extract the substrings inside single quotes ( ' ):

import re

line = "This is a 'car' which has a 'person' in it!"

so I used:

name = re.findall("\'(.+?)\'", line)
print(name[0])
print(name[1])

car
person

But when I try this approach:

pattern = re.compile("\'(.+?)\'")
matches = re.search(pattern, line)
print(matches.group(0))
print(matches.group(1))
# print(matches.group(2))  # <- this produces an error of course

'car'
car

So my question is: why does the pattern behave differently in each case? I know that the former returns "all non-overlapping matches of pattern in string" and the latter returns a match object, which might explain some of the difference, but I would expect the same pattern to give the same results (even if in a different format).

So, to make it more concrete:

  1. In the first case, with findall, the pattern returns all substrings, but in the latter case it only returns the first substring.
  2. In the latter case, matches.group(0) (which corresponds to the whole match according to the documentation) is different from matches.group(1) (which corresponds to the first parenthesized subgroup). Why is that?

re.finditer("\\'(.+?)\\'", line) returns an iterator of match objects, so each item behaves like the result of re.search.

I know that there are similar questions on SO, like this one or this one, but they don't seem to answer why (or at least I did not get it).

You have already read the docs and other answers, so I will give you a hands-on explanation.

Let's first take this example from the Python re documentation:

>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.group(0)       # The entire match
'Isaac Newton'
>>> m.group(1)       # The first parenthesized subgroup.
'Isaac'
>>> m.group(2)       # The second parenthesized subgroup.
'Newton'
>>> m.group(1, 2)    # Multiple arguments give us a tuple.
('Isaac', 'Newton')

If you try the pattern on this website, you will see how its groups correspond to the results above:

[Image: first example]

group(0) returns the full match, while group(1) and group(2) are Group 1 and Group 2 in the picture, respectively. As the documentation says: "Match.group([group1, ...]) Returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, group1 defaults to zero (the whole match is returned)".

Now let's go back to your example

[Image: second example]

As others have said, re.search(pattern, line) finds ONLY the first occurrence of the pattern (the documentation says: "Scan through string looking for the first location where the regular expression pattern produces a match"). Following the previous logic, matches.group(0) outputs the full match, quotes included, and matches.group(1) outputs Group 1, the text between the quotes. It also explains why matches.group(2) raises an error (an IndexError): as you can see in the screenshot, the pattern has only one group, so there is no Group 2 for the first occurrence.
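To make the comparison concrete, here is a minimal sketch using the string and pattern from the question. The variable names (first, whole, inner) are only illustrative. It shows that re.search stops at the first match, while re.finditer yields one match object per occurrence, and that for a pattern with exactly one group, re.findall returns what group(1) would give for each of those match objects.

```python
import re

line = "This is a 'car' which has a 'person' in it!"
pattern = re.compile(r"'(.+?)'")

# re.search stops at the first location where the pattern matches.
first = pattern.search(line)
print(first.group(0))  # whole match, quotes included
print(first.group(1))  # only the text captured by the parentheses

# The pattern defines a single group, so asking for group 2 fails.
try:
    first.group(2)
except IndexError as exc:
    print("group(2) failed:", exc)

# re.finditer yields a match object for every non-overlapping match.
matches = list(pattern.finditer(line))
whole = [m.group(0) for m in matches]
inner = [m.group(1) for m in matches]
print(whole)
print(inner)

# With exactly one group, findall returns the same list as `inner`.
print(pattern.findall(line) == inner)
```

So the pattern itself behaves identically in both calls; the only differences are how many matches are collected and which part of each match is handed back.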

  1. re.findall returns a list of all matches (in this particular example, the first group of each match), while re.search returns only the first, leftmost match.

    As stated in the Python documentation (re.findall):

    Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.

  2. matches.group(0) gives you the whole fragment of the string that matches your pattern, which is why it includes the quotes, while matches.group(1) gives you the first parenthesized part of that fragment, so it does not include the quotes because they are outside the parentheses. Check the Match.group() docs for more information.
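The effect of groups on re.findall can be checked with a short sketch. The string is the one from the question; the zero-group and two-group patterns are variations made up here for illustration and are not part of the original post.

```python
import re

line = "This is a 'car' which has a 'person' in it!"

# No groups: findall returns the whole matches, quotes included.
no_group = re.findall(r"'.+?'", line)

# One group: findall returns only what the group captured.
one_group = re.findall(r"'(.+?)'", line)

# Two groups: findall returns one tuple per match, one item per group.
two_groups = re.findall(r"'(.)(.+?)'", line)

print(no_group)
print(one_group)
print(two_groups)
```

If you want the quotes in the findall result while still using parentheses for grouping, a non-capturing group such as (?:.+?) keeps the pattern at zero capturing groups.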
