Difference between re.findall() and re.finditer() when using groups in regex?

Consider the following string:

text2 = '''
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
'''

I want the regex to match the complete name, for example 'Mr. Schafer'.

Using finditer():

import re

matches = re.finditer(r'(Mr|Ms|Mrs)\.?\s[A-Z]\w*', text2)
for match in matches:
    print(match)

Results:

<_sre.SRE_Match object; span=(1, 12), match='Mr. Schafer'>
<_sre.SRE_Match object; span=(13, 21), match='Mr Smith'>
<_sre.SRE_Match object; span=(22, 30), match='Ms Davis'>
<_sre.SRE_Match object; span=(31, 44), match='Mrs. Robinson'>
<_sre.SRE_Match object; span=(45, 50), match='Mr. T'>

finditer() gives me the results I want, but not in a list.

But when I use findall():

re.findall(r'(Mr|Ms|Mrs)\.?\s[A-Z]\w*', text2)

Results:

['Mr', 'Mr', 'Ms', 'Mrs', 'Mr']

Why is this? How can I get the result I want using findall()? I want this result:

['Mr. Schafer', 'Mr Smith', 'Ms Davis', 'Mrs. Robinson', 'Mr. T']

The list returned by re.findall contains:

  • the text of each match, if the regex has no captures
  • the text of the capture in each match, if the regex has exactly one capture
  • a tuple of substrings corresponding to each capture, if the regex has more than one capture.
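
As a quick illustration of those three cases, here is a minimal sketch on a made-up string (the variable names sample, no_capture, one_capture and two_captures are only for this example):

```python
import re

# Hypothetical sample input, chosen only to keep the three cases easy to compare
sample = 'a1 b2 c3'

# No captures: each item is the whole match
no_capture = re.findall(r'[a-z]\d', sample)

# Exactly one capture: each item is the text of that capture only
one_capture = re.findall(r'([a-z])\d', sample)

# More than one capture: each item is a tuple with one entry per capture
two_captures = re.findall(r'([a-z])(\d)', sample)

print(no_capture)    # ['a1', 'b2', 'c3']
print(one_capture)   # ['a', 'b', 'c']
print(two_captures)  # [('a', '1'), ('b', '2'), ('c', '3')]
```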

A capture is a part of the regular expression surrounded by parentheses, unless you use (?:...); the ?: in this context tells Python's regex library to not consider the parentheses as defining a capture. (It's still used for grouping of course.)

So the simplest (and probably fastest) solution is to make sure the regex has no captures, by using (?:...) to surround the title rather than just (...):

>>> re.findall(r'(?:Mr|Ms|Mrs)\.?\s[A-Z]\w*', text2)
['Mr. Schafer', 'Mr Smith', 'Ms Davis', 'Mrs. Robinson', 'Mr. T']

You could also explicitly capture the complete name:

>>> re.findall(r'((?:Mr|Ms|Mrs)\.?\s[A-Z]\w*)', text2)
['Mr. Schafer', 'Mr Smith', 'Ms Davis', 'Mrs. Robinson', 'Mr. T']

There's not much point doing that in this case, but the "one capture" form can be useful if you want part of the pattern to not show up in the output.
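
For instance, if you only wanted the surnames, you could leave the honorific in a non-capturing group and capture just the name. This is a sketch based on the question's data; the variable name surnames is only for illustration:

```python
import re

text2 = '''
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
'''

# The honorific must still match, but it sits in a non-capturing group,
# so findall returns only the single captured group: the surname.
surnames = re.findall(r'(?:Mr|Ms|Mrs)\.?\s([A-Z]\w*)', text2)
print(surnames)  # ['Schafer', 'Smith', 'Davis', 'Robinson', 'T']
```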

Finally, you might want both the honorific and the surname in a tuple:

>>> re.findall(r'(?:(Mr|Ms|Mrs)\.?\s([A-Z]\w*))', text2)
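[('Mr', 'Schafer'), ('Mr', 'Smith'), ('Ms', 'Davis'), ('Mrs', 'Robinson'), ('Mr', 'T')]

The parentheses "()" mark a capturing group, and findall() returns only the captured part when a pattern contains one.

Add "?:" right after the opening parenthesis to make the group non-capturing: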

import re

text2 = '''
        Mr. Schafer
        Mr Smith
        Ms Davis
        Mrs. Robinson
        Mr. T
        '''
# (?:...) groups the alternatives without capturing, so findall returns whole matches
print(re.findall(r"(?:Mr|Ms|Mrs)\.?\s[A-Z]\w*", text2))
# ['Mr. Schafer', 'Mr Smith', 'Ms Davis', 'Mrs. Robinson', 'Mr. T']

https://regexr.com/ has a cheatsheet on the left side.

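I prefer finditer over findall. finditer returns an iterator of match objects, while findall returns a list of the matched strings (or of the captured groups, if the pattern has any). On large inputs an iterator is more memory-efficient, because a list holds every result in memory at once while an iterator produces matches one at a time. To get the matched text from each match object, just call .group(), which returns the whole match regardless of any capturing groups in the pattern.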

import re

text2 = '''
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
'''


matches = re.finditer(r'(Mr|Ms|Mrs)\.?\s[A-Z]\w*', text2)

match_list = [match.group() for match in matches]
print(match_list)
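
Because each item from finditer is a match object, you can also pull out the individual groups and the position alongside the full match. A minimal sketch, assuming you add a second capturing group for the surname (the names pattern and rows are only for illustration):

```python
import re

text2 = '''
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
'''

# Two capturing groups: group(1) is the honorific, group(2) is the surname
pattern = re.compile(r'(Mr|Ms|Mrs)\.?\s([A-Z]\w*)')

# group(0) is always the whole match; span() gives its (start, end) offsets
rows = [(m.group(0), m.group(1), m.group(2), m.span())
        for m in pattern.finditer(text2)]

for row in rows:
    print(row)
# First row: ('Mr. Schafer', 'Mr', 'Schafer', (1, 12))
```

With findall the same pattern would return only the (honorific, surname) tuples, so finditer is the more convenient choice when you need the full match and its groups together.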
