Consider the following string:
text2 = '''
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
'''
I want a regex that matches the complete name, for example 'Mr. Schafer'.
Using finditer():
import re

matches = re.finditer(r'(Mr|Ms|Mrs)\.?\s[A-Z]\w*', text2)
for match in matches:
    print(match)
Results:
<_sre.SRE_Match object; span=(1, 12), match='Mr. Schafer'>
<_sre.SRE_Match object; span=(13, 21), match='Mr Smith'>
<_sre.SRE_Match object; span=(22, 30), match='Ms Davis'>
<_sre.SRE_Match object; span=(31, 44), match='Mrs. Robinson'>
<_sre.SRE_Match object; span=(45, 50), match='Mr. T'>
finditer() gives me the results I want, but not in a list.
But when I use findall():
re.findall(r'(Mr|Ms|Mrs)\.?\s[A-Z]\w*', text2)
Results:
['Mr', 'Mr', 'Ms', 'Mrs', 'Mr']
Why is this? How can I get the result I want using findall()?
I want this result:
['Mr. Schafer', 'Mr Smith', 'Ms Davis', 'Mrs. Robinson', 'Mr. T']
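The list returned by re.findall contains one element per match: the entire match if the pattern has no captures, the captured string if the pattern has exactly one capture, or a tuple of the captured strings if the pattern has more than one capture.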
A capture is a part of the regular expression surrounded by parentheses, unless you use (?:...); the ?: in this context tells Python's regex library not to consider the parentheses as defining a capture. (They are still used for grouping, of course.)
So the simplest (and probably fastest) solution is to make sure the regex has no captures, by using (?:...) to surround the title rather than just (...):
>>> re.findall(r'(?:Mr|Ms|Mrs)\.?\s[A-Z]\w*', text2)
['Mr. Schafer', 'Mr Smith', 'Ms Davis', 'Mrs. Robinson', 'Mr. T']
You could also explicitly capture the complete name:
>>> re.findall(r'((?:Mr|Ms|Mrs)\.?\s[A-Z]\w*)', text2)
['Mr. Schafer', 'Mr Smith', 'Ms Davis', 'Mrs. Robinson', 'Mr. T']
There's not much point doing that in this case, but the "one capture" form can be useful if you want part of the pattern to not show up in the output.
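As a hedged illustration of the "one capture" form (the surname-only pattern below is my own example, not something from the question): if only the surname is wrapped in a capture and the title stays in a non-capturing group, findall should return just the surnames.

```python
import re

text2 = '''
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
'''

# The title is matched but not captured; the surname is the only capture,
# so findall returns one string (the surname) per match.
surnames = re.findall(r'(?:Mr|Ms|Mrs)\.?\s([A-Z]\w*)', text2)
print(surnames)
# ['Schafer', 'Smith', 'Davis', 'Robinson', 'T']
```

The title still has to be present for a line to match; it just doesn't appear in the result.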
Finally, you might want both the honorific and the surname in a tuple:
>>> re.findall(r'(?:(Mr|Ms|Mrs)\.?\s([A-Z]\w*))', text2)
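[('Mr', 'Schafer'), ('Mr', 'Smith'), ('Ms', 'Davis'), ('Mrs', 'Robinson'), ('Mr', 'T')]
Parentheses (...) mark a capturing group, and findall returns only what the group captured.
Add ?: right after the opening parenthesis to make the group non-capturing.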
import re
text2 = '''
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
'''
print(re.findall(r"(?:Mr|Ms|Mrs)\.?\s[A-Z]\w*", text2))
# ['Mr. Schafer', 'Mr Smith', 'Ms Davis', 'Mrs. Robinson', 'Mr. T']
https://regexr.com/ has a cheatsheet on the left side.
I prefer finditer over findall. finditer returns an iterator of match objects, while findall returns a list of the matched strings (or of the captured groups, if the pattern has any). An iterator can be more memory-efficient than a list, because a list holds all the results in memory at once while an iterator yields them one at a time. To get the matched text from each match object, just call .group().
import re
text2 = '''
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
'''
matches = re.finditer(r'(Mr|Ms|Mrs)\.?\s[A-Z]\w*', text2)
match_list = [match.group() for match in matches]
print(match_list)
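A further sketch, under the assumption that you want both the full name and the captured title: because finditer yields match objects, the capturing group in the original pattern does no harm. The names pattern and rows are my own, purely for illustration.

```python
import re

text2 = '''
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
'''

pattern = re.compile(r'(Mr|Ms|Mrs)\.?\s[A-Z]\w*')

# group(0) is the whole match, group(1) is the captured title,
# and span() gives the (start, end) offsets within text2.
rows = [(m.group(0), m.group(1), m.span()) for m in pattern.finditer(text2)]
for row in rows:
    print(row)
```

With findall the same pattern would return only the titles, so the match-object route is handy when the capture has to stay in the regex.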