edit: This part was solved, but I have one last problem with my code, see last answer.
I have a text file structured as follows:
Name1 (Middlename1) LastName
Birthyear
Name2 (Middlename2) LastName
Birthyear
...
NameN (MiddlenameM) LastName
Birthyear
I'm trying to use a regular expression to find the name and the year automatically, but I don't know how to combine the two patterns, since the two pieces of information are not on the same line:
import re
regexp = re.compile( r'(( )*)(?P<name>([a-zA-Z]*)( [a-zA-Z]+)? ROCHE)\n'
r'(( )*)(?P<year>18\d\d)\n'
)
The two REs are working independently but not together. How am I supposed to do it?
You want one regex that scans a string that spans across two lines. You then want to find successive matches. But first:
Names, at least in English-speaking countries, can contain hyphens (Anne-Marie), apostrophes (O'Donnell), periods (John Q. Public), etc. So I am using a regular expression that allows these characters. Also, people may have more than one middle name. What I am trying to illustrate is how to iterate through name/year pairs; you can customize the actual regex to suit your own particular requirements.
Regex:
^(?P<name>(?:[a-z.'-]+(?:\s+[a-z.'-]+)*))\n(?P<year>\d{4})$
Flags: re.M|re.I
^
Matches the start of a line.
[a-z.'-]+
Matches one or more alphabetic, period, ', or - characters. This is a name element.
(?:\s+[a-z.'-]+)*
Matches one or more whitespace characters followed by a name element. This is repeated 0 or more times. Thus the named group name consists of 1 or more name elements separated by one or more whitespace characters.
\n
Matches a newline.
(?P<year>\d{4})$
Matches 4 digits followed by the end of a line or the end of the string.
The MULTILINE flag changes the ^ and $ anchors so that they match at the start and end of each line, in addition to the start and end of the string.
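To make the effect of the MULTILINE flag concrete, here is a minimal sketch that runs the same anchored pattern with and without re.M. The sample string and the variable names are made up for illustration.

```python
import re

# Hypothetical sample data: two name/year pairs.
sample = "John Doe\n1921\nJane Roe\n1987\n"

# Without re.M, ^ can only match at the very start of the string,
# so a line consisting of 4 digits is never found here.
no_flag = re.findall(r"^\d{4}$", sample)

# With re.M, ^ and $ also match at the start and end of every line.
with_flag = re.findall(r"^\d{4}$", sample, flags=re.M)

print(no_flag)    # []
print(with_flag)  # ['1921', '1987']
```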
The code relies on re.finditer to find successive matches:
import re
text = """John Doe
1921
John Q. Public
1987
Anne-Marie Smith
1989
Paul O'Donnell
2001
J. P. Marquand
1893
"""
regexp = re.compile(r"^(?P<name>(?:[a-z.'-]+(?:\s+[a-z.'-]+)*))\n(?P<year>\d{4})$", flags=re.M|re.I)
for m in regexp.finditer(text):
    name = m['name']
    year = m['year']
    # do something with name and year in the second file. Here we are just printing the values.
    print(name, year)
Prints:
John Doe 1921
John Q. Public 1987
Anne-Marie Smith 1989
Paul O'Donnell 2001
J. P. Marquand 1893
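If you would rather collect the pairs than print them, the same loop can be wrapped in a small helper. This is only a sketch: parse_people and PAIR_RE are hypothetical names, and converting the year to int is an assumption about what you need downstream.

```python
import re

# Same pattern as above (without the redundant outer non-capturing group),
# compiled once at module level.
PAIR_RE = re.compile(
    r"^(?P<name>[a-z.'-]+(?:\s+[a-z.'-]+)*)\n(?P<year>\d{4})$",
    flags=re.M | re.I,
)

def parse_people(text):
    """Return a list of (name, year) tuples, with year converted to int."""
    return [(m["name"], int(m["year"])) for m in PAIR_RE.finditer(text)]

# Hypothetical sample; for a real file, pass the result of f.read() instead.
sample = "John Doe\n1921\nAnne-Marie Smith\n1989\n"
people = parse_people(sample)
print(people)  # [('John Doe', 1921), ('Anne-Marie Smith', 1989)]
```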
You should use re.MULTILINE
r = re.compile(r'(( )*)(?P<name>([a-zA-Z]*)( [a-zA-Z]+)?)(( )+)(?P<surname>([a-zA-Z]+))\n(?P<year>18\d\d)', re.MULTILINE)
m = r.match("""Jan Sebastian Bach
1892""")
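As a quick check of what that match object contains, the sketch below reads the three named groups. It also shows that match returns None when the year is not in the 1800s, because the pattern hard-codes 18\d\d. The second test string is made up for illustration.

```python
import re

r = re.compile(
    r'(( )*)(?P<name>([a-zA-Z]*)( [a-zA-Z]+)?)(( )+)(?P<surname>([a-zA-Z]+))\n(?P<year>18\d\d)',
    re.MULTILINE,
)

m = r.match("Jan Sebastian Bach\n1892")
fields = (m.group('name'), m.group('surname'), m.group('year'))
print(fields)  # ('Jan Sebastian', 'Bach', '1892')

# The year is restricted to 18xx, so a 20th-century year does not match.
miss = r.match("John Doe\n1921")
print(miss)  # None
```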
Update #1: A more complete example that reads the file two lines at a time.
import re
r = re.compile(r'(( )*)(?P<name>([a-zA-Z]*)( [a-zA-Z]+)?)(( )+)(?P<surname>([a-zA-Z]+))\n(?P<year>18\d\d)', re.MULTILINE)
with open('people.txt') as f:
    while True:
        line1 = f.readline()
        line2 = f.readline()
        if not line2:
            break
        m = r.match(line1 + line2)
        if m is None:
            # skip pairs that are not a name followed by an 18xx year
            continue
        print("name:%s, surname:%s, year:%s" % (m.group('name'), m.group('surname'), m.group('year')))
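One possible variation on the two-lines-at-a-time loop is to pair lines with zip over a single iterator. This is a sketch, not part of the original answer: io.StringIO stands in for the opened file so the example is self-contained, and the sample records are invented.

```python
import io
import re

r = re.compile(
    r'(( )*)(?P<name>([a-zA-Z]*)( [a-zA-Z]+)?)(( )+)(?P<surname>([a-zA-Z]+))\n(?P<year>18\d\d)',
    re.MULTILINE,
)

# Stand-in for open('people.txt'); the middle record is deliberately malformed.
fake_file = io.StringIO(
    "Jan Sebastian Bach\n1892\n"
    "not a valid record\n????\n"
    "Anna Roche\n1845\n"
)

records = []
lines = iter(fake_file)
# zip(lines, lines) pulls two consecutive lines from the same iterator per step.
for line1, line2 in zip(lines, lines):
    m = r.match(line1 + line2)
    if m is None:
        continue  # skip pairs that do not match the pattern
    records.append((m.group('name'), m.group('surname'), m.group('year')))

print(records)  # [('Jan Sebastian', 'Bach', '1892'), ('Anna', 'Roche', '1845')]
```

If the file has an odd number of lines, zip silently drops the final unpaired line, which mirrors the break on an empty second line in the loop above.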