How to use a custom regular expression for more than one call?

Question

edit: This part was solved, but I have one last problem with my code, see last answer.

I have a text file structured as follows:

Name1 (Middlename1) LastName
Birthyear
Name2 (Middlename2) LastName
Birthyear
...
NameN (MiddlenameM) LastName
Birthyear

I'm trying to use a regular expression to find the name and the year automatically, but I don't know how to combine the two patterns, since the two pieces of information are not on the same line:

import re
regexp = re.compile(  r'(( )*)(?P<name>([a-zA-Z]*)( [a-zA-Z]+)? ROCHE)\n'
                      r'(( )*)(?P<year>18\d\d)\n'
                   )

Each pattern works on its own, but the two do not work together. How am I supposed to do it?

Answer 1

You want one regex that matches text spanning two lines, and you then want to find successive matches. But first:

Names, at least in English-speaking countries, can contain hyphens (Anne-Marie), apostrophes (O'Donnell), periods (John Q. Public), etc. So I am using a regular expression that allows these characters. Also, people may have more than one middle name. What I am trying to illustrate is how to iterate through name/year pairs; you can customize the actual regex to suit your own particular requirements.

Regex:

^(?P<name>(?:[a-z.'-]+(?:\s+[a-z.'-]+)*))\n(?P<year>\d{4})$  Flags: re.M|re.I

^ Matches the start of a line.
[a-z.'-]+ Matches one or more letters, periods, apostrophes, or hyphens. This is a name element.
(?:\s+[a-z.'-]+)* Matches one or more whitespace characters followed by a name element, repeated zero or more times. Thus the named group name consists of one or more name elements separated by whitespace.
\n Matches a newline.
(?P<year>\d{4})$ Matches four digits followed by the end of a line or the end of the string.

The MULTILINE flag (re.M) makes the ^ and $ anchors match at the start and end of each line, in addition to the start and end of the string. The IGNORECASE flag (re.I) lets [a-z] match uppercase letters as well.
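
To make the effect of the flag concrete, here is a minimal sketch. The sample string and the simplified pattern are made up for illustration and are not part of the answer's regex.

```python
import re

# Hypothetical two-record sample, used only to illustrate the flag.
text = "Anne\n1989\nPaul\n2001"

# Without re.M, ^ and $ anchor only at the start and end of the whole string,
# so no single line of letters can satisfy both anchors here.
plain = re.findall(r"^[A-Za-z]+$", text)

# With re.M, ^ and $ also anchor at every line boundary.
multi = re.findall(r"^[A-Za-z]+$", text, flags=re.M)

print(plain)  # []
print(multi)  # ['Anne', 'Paul']
```

The same anchors behave differently only because of the flag; the pattern itself is unchanged.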

The code relies on re.finditer to find successive matches:

import re

text = """John Doe
1921
John Q. Public
1987
Anne-Marie Smith
1989
Paul O'Donnell
2001
J. P. Marquand
1893
"""

regexp = re.compile(r"^(?P<name>(?:[a-z.'-]+(?:\s+[a-z.'-]+)*))\n(?P<year>\d{4})$", flags=re.M|re.I)
for m in regexp.finditer(text):
    name = m['name']
    year = m['year']
    # do something with name and year in the second file. Here we are just printing the values.
    print(name, year)

Prints:

John Doe 1921
John Q. Public 1987
Anne-Marie Smith 1989
Paul O'Donnell 2001
J. P. Marquand 1893
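
If you need the pairs later rather than printing them straight away, one option is to compile the pattern once and wrap the loop in a small helper. This is only a sketch: the names PAIR_RE and parse_people are hypothetical, and converting the year to int is an assumption about what you want to do with it.

```python
import re

# Same pattern as above, compiled once so it can be reused for many calls.
PAIR_RE = re.compile(
    r"^(?P<name>[a-z.'-]+(?:\s+[a-z.'-]+)*)\n(?P<year>\d{4})$",
    flags=re.M | re.I,
)

def parse_people(text):
    """Return a list of (name, year) tuples for every name/year pair in text."""
    return [(m["name"], int(m["year"])) for m in PAIR_RE.finditer(text)]

sample = "John Doe\n1921\nAnne-Marie Smith\n1989\n"
people = parse_people(sample)
print(people)  # [('John Doe', 1921), ('Anne-Marie Smith', 1989)]
```

To run it on a file, read the whole file into one string first (for example with open(path).read()) and pass that string to parse_people, so that the pattern can see both lines of each record.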

Answer 2

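Compile a single pattern that contains \n so that it covers both lines. re.MULTILINE is passed below, but it only changes how ^ and $ behave, and this pattern uses neither, so the flag is optional here: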

r = re.compile(r'(( )*)(?P<name>([a-zA-Z]*)( [a-zA-Z]+)?)(( )+)(?P<surname>([a-zA-Z]+))\n(?P<year>18\d\d)', re.MULTILINE)

m = r.match("""Jan Sebastian  Bach
1892""")
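
As a quick check of what that match object holds, the sketch below reads the named groups back out. The second input string is a made-up example showing that match returns None when the year does not start with 18.

```python
import re

r = re.compile(r'(( )*)(?P<name>([a-zA-Z]*)( [a-zA-Z]+)?)(( )+)(?P<surname>([a-zA-Z]+))\n(?P<year>18\d\d)', re.MULTILINE)

m = r.match("Jan Sebastian  Bach\n1892")
if m is not None:
    # Named groups are read with m.group('...').
    print(m.group('name'), m.group('surname'), m.group('year'))  # Jan Sebastian Bach 1892

# Hypothetical non-matching input: the pattern only accepts years 1800-1899.
no_match = r.match("Jan Bach\n1992")
print(no_match)  # None
```

Because match can return None, it is worth testing the result before calling group on it.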

Update #1: A more complete example that reads the file two lines at a time.

import re

r = re.compile(r'(( )*)(?P<name>([a-zA-Z]*)( [a-zA-Z]+)?)(( )+)(?P<surname>([a-zA-Z]+))\n(?P<year>18\d\d)', re.MULTILINE)

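with open('people.txt') as f:
    while True:
        line1 = f.readline()  # name line
        line2 = f.readline()  # year line
        if not line2:
            break  # end of file, or a trailing name with no year
        m = r.match(line1 + line2)
        if m is None:
            continue  # this pair of lines does not fit the pattern; skip it
        print("name:%s, surname:%s, year:%s" % (m.group('name'), m.group('surname'), m.group('year')))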
