
how to use custom regular expression for more than one call?

edit: This part was solved, but I have one last problem with my code, see last answer.

I have a text file structured as follows:

Name1 (Middlename1) LastName
Birthyear
Name2 (Middlename2) LastName
Birthyear
...
NameN (MiddlenameM) LastName
Birthyear

I'm trying to use regular expressions to find the name and the year automatically, but I don't know how to combine the two patterns, since the two pieces of information are not on the same line:

import re
regexp = re.compile(  r'(( )*)(?P<name>([a-zA-Z]*)( [a-zA-Z]+)? ROCHE)\n'
                      r'(( )*)(?P<year>18\d\d)\n'
                   )

The two REs are working independently but not together. How am I supposed to do it?

You want a single regex that matches a string spanning two lines, and you then want to find successive matches. But first:

Names, at least in English-speaking countries, can contain hyphens (Anne-Marie), apostrophes (O'Donnell), periods (John Q. Public), etc. So I am using a regular expression that allows these characters. Also, people may have more than one middle name. What I am trying to illustrate is how to iterate through name/year pairs; you can customize the actual regex to suit your own particular requirements.

Regex:

^(?P<name>(?:[a-z.'-]+(?:\s+[a-z.'-]+)*))\n(?P<year>\d{4})$  Flags: re.M|re.I
  1. ^ Matches the start of a line.
  2. [a-z.'-]+ Matches one or more letters, periods, apostrophes, or hyphens. This is a name element.
  3. (?:\s+[a-z.'-]+)* Matches one or more whitespace characters followed by a name element. This is repeated 0 or more times. Thus the named group name consists of 1 or more name elements separated by one or more whitespace characters.
  4. \n Matches a newline.
  5. (?P<year>\d{4})$ Matches 4 digits followed by the end of a line or the end of the string.

The MULTILINE flag changes the ^ and $ anchors so that, in addition to matching at the start and end of the string, they also match at the start and end of each line.
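
To make the effect of the flag concrete, here is a minimal sketch. The three-line sample string is made up for illustration; only re.findall and re.M from the standard library are used.

```python
import re

# Hypothetical sample text: three lines, no trailing newline.
text = "alpha\nbeta\ngamma"

# Without re.M, ^ and $ only anchor at the start and end of the whole string,
# so a pattern for "one whole line of word characters" cannot match here.
without_flag = re.findall(r"^\w+$", text)

# With re.M, ^ and $ also anchor at the start and end of every line.
with_flag = re.findall(r"^\w+$", text, flags=re.M)

print(without_flag)  # []
print(with_flag)     # ['alpha', 'beta', 'gamma']
```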

The code relies on re.finditer to find successive matches:

import re

text = """John Doe
1921
John Q. Public
1987
Anne-Marie Smith
1989
Paul O'Donnell
2001
J. P. Marquand
1893
"""

regexp = re.compile(r"^(?P<name>(?:[a-z.'-]+(?:\s+[a-z.'-]+)*))\n(?P<year>\d{4})$", flags=re.M|re.I)
for m in regexp.finditer(text):
    name = m['name']
    year = m['year']
    # do something with name and year in the second file. Here we are just printing the values.
    print(name, year)

Prints:

John Doe 1921
John Q. Public 1987
Anne-Marie Smith 1989
Paul O'Donnell 2001
J. P. Marquand 1893
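
If you want to stay closer to the pattern in the question (a fixed last name ROCHE, a year in the 1800s, optional leading spaces), the same finditer approach can be narrowed down. This is only a sketch: the sample text is invented, and using spaces rather than \s for the padding is my own choice so that the padding cannot swallow a newline.

```python
import re

# Assumption: records look like the question's, with last name ROCHE and a year 18xx.
pattern = re.compile(
    r"^ *(?P<name>[A-Za-z.'-]+(?: +[A-Za-z.'-]+)* +ROCHE) *\n"
    r" *(?P<year>18\d\d) *$",
    flags=re.M,
)

# Hypothetical sample data; only two of the four records satisfy the pattern.
text = """Jean Paul ROCHE
1850
Marie DUPONT
1861
  Anne-Marie ROCHE
  1899
Luc ROCHE
1901
"""

# Collect (name, year) pairs, converting the year to an int.
pairs = [(m["name"], int(m["year"])) for m in pattern.finditer(text)]
print(pairs)  # [('Jean Paul ROCHE', 1850), ('Anne-Marie ROCHE', 1899)]
```

Records that do not fit (a different last name, or a year outside 1800-1899) are simply skipped, because finditer only yields the places where the whole two-line pattern matches.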

You should use re.MULTILINE

r = re.compile(r'(( )*)(?P<name>([a-zA-Z]*)( [a-zA-Z]+)?)(( )+)(?P<surname>([a-zA-Z]+))\n(?P<year>18\d\d)', re.MULTILINE)

m = r.match("""Jan Sebastian  Bach
1892""")
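
To see what the match above captures, you can inspect the named groups. The sketch below also compiles the same pattern without the flag: since this particular pattern contains no ^ or $ anchors, re.MULTILINE should not change the result here, and the \n in the pattern is what lets it span two lines.

```python
import re

# Same pattern and sample as above.
pattern = r'(( )*)(?P<name>([a-zA-Z]*)( [a-zA-Z]+)?)(( )+)(?P<surname>([a-zA-Z]+))\n(?P<year>18\d\d)'
sample = "Jan Sebastian  Bach\n1892"

with_flag = re.compile(pattern, re.MULTILINE).match(sample)
without_flag = re.compile(pattern).match(sample)

# Named groups captured from the two-line string.
print(with_flag.group('name'))     # Jan Sebastian
print(with_flag.group('surname'))  # Bach
print(with_flag.group('year'))     # 1892

# The flag makes no difference for a pattern without ^ or $.
print(with_flag.groupdict() == without_flag.groupdict())  # True
```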

Update #1: A more complete example that reads the file two lines at a time.

import re

r = re.compile(r'(( )*)(?P<name>([a-zA-Z]*)( [a-zA-Z]+)?)(( )+)(?P<surname>([a-zA-Z]+))\n(?P<year>18\d\d)', re.MULTILINE)

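with open('people.txt') as f:
    while True:
        # Read one record: a name line followed by a year line.
        line1 = f.readline()
        line2 = f.readline()
        if not line2:
            break  # end of file (or a dangling name line with no year)
        m = r.match(line1 + line2)
        if m is None:
            continue  # skip pairs that do not match the pattern
        print("name:%s, surname:%s, year:%s" % (m.group('name'), m.group('surname'), m.group('year')))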
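
The same pairwise reading can also be written with zip over a single iterator, which hands out lines two at a time. This is a sketch under assumptions: read_people and pair_re are names I made up, the pattern is a trimmed version of the one above without the unused groups, and io.StringIO stands in for the real file so the example runs on its own.

```python
import io
import re

# Trimmed version of the pattern above (hypothetical name pair_re).
pair_re = re.compile(
    r" *(?P<name>[a-zA-Z]*(?: [a-zA-Z]+)?) +(?P<surname>[a-zA-Z]+)\n(?P<year>18\d\d)"
)

def read_people(lines):
    """Return (name, surname, year) tuples from an iterable of lines."""
    it = iter(lines)
    people = []
    # zip(it, it) pulls two consecutive lines per step; a trailing odd line is dropped.
    for name_line, year_line in zip(it, it):
        m = pair_re.match(name_line + year_line)
        if m:  # skip pairs that do not fit the pattern
            people.append((m["name"], m["surname"], m["year"]))
    return people

# Stand-in for open('people.txt'); the contents are invented.
sample = io.StringIO(
    "Jan Sebastian  Bach\n1892\nnot a name line 123\n1850\nAnna Roche\n1801\n"
)
people = read_people(sample)
print(people)  # [('Jan Sebastian', 'Bach', '1892'), ('Anna', 'Roche', '1801')]
```

With a real file you would pass the open file object instead of the StringIO, for example read_people(f) inside a with open(...) block.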


 