
Regular expression - Trouble with identifying the appropriate character class needed (Python)

I'm trying to write a regular expression that extracts only the first names from the document linked below. For now I only care about the data inside the table; I haven't used the text below it yet, but will work that in eventually.

https://automatetheboringstuff.com/files/examplePhoneEmailDirectory.pdf

Here is the code I have so far:

import re
import pyperclip

nameRegex = re.compile(r'''
 
[a-zA-Z]+    # first name
\s          # space
[a-zA-Z]+    # last name

''', re.VERBOSE)

text = pyperclip.paste()
extractedText = nameRegex.findall(text)

print(extractedText)

The problem that I'm facing is that when I run the code I get something like the following: Jessie Mckayjmckay

The match contains the first name, the last name, and the letters of the email address, stopping at the first digit. I tried to fix this by adding a negated character class, [^\s], thinking the pattern would recognise the whitespace after the last name and stop there. However, this does not work; I suspect it has something to do with the formatting of the document.
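A minimal reproduction of the symptom, as a sketch: it assumes the pasted text has no whitespace between the last name and the email address, which would explain the output above. The sample line is made up for illustration and is not copied from the PDF.

```python
import re

# Same pattern as in the question, written on one line.
name_regex = re.compile(r'[a-zA-Z]+\s[a-zA-Z]+')

# Hypothetical sample: the last name runs straight into the email address.
sample = "Jessie Mckayjmckay9@example.com"

# [a-zA-Z]+ is greedy, so the "last name" part keeps consuming letters
# until it reaches a non-letter, here the digit 9.
extracted = name_regex.findall(sample)
print(extracted)  # ['Jessie Mckayjmckay']

# There is no whitespace after "Mckay" for a class like [^\s] to stop at.
has_whitespace_after_name = bool(re.search(r'Mckay\s', sample))
print(has_whitespace_after_name)  # False
```

If that assumption holds, a class such as [^\s] cannot help, because the boundary between name and email is not marked by any whitespace character.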

Would anyone be able to help me on this?

After printing the pasted data, I noticed that your regex also matches emails and first names together, because there is a space between them. So my idea is to require that the first letter of each word is a capital (the data is formatted this way):


import re
import pyperclip

nameRegex = re.compile(r'''
 
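[A-Z][a-z]+    # first name: one capital, then lowercase letters
\s+            # one or more whitespace characters
[A-Z][a-z]+    # last name: one capital, then lowercase letters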

''', re.VERBOSE)

text = pyperclip.paste()

extractedText = nameRegex.findall(text)

print(extractedText)

Try that:)

I got the following result (only part of it is shown):

['Jessie Mckay', 'Tom Jordan', 'Clayton Cross', 'Rayford Sutton', 'Jerome Gentry', 'Weldon Camacho', 'Quinton Franks', 'Adam Hubbard', 'Jarred Fox', 'Arnoldo Parker', 'Sid Mcdaniel', 'Raymon Combs', 'Ervin Francis', 'Gilberto Austin', 'Lino Barlow', 'Stacey Shepherd', 'Roscoe Terry', 'Eddie Meadows', 'Carlos Simpson', 'Jerome Manning', 'Hong Erickson', 'Burt Graham', 'Mario Sloan', 'Jeffry Mcintosh', 'Owen Malone', 'Jamar Gilbert', 'Guadalupe Ramsey', 'Chet Ramsey', 'Lester Finch', 'Mason Marquez', 'Olen Boyer', 'Sherman Gamble', 'Gerry Mccarthy', 'Jon Jefferson', 'Cristopher Maddox', 'Abel Talley', 'Jerrod Hurst', 'Ezra Pickett', 'Delbert Mcintyre', 'Tom Wilkins', 'Deandre Schneider', 'Louie Gross', 'Cary Mathews', 'Clinton Hernandez', 'Sylvester Goodman', 'Efren Daniels', 'Myles Knapp', 'Trey Hendrix', 'Gerardo Gonzales', 'Collin Wilkinson', 'Hubert Moore', 'Rudolph Joyce', 'Raymundo Griffin', 'Stanton Burris', 'Newton Huff', 'Lonnie Gibson', 'Newton Mendez', 'Dominic Kane', 'Rey Alvarado', 'Maxwell Pittman', 'Freddy Nolan', 'Quentin Kane', 'Marcelo Owens', 'Saul Warner', 'Giuseppe Edwards', 'Glen Duffy', 'Johnson Bird', 'Lon Mays', 'Orval Jones', 'Stefan Wiley', 'Dewayne Vincent', 'Elmo Morton', 'Trenton Randolph', 'Alonzo Noble', 'Stephan Callahan', 'Merrill Morin', 'Antonia Vasquez', 'Jerrod Horne', 'Sammie Blanchard', 'Renaldo Nielsen', 'Rick Logan', 'Xavier Sexton', 'Delmer Chambers', 'Melvin Dixon', 'Randell Wright', 'Kasey Mcbride', 'Long Cohen', 'Hunter Walton', 'Jacques Dean', 'Nicky Cleveland', 'Heath Reeves', 'Dannie Castro', 'Malcolm Pickett', 'Emil Bryant', 'Lonny Trevino',
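Since the question asks for the first names only, one possible variation is to wrap the first-name part in a capture group: when a pattern contains exactly one group, re.findall returns just the text of that group. This is a sketch rather than a tested solution for the real PDF; the two sample lines are invented, and first_name_regex is a name introduced here for illustration.

```python
import re

first_name_regex = re.compile(r'''
    ([A-Z][a-z]+)   # first name (captured)
    \s+             # one or more whitespace characters
    [A-Z][a-z]+     # last name (matched, not captured)
''', re.VERBOSE)

# Hypothetical samples: one with whitespace before the email, one without.
spaced = "Jessie Mckay jmckay9@example.com\nTom Jordan tjordan@example.com"
glued = "Jessie Mckayjmckay9@example.com"

first_names_spaced = first_name_regex.findall(spaced)
first_names_glued = first_name_regex.findall(glued)

print(first_names_spaced)  # ['Jessie', 'Tom']
print(first_names_glued)   # ['Jessie']
```

Because only the first name is captured, the result is the same whether or not the last name runs into the email address in the pasted text.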


 