
How can I improve my code that uses regex to extract names, telephone numbers and emails?

I'm teaching myself to code in Python using the book Automate the Boring Stuff. One of the projects is to use regex to extract information from a data sheet. The code provided in the book doesn't work well on it. I think the example data has changed, so I adapted the code to remove the errors.

Example of the errors: the extracted email address was 1852nvinson8@hotmail.com and the extracted name was comMilton Wade.

I added a couple of lines to remove the com/net from the name and the 4 digits from the start of the email address.

(\d{4})                           # cheat code to remove digits
([A-Z][a-z]+)                     # starts with capital - first name

Is there a better way to extract the data without adding code to remove the errors?

Sample of example data

Norbert Vinson385-868-1852nvinson8@hotmail.comMilton Wade931-883-8104mwade90@gmail.comLauren Barnett573-991-4106lbarnett80@sbcglobal.netCary Kirby859-271-7097ckirby9@msn.comBiostatisticianClark Salinas845-641-5553csalinas16@mac.comOfficerHugo Cross500-760-4858hcross@optonline.netAssistantDomenic Molina256-975-9610dmolina@me.com

My code:


import re, pyperclip

# create regex for name

nameRegex = re.compile(r'''(
([A-Z][a-z]+)                    # starts with capital - first name
\s                               # space
([a-zA-Z]+)                      # last name

)''', re.VERBOSE)

# Create a regex for phone numbers

# 415-555-0000, 555-0000, (415) 555-0000, 555-0000 ext 12345, ext. 12345, x12345
phoneRegex = re.compile(r'''(
(\d{3}|\(\d{3}\))?                # area code
(\s|-|\.)?                        # separator
(\d{3})                           # first 3 digits
(\s|-|\.)                         # separator
(\d{4})                           # last 4 digits
(\s*(ext|x|ext\.)\s*(\d{2,5}))?   # extension
)''', re.VERBOSE)


# Create a regex for email addresses

# some.+_thing@something.com
emailRegex = re.compile(r'''(
(\d{4})                           # cheat to remove digits
([a-zA-Z0-9_.+]+)                 # name part
(@+)                              # @ symbol
([a-zA-Z0-9_.+]+)                 # domain name
(\.com|\.net+)                    # cheat for TLD
)''', re.VERBOSE)

# Find all Matches in the Clipboard Text
text = str(pyperclip.paste())

matches = []

for groups in nameRegex.findall(text):
    matches.append(groups[0])

for groups in phoneRegex.findall(text):
    phoneNumbers = '-'.join([groups[1],groups[3], groups[5]])
    matches.append(phoneNumbers)
    
 
for groups in emailRegex.findall(text):
    emailAddress = ''.join([groups[2],groups[3],groups[4],groups[5]])
    matches.append(emailAddress)
    

# print the extracted email/phone 

if len(matches) > 0:
    print('\n'.join(matches))
else:
    print('No phone numbers or email addresses found.')
print(matches)

I can suggest this way of extracting the data, "one rule to rule them all" :)

(?P<Position>[A-Z][a-z]+){0,}(?P<FullName>(?P<FirstName>[A-Z][a-z]+)\s(?P<LastName>[A-Z][a-z]+))(?P<SocialNumber>\d{3}-\d{3}-\d{4})(?P<email>[\w\.-]+@[\w]+\.[a-z]+)

You can try it live at https://regex101.com/r/h7kW07/1
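For anyone who wants to run the single pattern above from Python rather than on regex101, here is a minimal sketch. The pattern is copied unchanged from this answer; the names `RECORD`, `SAMPLE` and `extract_records` are made up for illustration, and the sample text is the one from the question. It assumes a job title, when present, is glued to the front of the name that follows it.

```python
import re

# Pattern copied from the answer above, split across lines for readability.
# The group names (Position, FullName, ...) are the answer's own.
RECORD = re.compile(
    r"(?P<Position>[A-Z][a-z]+){0,}"
    r"(?P<FullName>(?P<FirstName>[A-Z][a-z]+)\s(?P<LastName>[A-Z][a-z]+))"
    r"(?P<SocialNumber>\d{3}-\d{3}-\d{4})"
    r"(?P<email>[\w\.-]+@[\w]+\.[a-z]+)"
)

# Sample data from the question (adjacent literals are concatenated as-is).
SAMPLE = (
    "Norbert Vinson385-868-1852nvinson8@hotmail.com"
    "Milton Wade931-883-8104mwade90@gmail.com"
    "Lauren Barnett573-991-4106lbarnett80@sbcglobal.net"
    "Cary Kirby859-271-7097ckirby9@msn.com"
    "BiostatisticianClark Salinas845-641-5553csalinas16@mac.com"
    "OfficerHugo Cross500-760-4858hcross@optonline.net"
    "AssistantDomenic Molina256-975-9610dmolina@me.com"
)


def extract_records(text):
    """Return one dict per record, keyed by the pattern's group names."""
    return [m.groupdict() for m in RECORD.finditer(text)]


records = extract_records(SAMPLE)
for rec in records:
    # Position is None when the record has no job title in front of the name.
    print(rec["Position"], rec["FullName"], rec["SocialNumber"], rec["email"], sep=" | ")
```

Because each match consumes a whole record, the phone digits can no longer leak into the email and the TLD can no longer leak into the next name, so no "cheat" groups are needed.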

By the way, if you try to create a new email account at Yahoo with your example (some.+_thing@something.com), it fails with the error: "You can only use letters, numbers, periods ('.'), and underscores ('_') in your username." So a good pattern for the name part in this case would be something like ([\w.]+)

P.S. That pattern won't match emails containing '+'.
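To make that trade-off concrete, here is a small sketch comparing the stricter character class with one that also allows '+'. The names `STRICT_LOCAL`, `PERMISSIVE_LOCAL` and `is_valid_local` are hypothetical, and the check only covers the part before the @, not a full email address.

```python
import re

# Letters, digits, underscores and periods only (the class suggested above).
STRICT_LOCAL = re.compile(r"[\w.]+")
# Same class with '+' added, closer to the one in the question's code.
PERMISSIVE_LOCAL = re.compile(r"[\w.+]+")


def is_valid_local(local_part, pattern):
    """True if the whole local part consists of characters the pattern allows."""
    return pattern.fullmatch(local_part) is not None


for candidate in ("nvinson8", "some.+_thing"):
    print(
        candidate,
        is_valid_local(candidate, STRICT_LOCAL),
        is_valid_local(candidate, PERMISSIVE_LOCAL),
    )
```

Which class to use depends on the data: the sample in the question has no '+' addresses, so the stricter class would be enough there.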

Since the different parts you want (e.g. position, name, number, email) run together in the text, it is more reliable to extract all of them with one pattern, which could be:

(?P<POS>[A-Z][a-z]+)?(?P<NME>(?:[A-Z][a-z\s]+)+)(?P<NUM>(?:\d+\-?)+)(?P<EML>\w+\@\w+(?:\.com|\.net))

You may try it interactively at the following website:

https://regex101.com/r/fcwzfx/1
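If you prefer the commented re.VERBOSE layout used in the question, the same pattern could be written roughly like this. This is only a sketch: `recordRegex`, `sampleText`, `records` and `matches` are illustrative names, the sample text is the one from the question, and pyperclip is left out so the snippet runs on its own.

```python
import re

# Same pattern as above, laid out with comments in the book's re.VERBOSE style.
recordRegex = re.compile(r'''
    (?P<POS>[A-Z][a-z]+)?               # optional job title glued to the name
    (?P<NME>(?:[A-Z][a-z\s]+)+)         # capitalised words, whitespace allowed
    (?P<NUM>(?:\d+\-?)+)                # runs of digits joined by hyphens
    (?P<EML>\w+\@\w+(?:\.com|\.net))    # email ending in .com or .net
''', re.VERBOSE)

# Sample data from the question (adjacent literals are concatenated as-is).
sampleText = (
    "Norbert Vinson385-868-1852nvinson8@hotmail.com"
    "Milton Wade931-883-8104mwade90@gmail.com"
    "Lauren Barnett573-991-4106lbarnett80@sbcglobal.net"
    "Cary Kirby859-271-7097ckirby9@msn.com"
    "BiostatisticianClark Salinas845-641-5553csalinas16@mac.com"
    "OfficerHugo Cross500-760-4858hcross@optonline.net"
    "AssistantDomenic Molina256-975-9610dmolina@me.com"
)

records = []
matches = []
for m in recordRegex.finditer(sampleText):
    # NME may pick up stray whitespace, so strip it; POS is None when absent.
    record = (m.group('POS'), m.group('NME').strip(), m.group('NUM'), m.group('EML'))
    records.append(record)
    matches.append(', '.join(part for part in record if part))

if matches:
    print('\n'.join(matches))
else:
    print('No records found.')
```

Note that this pattern only accepts addresses ending in .com or .net and a local part without periods, so it would need widening for other data.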
