简体   繁体   中英

Python extract both names *AND* emails from body using regex in a single swoop

Python3

I need help creating a regex to extract names and emails from a forwarded email body, which will look similar to this always (real emails replaced by dummy emails):

> Begin forwarded message:
> Date: December 20, 2013 at 11:32:39 AM GMT-3
> Subject: My dummy subject
> From: Charlie Brown <aaa@aa-aaa.com>
> To: maria.brown@aaa.com, George Washington <george@washington.com>, =
thomas.jefferson@aaa.com, thomas.alva.edison@aaa.com, Juan =
<juan@aaa.com>, Alan <alan@aaa.com>, Alec <alec@aaa.com>, =
Alejandro <aaa@aaa.com>, Alex <aaa@planeas.com>, Andrea =
<andrea.mery@thomsen.cl>, Andrea <andrea.22@aaa.com>, Andres =
<andres@aaa.com>, Andres <avaldivieso@aaa.com>
> Hi,
> Please reply ASAP with your RSVP
> Bye

My first step was extracting all emails to a list with a custom function that I pass the whole email body to, like so:

def extract_emails(block_of_text):
 t = r'\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b'
 return re.findall(t, block_of_text)

A couple of days ago I asked a question about extracting names using regex to help me build the function to extract all the names. My idea was to join both later on. I accepted an answer that performed what I asked, and came up with this other function:

def extract_names(block_of_text):
 p = r'[:,] ([\w ]+) \<'
 return re.findall(p, block_of_text)

My problem now was to make the extracted names match the extracted emails, mainly because sometimes there are less names than emails. So I thought, I could better try to build another regex to extract both names and emails,

This is my failed attempt to build such a regex.

[:,]([\w \<]+)([\w.-]+@[\w.-]+\.[\w.-]+)

REGEX101 LINK

Can anyone help and propose a nice, clean regex that grabs both name and email, to a list or dictionary of tuples? Thanks

EDIT: The expected output of the regex in Python would be a list like this:

 [(Charlie Brown', 'aaa@aaa.com'),('','maria.brown@aaa.com'),('George Washington', 'george@washington.com'),('','thomas.jefferson@aaa.com'),('','thomas.alva.edison@aaa.com'),('Juan','juan@aaa.com',('Alan', 'alan@aaa.com'), ('Alec', 'alec@aaa.com'),('Alejandro','aaa@aaa.com'),('Alex', 'aaa@aaa.com'),('Andrea','andrea.mery@thomsen.cl'),('Andrea','andrea.22@aaa.com',('Andres','andres@aaa.com'),('Andres','avaldivieso@aaa.com')] 

Seems like you want something like this.,

[:,]\s*=?\s*(?:([A-Z][a-z]+(?:\s[A-Z][a-z]+)?))?\s*=?\s*.*?([\w.]+@[\w.-]+)

DEMO

>>> import re
>>> s = """ > Begin forwarded message:
>=20
> Date: December 20, 2013 at 11:32:39 AM GMT-3
> Subject: My dummy subject
> From: Charlie Brown <aaa@aa-aaa.com>
> To: maria.brown@aaa.com, George Washington <george@washington.com>, =
thomas.jefferson@aaa.com, thomas.alva.edison@aaa.com, Juan =
<juan@aaa.com>, Alan <alan@aaa.com>, Alec <alec@aaa.com>, =
Alejandro <aaa@aaa.com>, Alex <aaa@planeas.com>, Andrea =
<andrea.mery@thomsen.cl>, Andrea <andrea.22@aaa.com>, Andres =
<andres@aaa.com>, Andres <avaldivieso@aaa.com>
> Hi,
> Please reply ASAP with your RSVP
> Bye"""
>>> re.findall(r'[:,]\s*=?\s*(?:([A-Z][a-z]+(?:\s[A-Z][a-z]+)?))?\s*=?\s*.*?([\w.]+@[\w.-]+)', s)
[('Charlie Brown', 'aaa@aa-aaa.com'), ('', 'maria.brown@aaa.com'), ('George Washington', 'george@washington.com'), ('', 'thomas.jefferson@aaa.com'), ('', 'thomas.alva.edison@aaa.com'), ('Juan', 'juan@aaa.com'), ('Alan', 'alan@aaa.com'), ('Alec', 'alec@aaa.com'), ('Alejandro', 'aaa@aaa.com'), ('Alex', 'aaa@planeas.com'), ('Andrea', 'andrea.mery@thomsen.cl'), ('Andrea', 'andrea.22@aaa.com'), ('Andres', 'andres@aaa.com'), ('Andres', 'avaldivieso@aaa.com')]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM