
Python - regex to grab specific lines from text

I need to grab specific details from email bodies as they are parsed in. In this case the emails are plain text and formatted like so:

imbad@regex.com
John Doe
+16073948374
2021-04-27T15:38:11+0000
14904

The above is an example output of print(body) parsed in from an email like so:

import email

def parseEmail(popServer, msgNum):
    # retr() returns (response, lines, octets); the lines are bytes objects
    raw_message = popServer.retr(msgNum)[1]
    str_message = email.message_from_bytes(b'\n'.join(raw_message))
    body = str(str_message.get_payload())

So, if I needed to simply grab the email address and phone number from the body object, how might I do that using regex?

I understand regex is most certainly overkill for this. However, I'm only repurposing an existing in-house utility that's already written to use regex for more complex queries, so the simplest solution here seems to be to modify the regex to grab the desired text. Attempts to use str.partition() resulted in other, unrelated errors.

Thank you in advance.

You could use the following regex patterns:

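For the email: /.+@.+\n/g

For the phone number: /^[+]\d+\n/gm

Drop the surrounding forward slashes and the trailing flags when using Python's re library: pass the pattern as a raw string and use re.MULTILINE in place of the m flag. Python has no g flag; re.findall() or re.finditer() returns every match.

Note that the email pattern uses only the global flag, while the phone number pattern also uses the multiline flag so that ^ matches at the start of each line. Both patterns include the trailing \n in the match, so strip it from the result.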

Simply loop over every body, capturing these details and storing them how you like.
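As a rough sketch of how that loop might look in Python: the patterns below are the two suggested above, with the trailing \n swapped for $ so the newline stays out of the match. The names EMAIL_RE, PHONE_RE, extract_contact, and the bodies list are hypothetical, and the sketch assumes "\n" line endings, as produced by the b'\n'.join() call in the question.

```python
import re

# Python has no /g flag; re.MULTILINE makes ^ and $ match at line boundaries.
EMAIL_RE = re.compile(r"^.+@.+$", re.MULTILINE)
PHONE_RE = re.compile(r"^[+]\d+$", re.MULTILINE)


def extract_contact(body):
    """Return (email, phone) for one body; either value may be None."""
    email_match = EMAIL_RE.search(body)
    phone_match = PHONE_RE.search(body)
    return (
        email_match.group(0) if email_match else None,
        phone_match.group(0) if phone_match else None,
    )


# Hypothetical sample data standing in for the parsed email bodies.
bodies = [
    "imbad@regex.com\nJohn Doe\n+16073948374\n2021-04-27T15:38:11+0000\n14904",
]
results = [extract_contact(body) for body in bodies]
print(results)
```

Returning None for a missing field leaves it to the caller to decide whether to skip or log a malformed email.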

In the comments clarifying the question, you indicated that the e-mail address is always on the first line, and the phone number is always on the 3rd line. In that case, I would just split the lines instead of trying to match them with an RE.

lines = body.split("\n")
email = lines[0]
phone = lines[2]
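A slightly more defensive variant of the same idea, as a sketch only: str.splitlines() also copes with "\r\n" line endings, and a length check avoids an IndexError on short bodies. The function name extract_by_position is hypothetical, and the fixed line positions are the assumption stated above.

```python
def extract_by_position(body):
    """Return (email, phone) from lines 1 and 3, or None if the body is too short."""
    # splitlines() handles both "\n" and "\r\n" without leaving stray "\r".
    lines = body.strip().splitlines()
    if len(lines) < 3:
        return None
    return lines[0].strip(), lines[2].strip()


sample = "imbad@regex.com\r\nJohn Doe\r\n+16073948374\r\n2021-04-27T15:38:11+0000\r\n14904"
print(extract_by_position(sample))
```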

Using regex, for example:

import re

s = """imbad@regex.com
John Doe
+16073948374
2021-04-27T15:38:11+0000
14904"""

ptrn = re.compile(r"(\w+@\w+\.[a-z]+|\+\d{11}\b)")
print(ptrn.findall(s)) 

Output:

['imbad@regex.com', '+16073948374']

To match those patterns on the 1st and the 3rd line you can use 2 capture groups using a single regex:

^([^\s@]+@[^\s@]+)\r?\n.*\r?\n(\+\d+)$

The pattern matches:

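  • ^ Start of string (re.match only tries to match at the start of the string)
  • ([^\s@]+@[^\s@]+) Capture an email-like pattern in group 1 (just a single @ on the first line)
  • \r?\n.*\r?\n Match (do not capture) the newline, the whole second line, and the newline after it
  • (\+\d+) Capture a + and 1+ digits in group 2
  • $ End of the line; this relies on re.MULTILINE, because the third line is not the end of the string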


Example

import re

regex = r"^([^\s@]+@[^\s@]+)\r?\n.*\r?\n(\+\d+)$"

s = ("imbad@regex.com\n"
     "John Doe\n"
     "+16073948374\n"
     "2021-04-27T15:38:11+0000\n"
     "14904")

match = re.match(regex, s, re.MULTILINE)

if match:
    print(f"{match.group(1)}, {match.group(2)}")

Output

imbad@regex.com, +16073948374
