
Parsing the “Sent” line in an email

I have a folder of ~150 emails, all saved as HTML files (Firefox extensions), and I need to capture the year that always appears on the "Sent" line, as shown in the image below.

[Image: enter image description here]

I tried using a regex, but that failed: it printed no result at all, which tells me my pattern isn't matching. I tried the get_payload() and message_from_string() functions from the email module, but since the files are HTML documents, those failed too. I then tried using BeautifulSoup to capture the entire email and parse just the "Sent" line, but that also failed, for reasons I don't understand. I am not an expert with any of these modules, so any and all help would be appreciated.

The relevant code I've tried:

for filename in os.listdir(path):
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        html_ = open(file_path, 'r').read()
        soup_ = BeautifulSoup(html, 'lxml')
        pattern = re.compile(r'Sent:/s([/d]{4})')
        txt = html.read()
        dates = pattern.findall(txt)
        if "Sent" in line:
            print("Date:", ''.join(dates))

Your regex (I assume the forward slashes in /s and /d are typos for backslashes) does not match the characters between Sent: and the year. You may fix the regex as

r'Sent:.*?\b(\d{4})\b'

Or, to account for the fact that Sent: appears at the start of a line:

r'(?m)^Sent:.*?\b(\d{4})\b'

Details:

  • (?m)^ - start of a line
  • Sent: - a literal char sequence
  • .*? - any 0+ chars other than line break chars, as few as possible
  • \b(\d{4})\b - a whole word consisting of 4 digits (captured into Group 1 and thus returned as the result of re.findall).
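
To apply the pattern to the whole folder, a minimal sketch could look like the one below. It assumes the files are UTF-8 HTML and that the "Sent:" label and the date end up on the same line once the tags are stripped. The names SENT_YEAR, html_to_text, sent_years and years_in_folder, as well as the sample email, are made up for illustration. It uses the standard library's html.parser instead of BeautifulSoup only so that it runs without extra installs; BeautifulSoup(html, 'lxml').get_text() should serve the same purpose.

```python
import re
from html.parser import HTMLParser
from pathlib import Path

# First pattern from above: lazily skip to the first standalone 4-digit number after "Sent:".
SENT_YEAR = re.compile(r'Sent:.*?\b(\d{4})\b')


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML document (tags are dropped)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Return the document text with all markup removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return ''.join(parser.chunks)


def sent_years(html):
    """Return every year found on a 'Sent:' line of one HTML email."""
    return SENT_YEAR.findall(html_to_text(html))


def years_in_folder(path):
    """Map each file name in `path` to the list of years found in it."""
    results = {}
    for file_path in sorted(Path(path).iterdir()):
        if file_path.is_file():
            # errors='replace' is an assumption: it keeps going on odd encodings.
            html = file_path.read_text(encoding='utf-8', errors='replace')
            results[file_path.name] = sent_years(html)
    return results


# Invented sample that mimics a typical header block of a saved email.
sample = (
    '<p><b>From:</b> Alice<br>\n'
    '<b>Sent:</b> Monday, March 4, 2019 10:15 AM<br>\n'
    '<b>To:</b> Bob</p>'
)
print(sent_years(sample))  # ['2019']
```

years_in_folder(path) returns a dict keyed by file name, so an empty list flags a file where the pattern did not match and the "Sent" line needs a closer look.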


 