Parsing the “Sent” line in an email

Question

I have a folder of ~150 emails, all saved as HTML files (Firefox extensions), and I need to capture the year that is always found on the "Sent" line, as shown in the photo below.

I tried using RegEx but that failed; it wouldn't print any result at all, which suggested to me that my RegEx wasn't working. I tried using the get_payload() and message_from_string() functions from the email module, but since it's an HTML document those failed. I then tried using BeautifulSoup to capture the entire email and then parse just the "Sent" line, but that failed for reasons unknown. I am not an expert with any of these modules, so any and all help would be appreciated.

The relevant code I've tried:

for filename in os.listdir(path):
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        html_ = open(file_path, 'r').read()
        soup_ = BeautifulSoup(html, 'lxml')
        pattern = re.compile(r'Sent:/s([/d]{4})')
        txt = html.read()
        dates = pattern.findall(txt)
        if "Sent" in line:
            print("Date:", ''.join(dates))

Answer 1

Your regex (I think the forward slashes in /s and /d are just typos for \s and \d) does not really match the text between Sent: and the year: it allows only a single whitespace character there. You may fix the regex as

r'Sent:.*?\b(\d{4})\b'

Or, to account for the fact that Sent: appears at the start of a line:

r'(?m)^Sent:.*?\b(\d{4})\b'

Details:

(?m)^ - start of a line
Sent: - a literal char sequence
.*? - any 0+ chars other than line break chars, as few as possible
\b(\d{4})\b - a whole word consisting of 4 digits (captured into Group 1 and thus returned as the result of re.findall).
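
As a rough illustration of how the two patterns behave, here is a minimal sketch using only the standard library. The sample strings, the helper names find_sent_years and years_in_folder, and the UTF-8 / errors="replace" decoding are all assumptions made for the example; they are not taken from the question's files. The patterns are applied to the raw file text rather than to BeautifulSoup output.

```python
import re
from pathlib import Path

# Lazy match from "Sent:" to the first standalone 4-digit number on the same line.
SENT_YEAR = re.compile(r'Sent:.*?\b(\d{4})\b')
# Stricter variant: "Sent:" must be the first thing on a line.
SENT_YEAR_LINE_START = re.compile(r'(?m)^Sent:.*?\b(\d{4})\b')


def find_sent_years(text):
    """Return every year captured after a "Sent:" marker in text."""
    return SENT_YEAR.findall(text)


def years_in_folder(folder):
    """Map each file name in folder to the list of years found in that file."""
    results = {}
    for file_path in sorted(Path(folder).iterdir()):
        if file_path.is_file():
            # Assumption: files are roughly UTF-8; undecodable bytes are replaced.
            text = file_path.read_text(encoding="utf-8", errors="replace")
            results[file_path.name] = find_sent_years(text)
    return results


# Hypothetical samples, made up for this sketch.
plain_sample = "From: a@example.com\nSent: Tuesday, January 24, 2017 7:01 PM\nSubject: hi\n"
html_sample = "<p><b>Sent:</b> Tuesday, January 24, 2017 7:01 PM<br></p>"

print(find_sent_years(plain_sample))
print(find_sent_years(html_sample))
print(SENT_YEAR_LINE_START.findall(html_sample))
```

Note that in raw HTML the "Sent:" text is often preceded by a tag on the same line, in which case the (?m)^ variant finds nothing while the unanchored pattern still does. If that applies to your files, use the first pattern or strip the markup before matching.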

solution1
2 ACCEPTED 2017-01-24 19:01:19
