I have a folder of ~150 emails, all saved as HTML files (Firefox extensions), and I need to capture the year that always appears on the "Sent" line, as shown in the photo below.
I tried using RegEx, but that failed: it printed no result at all, which suggests my pattern wasn't matching anything. I tried the get_payload() and message_from_string() calls from the email module, but since the files are HTML documents those failed. I then tried using BeautifulSoup to capture the entire email and parse just the "Sent" line, but that failed for reasons unknown. I am not an expert with any of these modules, so any and all help would be appreciated.
The relevant code I've tried:
for filename in os.listdir(path):
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        html_ = open(file_path, 'r').read()
        soup_ = BeautifulSoup(html, 'lxml')
        pattern = re.compile(r'Sent:/s([/d]{4})')
        txt = html.read()
        dates = pattern.findall(txt)
        if "Sent" in line:
            print("Date:", ''.join(dates))
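Your regex (I assume the forward slashes in /s and /d are just typos for the backslashes in \s and \d) does not really match the characters between Sent: and the year: it allows only a single whitespace character there. You may fix the regex as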
r'Sent:.*?\b(\d{4})\b'
Or, to account for the fact that Sent appears at the start of a line:
r'(?m)^Sent:.*?\b(\d{4})\b'
Details:
(?m)^ - start of a line
Sent: - a literal char sequence
.*? - any 0+ chars other than line break chars, as few as possible
\b(\d{4})\b - a whole word consisting of 4 digits (captured into Group 1 and thus returned as the result of re.findall).
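To show how the corrected pattern could fit into the loop from the question, here is a minimal sketch rather than a definitive implementation. The helper names find_sent_years and years_in_folder, the sample text, the UTF-8 encoding, and the choice of BeautifulSoup's built-in 'html.parser' (instead of 'lxml') are all assumptions made for illustration. It uses the unanchored pattern, because after the tags are stripped "Sent:" is not guaranteed to begin a line.

```python
import os
import re

# Pattern from the answer: the first standalone 4-digit number
# that follows "Sent:" on the same line.
SENT_YEAR = re.compile(r'Sent:.*?\b(\d{4})\b')


def find_sent_years(text):
    """Return every year captured after a 'Sent:' label in text."""
    return SENT_YEAR.findall(text)


def years_in_folder(path):
    """Map each file name in path to the years found after 'Sent:'."""
    # Third-party dependency, imported here so that find_sent_years
    # can be used without BeautifulSoup installed.
    from bs4 import BeautifulSoup

    results = {}
    for filename in os.listdir(path):
        file_path = os.path.join(path, filename)
        if not os.path.isfile(file_path):
            continue
        # Assumption: the saved emails are UTF-8; undecodable bytes are replaced.
        with open(file_path, 'r', encoding='utf-8', errors='replace') as fh:
            # get_text() drops the tags so the regex sees only visible text.
            text = BeautifulSoup(fh.read(), 'html.parser').get_text()
        results[filename] = find_sent_years(text)
    return results


# Hypothetical sample of the text extracted from one saved email.
sample = "From: Alice\nSent: Monday, March 3, 2014 10:15 AM\nTo: Bob"
print(find_sent_years(sample))  # ['2014']
```

Calling years_in_folder(path) on the folder of saved emails would then give one list of years per file. An empty list for a file suggests its "Sent:" label and date ended up on different lines after tag stripping, in which case the extracted text is worth inspecting.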