Extracting HTML table data from email to csv file, 1st column values to row headers, using Python

Question

I am trying to read through an Outlook folder and get the ReceivedTime, CC, Subject, and HTMLBody of each message, with the table in the HTMLBody extracted into columns. I can 1) pull ReceivedTime, CC, Subject, and HTMLBody into a dataframe, and I can 2) extract the HTMLBody tables into a dataframe, but I am getting stuck on doing both 1) and 2) together.

Current code:

import win32com.client
import pandas as pd
from bs4 import BeautifulSoup


outlook = win32com.client.Dispatch("Outlook.Application")
mapi = outlook.GetNamespace("MAPI")

inbox = mapi.Folders['User@email.com'].Folders['Inbox'].Folders['Subfolder Name']
Mail_Messages = inbox.Items

for mail in Mail_Messages:
    receivedtime = mail.ReceivedTime.strftime('%Y-%m-%d %H:%M:%S')
    cc = mail.CC
    body = mail.HTMLBody
    html_body = BeautifulSoup(body, "lxml")
    html_tables = html_body.find_all('table')[0]

df = pd.read_html(str(html_tables),header=None)[0]
display(df)

The current dataframe is displayed below, but I also want the related ReceivedTime, CC, and Subject.

	0	1
0	Report Name	Report.pdf
1	Team Name	Team A
2	Project Name	Project A
3	Unique ID Number	123456789
4	Due Date	1/1/2021

I would also like the values in column [0] to become the header row instead, so that each email read adds one row to a dataframe that looks like this, covering all the emails in the inbox subfolder:

0	Report Name	Team Name	Project Name	Unique ID Number	Due Date	ReceivedTime	CC	Subject
1	Report.pdf	Team A	Project A	123456789	1/5/2021	1/1/2021 4:38:44 AM	User1@email.com, User2@email.com	Action Required:Report A Coming due
2
3
4

But I am getting stuck. I am still a beginner with Python, and the other posts I've seen don't quite get me to what I'm trying to do. I appreciate any and all help with this.

Answer 1

Try this:

import win32com.client
import pandas as pd
from bs4 import BeautifulSoup
from pprint import pprint

outlook = win32com.client.Dispatch("Outlook.Application")
mapi = outlook.GetNamespace("MAPI")

inbox = mapi.Folders['User@email.com'].Folders['Inbox'].Folders['Subfolder Name']
Mail_Messages = inbox.Items

# a list that will hold one row per e-mail: table values, received time, CC and subject
contents = []
column_names = ['Report Name', 'Team Name', 'Project Name', 'Unique ID Number', 'Due Date', 'ReceivedTime', 'CC', 'Subject']

for mail in Mail_Messages:

    body = mail.HTMLBody
    html_body = BeautifulSoup(body, "lxml")
    html_tables = html_body.find_all('table')

    # uncomment the following lines if you want the column names defined programmatically rather than hardcoded
    # column_names = pd.read_html(str(html_tables), header=None)[0][0]
    # column_names = column_names.tolist()
    # column_names.append("ReceivedTime")
    # column_names.append("CC")
    # column_names.append("Subject")

    # a list containing the data of a single e-mail: html table values, ReceivedTime, CC and Subject
    # (recent pandas versions warn when read_html is given a literal HTML string;
    #  wrapping it in io.StringIO(...) avoids that warning)
    row = pd.read_html(str(html_tables), header=None)[0][1]
    row = row.tolist()
    # append in the same order as column_names: ReceivedTime, CC, Subject
    row.append(mail.ReceivedTime.strftime('%Y-%m-%d %H:%M:%S'))
    row.append(mail.CC)
    row.append(mail.Subject)

    # appending each full row to a list
    contents.append(row)


# and finally converting a list into dataframe
df = pd.DataFrame(contents, columns=column_names)

pprint(df)
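
As a variation on the loop above, here is a minimal sketch that pairs each label in column 0 with its value in column 1 instead of relying on position, and then writes the result as CSV (the title asks for a csv file). It is only an illustration under some assumptions: table_to_record is a hypothetical helper name, sample_table stands in for what pd.read_html(...)[0] would return for one e-mail, and the values are the sample ones from the question. In the real loop you would pass mail.ReceivedTime, mail.CC and mail.Subject instead.

```python
import io

import pandas as pd


def table_to_record(table_df, received_time, cc, subject):
    """Hypothetical helper: one label/value table plus mail fields -> one dict."""
    # Column 0 holds the labels and column 1 the values, as in the question.
    record = dict(zip(table_df[0], table_df[1]))
    record["ReceivedTime"] = received_time
    record["CC"] = cc
    record["Subject"] = subject
    return record


# Stand-in for pd.read_html(...)[0] on one e-mail body (assumed sample data).
sample_table = pd.DataFrame(
    [
        ["Report Name", "Report.pdf"],
        ["Team Name", "Team A"],
        ["Project Name", "Project A"],
        ["Unique ID Number", "123456789"],
        ["Due Date", "1/1/2021"],
    ]
)

# In the real loop, append one record per mail item.
records = [
    table_to_record(
        sample_table,
        received_time="2021-01-01 04:38:44",
        cc="User1@email.com, User2@email.com",
        subject="Action Required:Report A Coming due",
    )
]

# A list of dicts becomes one row per e-mail, with the labels as column headers.
df = pd.DataFrame(records)

# Written to an in-memory buffer here; pass a file path to to_csv to save to disk.
csv_buffer = io.StringIO()
df.to_csv(csv_buffer, index=False)
csv_text = csv_buffer.getvalue()
print(csv_text)
```

Because the rows are built from label/value pairs, an e-mail whose table is missing a label should simply get an empty cell in that column rather than shifting the other values.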

1 answer

solution1
0 ACCEPTED 2021-04-28 21:43:20