Extracting HTML table data from email to csv file, 1st column values to row headers, using Python
I am trying to read through an Outlook folder and get the ReceivedTime, CC, Subject, and HTMLBody of each message, with the table in the HTMLBody extracted into columns. I can 1) pull ReceivedTime, CC, Subject, and HTMLBody into a dataframe, and I can 2) extract the HTMLBody tables into a dataframe, but I am stuck on doing both 1) and 2) together.
Current code:
import win32com.client
import pandas as pd
from bs4 import BeautifulSoup

outlook = win32com.client.Dispatch("Outlook.Application")
mapi = outlook.GetNamespace("MAPI")
inbox = mapi.Folders['User@email.com'].Folders['Inbox'].Folders['Subfolder Name']
Mail_Messages = inbox.Items
for mail in Mail_Messages:
    receivedtime = mail.ReceivedTime.strftime('%Y-%m-%d %H:%M:%S')
    cc = mail.CC
    body = mail.HTMLBody
    html_body = BeautifulSoup(body, "lxml")
    html_tables = html_body.find_all('table')[0]
    df = pd.read_html(str(html_tables), header=None)[0]
    display(df)
The current dataframe displays as below, but I also want the related ReceivedTime, CC, and Subject.
  | 0 | 1 |
---|---|---|
0 | Report Name | Report.pdf |
1 | Team Name | Team A |
2 | Project Name | Project A |
3 | Unique ID Number | 123456789 |
4 | Due Date | 1/1/2021 |
But I would like the values in column [0] to be the header row instead, so that each email that is read produces a row in a dataframe that looks like this, for all the emails in the inbox subfolder:
0 | Report Name | Team Name | Project Name | Unique ID Number | Due Date | ReceivedTime | CC | Subject |
---|---|---|---|---|---|---|---|---|
1 | Report.pdf | Team A | Project A | 123456789 | 1/5/2021 | 1/1/2021 4:38:44 AM | User1@email.com, User2@email.com | Action Required:Report A Coming due |
2 | | | | | | | | |
3 | | | | | | | | |
4 | | | | | | | | |
But I am getting stuck. I am still a beginner with Python, and none of the other posts I've seen quite get me to what I'm trying to do. I appreciate any and all help with this.
Try this:
import win32com.client
import pandas as pd
from bs4 import BeautifulSoup
from io import StringIO
from pprint import pprint

outlook = win32com.client.Dispatch("Outlook.Application")
mapi = outlook.GetNamespace("MAPI")
inbox = mapi.Folders['User@email.com'].Folders['Inbox'].Folders['Subfolder Name']
Mail_Messages = inbox.Items
# one entry per e-mail: the table values plus ReceivedTime, CC and Subject
contents = []
column_names = ['Report Name', 'Team Name', 'Project Name', 'Unique ID Number', 'Due Date', 'ReceivedTime', 'CC', 'Subject']
for mail in Mail_Messages:
    body = mail.HTMLBody
    html_body = BeautifulSoup(body, "lxml")
    # assumes the first table in the body is the one to extract
    html_table = html_body.find_all('table')[0]
    # parse the table once; column 0 holds the labels, column 1 the values
    table = pd.read_html(StringIO(str(html_table)), header=None)[0]
    # uncomment the following lines if you want the column names defined programmatically rather than hardcoded
    # column_names = table[0].tolist()
    # column_names.append("ReceivedTime")
    # column_names.append("CC")
    # column_names.append("Subject")
    # a single e-mail's data: table values, then ReceivedTime, CC and Subject
    # (appended in the same order as column_names)
    row = table[1].tolist()
    row.append(mail.ReceivedTime.strftime('%Y-%m-%d %H:%M:%S'))
    row.append(mail.CC)
    row.append(mail.Subject)
    # append each full row to the list
    contents.append(row)
# finally, convert the list of rows into a dataframe
df = pd.DataFrame(contents, columns=column_names)
pprint(df)
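The title asks for a csv file, which the code above stops short of. As a rough sketch of the reshaping step on its own, without Outlook, the example below turns one two-column table into a single record and writes the result as CSV text. The helper name `table_to_record` and the hand-built `sample_table` are assumptions made for illustration; `sample_table` stands in for what `pd.read_html(...)[0]` would return for one message, using the sample values from the question.

```python
import pandas as pd


def table_to_record(table_df, received_time, cc, subject):
    """Turn a two-column label/value table into one flat record (a dict).

    Hypothetical helper for illustration; not part of pandas or pywin32.
    """
    # Column 0 holds the labels and column 1 the values, so zipping them
    # gives {label: value}. This is the "first column to headers" step.
    record = dict(zip(table_df[0], table_df[1]))
    # Add the per-message fields next to the table values.
    record["ReceivedTime"] = received_time
    record["CC"] = cc
    record["Subject"] = subject
    return record


# Stand-in for pd.read_html(...)[0] on one message body. A real run would
# parse mail.HTMLBody instead of building the table by hand.
sample_table = pd.DataFrame(
    [
        ["Report Name", "Report.pdf"],
        ["Team Name", "Team A"],
        ["Project Name", "Project A"],
        ["Unique ID Number", "123456789"],
        ["Due Date", "1/1/2021"],
    ]
)

# In the Outlook loop there would be one record per e-mail.
records = [
    table_to_record(
        sample_table,
        received_time="2021-01-01 04:38:44",
        cc="User1@email.com, User2@email.com",
        subject="Action Required:Report A Coming due",
    )
]

# Each dict becomes one row; the dict keys become the column headers.
df = pd.DataFrame(records)

# index=False keeps the 0, 1, 2... row labels out of the output.
# With no path given, to_csv returns the CSV as a string.
csv_text = df.to_csv(index=False)
print(csv_text)
```

To write a file instead of a string, pass a path, for example `df.to_csv("emails.csv", index=False)` (the file name is only a placeholder). Building a dict per message means the column names come from the table itself, so the value order and the header order cannot drift apart the way two separately maintained lists can.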