Extracting HTML table data from email to csv file, 1st column values to row headers, using Python

I am trying to read through an Outlook folder and get the ReceivedTime, CC, Subject, and HTMLBody of each message, with the table in the HTMLBody extracted into columns. I can 1) pull ReceivedTime, CC, Subject, and HTMLBody into a dataframe, and I can 2) extract the HTMLBody tables into a dataframe, but I am getting stuck on doing both 1) and 2) together.

Current code:

import win32com.client
import pandas as pd
from bs4 import BeautifulSoup


outlook = win32com.client.Dispatch("Outlook.Application")
mapi = outlook.GetNamespace("MAPI")

inbox = mapi.Folders['User@email.com'].Folders['Inbox'].Folders['Subfolder Name']
Mail_Messages = inbox.Items

for mail in Mail_Messages:
    receivedtime = mail.ReceivedTime.strftime('%Y-%m-%d %H:%M:%S')
    cc = mail.CC
    body = mail.HTMLBody
    html_body = BeautifulSoup(body, "lxml")
    html_tables = html_body.find_all('table')[0]

df = pd.read_html(str(html_tables),header=None)[0]
display(df)

The current dataframe is displayed below, but I also want the related ReceivedTime, CC, and Subject.

   0                 1
0  Report Name       Report.pdf
1  Team Name         Team A
2  Project Name      Project A
3  Unique ID Number  123456789
4  Due Date          1/1/2021

But I would like column [0] to be the row headers instead, so that reading each email produces a dataframe that looks like this, for all the emails in the inbox subfolder:

   Report Name  Team Name  Project Name  Unique ID Number  Due Date  ReceivedTime         CC                                Subject
1  Report.pdf   Team A     Project A     123456789         1/5/2021  1/1/2021 4:38:44 AM  User1@email.com, User2@email.com  Action Required:Report A Coming due
2
3
4

But I am getting stuck. I am still a beginner in Python, and none of the other posts I've seen quite get me to what I'm trying to do. I appreciate any and all help with this.

Try this:

import win32com.client
import pandas as pd
from bs4 import BeautifulSoup
from pprint import pprint

outlook = win32com.client.Dispatch("Outlook.Application")
mapi = outlook.GetNamespace("MAPI")

inbox = mapi.Folders['User@email.com'].Folders['Inbox'].Folders['Subfolder Name']
Mail_Messages = inbox.Items

# a list that will hold one row per e-mail: table values, ReceivedTime, CC and Subject
contents = []
column_names = ['Report Name', 'Team Name', 'Project Name', 'Unique ID Number', 'Due Date', 'ReceivedTime', 'CC', 'Subject']

for mail in Mail_Messages:

    body = mail.HTMLBody
    html_body = BeautifulSoup(body, "lxml")
    html_tables = html_body.find_all('table')

    # uncomment the following lines if you want the column names defined programmatically rather than hardcoded
    # column_names = pd.read_html(str(html_tables), header=None)[0][0]
    # column_names = column_names.tolist()
    # column_names.append("ReceivedTime")
    # column_names.append("CC")
    # column_names.append("Subject")

    # a list containing the data of a single e-mail: html table values, ReceivedTime, CC and Subject
    row = pd.read_html(str(html_tables), header=None)[0][1]
    row = row.tolist()
    # append in the same order as the last three entries of column_names
    row.append(mail.ReceivedTime.strftime('%Y-%m-%d %H:%M:%S'))
    row.append(mail.CC)
    row.append(mail.Subject)

    # appending each full row to a list
    contents.append(row)


# and finally converting a list into dataframe
df = pd.DataFrame(contents, columns=column_names)

pprint(df)
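
If the labels in the first column can differ between e-mails, each table could instead be turned into a dictionary keyed by those labels, which avoids hardcoding column_names. The following is only a minimal sketch and not part of the original answer: table_to_record is a hypothetical helper, sample_table stands in for what pd.read_html(...)[0] would return (filled with the sample values from the question), and the CSV is written to an in-memory buffer purely for illustration.

```python
import io

import pandas as pd


def table_to_record(table_df, received_time, cc, subject):
    """Turn a two-column label/value table into one flat record (hypothetical helper)."""
    # Column 0 holds the labels, column 1 holds the matching values.
    record = dict(zip(table_df[0], table_df[1]))
    record["ReceivedTime"] = received_time
    record["CC"] = cc
    record["Subject"] = subject
    return record


# Stand-in for pd.read_html(...)[0] on a single e-mail body (assumed sample data).
sample_table = pd.DataFrame([
    ["Report Name", "Report.pdf"],
    ["Team Name", "Team A"],
    ["Project Name", "Project A"],
    ["Unique ID Number", "123456789"],
    ["Due Date", "1/1/2021"],
])

# One record per e-mail; in the real loop the last three arguments would come
# from mail.ReceivedTime, mail.CC and mail.Subject.
records = [
    table_to_record(
        sample_table,
        "2021-01-01 04:38:44",
        "User1@email.com, User2@email.com",
        "Action Required:Report A Coming due",
    )
]

# pandas builds the columns from the dictionary keys, one row per record.
df = pd.DataFrame(records)

# Write the result as CSV; a file path could be passed instead of the buffer.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

Inside the loop above, records.append(table_to_record(...)) would take the place of the row-building lines. Because pandas aligns records by key, an e-mail whose table lacks one of the labels would get an empty (NaN) cell in that column rather than shifting the other values.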


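Notice: Technical posts on this site follow the CC BY-SA 4.0 license. If you repost this content, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.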
