Convert mailbox message to PDF: which part?
I am trying to write a script that will export all my messages (mailbox mbox format) into PDF files with pdfkit.
It seems that all messages in my mailbox are multipart, and I'm struggling to figure out which part is the relevant one. If I iterate through all parts with the code below, I typically generate 3 to 5 PDFs per e-mail, and only one of them resembles what I would see if I opened the e-mail in an e-mail client. The other parts are typically either raw text or something that looks like this: x92O&S\\xd2\\x0c\\xb4e\\xee\\x0fh\\xc68\\x1 (hexadecimal?).
I tried to solve the issue by adding a test that filters for HTML ( if bool(BeautifulSoup(html, "html.parser").find()) ), but this does not seem to work.
    for part in message.walk():
        partcounter +=1
        try:
            html = str(part.get_payload(decode=True))
            if bool(BeautifulSoup(html, "html.parser").find()):
                print(str(messagecounter)+'-'+str(partcounter)+' - '+"payload is HTML")
                filename = 'C:/Email_forwarding/Attachments/'+str(messagecounter)+"-"+str(partcounter)+'.pdf'#this keeps the file only for the last part, which seems to be correct
                pdfkit.from_string(html,filename, configuration=config)
                print(str(messagecounter)+'-'+str(partcounter)+' - '+"created %s" %(filename))
            else:
                print(str(messagecounter)+'-'+str(partcounter)+' - '+"payload is not HTML")
        except:
            print(str(messagecounter)+'-'+str(partcounter)+' - '+"no payload or failed to convert")
How can I detect which part of a multipart e-mail contains actual, interpretable HTML?
You can use part.get_content_type() to filter through the different parts of the message:
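    for part in message.walk():
        if part.get_content_type() == 'text/html':
            # get_payload(decode=True) returns bytes; decode them rather than calling str(),
            # which would yield the "b'...'" repr full of \x escapes
            charset = part.get_content_charset() or 'utf-8'
            html = part.get_payload(decode=True).decode(charset, errors='replace')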
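To make the content-type filter easier to try out, here is a minimal, self-contained sketch using only the standard library. The helper name extract_html, the sample message, and the UTF-8 fallback charset are assumptions made for illustration; they are not part of the original question or answer. It also skips parts marked as attachments, which is one plausible way to avoid the binary-looking parts described in the question.

```python
from email.message import EmailMessage


def extract_html(message):
    """Return the first text/html body part as a str, or None if there is none."""
    for part in message.walk():
        # Skip multipart containers and anything flagged as an attachment.
        if part.is_multipart() or part.get_content_disposition() == "attachment":
            continue
        if part.get_content_type() == "text/html":
            payload = part.get_payload(decode=True)  # bytes, transfer encoding undone
            charset = part.get_content_charset() or "utf-8"  # assumed fallback
            return payload.decode(charset, errors="replace")
    return None


# Hypothetical sample message, built only to exercise the helper.
sample = EmailMessage()
sample["Subject"] = "demo"
sample.set_content("plain text body")
sample.add_alternative("<html><body><p>Hello</p></body></html>", subtype="html")
sample.add_attachment(b"\x92O&S\xd2\x0c", maintype="application",
                      subtype="octet-stream", filename="blob.bin")

html = extract_html(sample)
print(html)
```

With messages read from an mbox file, the same helper should work on each message object, and the returned string (when not None) could then be passed to pdfkit.from_string as in the question, producing one PDF per e-mail instead of one per part.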