Unable to Parse Email in Python

I have a set of .msg files stored on my E:/ drive that I need to read and extract some information from. For that I am using the code below in Python 3.6.

from email.parser import Parser
with open(r"E:\Downloads\Test1.msg", encoding="ISO-8859-1") as fp:
    headers = Parser().parse(fp)

print('To: %s' % headers['To'])
print('From: %s' % headers['From'])
print('Subject: %s' % headers['subject'])

This is the output I get.

To: None
From: None
Subject: None

Process finished with exit code 0

I am not getting the actual values of the To, From and Subject fields.

Any thoughts on why it is not printing the actual values?

My sample .msg file looks like this.

From: Bournemouth.wmt@gmail.com
To: Francis.dell@gmail.com
Subject: orderid: ord1234, circtid: cr1234


Charges:
Annual Charge - 10
Excess Charges - 5

From this message I am trying to extract the order id and circuit id from the subject, and the charges from the mail body.


Thanks

This is the body of the file that you posted on pastebin for us.

From: ratankumar.shivratri@TechM.com <ratankumar.shivratri@TechM.com>
Sent: Thursday, January 4, 2018 11:58 AM
To: Ratankumar Shivratri
Subject: Cct Id: ONE211, eCo order No: 1CTRP

Charges:

Annual rental - 2,125.00

Maintenance charge - 0.00



Regards

Ratan.

I've been able to obtain data from the headers using the following code.

>>> from email.parser import Parser
>>> p = Parser()
>>> msg = p.parse(open('ratan.msg'))
>>> msg['To']
'Ratankumar Shivratri'
>>> msg['From']
'ratankumar.shivratri@TechM.com <ratankumar.shivratri@TechM.com>'
>>> msg['Subject']
'Cct Id: ONE211, eCo order No: 1CTRP\n '

So that much works.
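For reference, here is a minimal sketch of the same steps as a script that also reads the body. It assumes the file really is plain text in the layout shown above; the `raw` string is only a stand-in for the file contents, and in practice you would pass an open file to `Parser().parse()` as in the session above.

```python
from email.parser import Parser

# Stand-in for the file contents (assumption: the .msg file is plain text
# laid out like the pastebin sample above).
raw = (
    "From: ratankumar.shivratri@TechM.com <ratankumar.shivratri@TechM.com>\n"
    "Sent: Thursday, January 4, 2018 11:58 AM\n"
    "To: Ratankumar Shivratri\n"
    "Subject: Cct Id: ONE211, eCo order No: 1CTRP\n"
    "\n"
    "Charges:\n"
    "\n"
    "Annual rental - 2,125.00\n"
    "\n"
    "Maintenance charge - 0.00\n"
)

# parsestr() parses text that is already in memory.
msg = Parser().parsestr(raw)

# Header lookup is case-insensitive; strip() drops any trailing whitespace.
subject = msg['Subject'].strip()

# For a simple, non-multipart message get_payload() returns the body as a string.
body = msg.get_payload()

print('To: %s' % msg['To'])
print('Subject: %s' % subject)
print(body)
```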

The next problem I foresee is that the format of the Subject header seems to be inconsistent across messages. For instance, in the message in your question the subject is 'orderid: ord1234, circtid: cr1234', but in this message it's 'Cct Id: ONE211, eCo order No: 1CTRP'. You want to be able to recover the order id and circuit id from messages, but these items don't appear in the same form in every message.

If they did, you could probably ferret them out with a regex.
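As a rough sketch of that idea, the patterns below cover only the two subject layouts seen in this thread and the "name - amount" charge lines from the sample bodies. The pattern names and the helpers `extract_ids` and `extract_charges` are hypothetical, and other subject formats would need their own alternatives.

```python
import re

# Assumption: only these two label variants occur
# ("orderid" / "eCo order No" and "circtid" / "Cct Id").
ORDER_RE = re.compile(r"(?:orderid|eCo order No)\s*:\s*(\w+)", re.IGNORECASE)
CIRCUIT_RE = re.compile(r"(?:circtid|Cct Id)\s*:\s*(\w+)", re.IGNORECASE)

# A charge line is assumed to look like "Annual rental - 2,125.00".
CHARGE_RE = re.compile(
    r"^[ \t]*(.+?)[ \t]+-[ \t]+(\d[\d,]*(?:\.\d+)?)[ \t]*$",
    re.MULTILINE,
)


def extract_ids(subject):
    """Return (order_id, circuit_id); either is None when not found."""
    order = ORDER_RE.search(subject)
    circuit = CIRCUIT_RE.search(subject)
    return (
        order.group(1) if order else None,
        circuit.group(1) if circuit else None,
    )


def extract_charges(body):
    """Map each charge name to its amount, ignoring thousands separators."""
    return {
        name: float(amount.replace(",", ""))
        for name, amount in CHARGE_RE.findall(body)
    }


print(extract_ids("orderid: ord1234, circtid: cr1234"))
print(extract_ids("Cct Id: ONE211, eCo order No: 1CTRP\n "))
print(extract_charges("Charges:\n\nAnnual rental - 2,125.00\n\nMaintenance charge - 0.00\n"))
```

A subject that matches neither layout yields `None` for the missing id rather than raising, so you can log those messages and extend the patterns as new formats turn up.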
