
How to extract the body from a text file containing an e-mail [Enron Data Set]

I have the Enron e-mail data set as a folder that contains e-mails in the form of text files, and I want to extract the "body" part of those e-mails.

The problem is that fields like the sender's and receiver's e-mail addresses are marked by To:, From:, etc., but the body does not start with any heading; it simply starts after all the other fields have been specified.

A text file can also contain many bodies (in the case of e-mail threads/conversations), and I want to extract the body (or bodies) from these files. Can the JavaMail API be used? If yes, how? This is just an offline data set, stored as text files on my hard disk drive, not on the Internet.

The file looks like this:

    Message-ID: <16159836.1075855377439.JavaMail.evans@thyme>
    Date: Fri, 7 Dec 2001 10:06:42 -0800 (PST)
    From: heather.dunton@enron.com
    To: k..allen@enron.com
    Subject: RE: West Position
    Mime-Version: 1.0
    Content-Type: text/plain; charset=us-ascii
    Content-Transfer-Encoding: 7bit
    X-From: Dunton, Heather </O=ENRON/OU=NA/CN=RECIPIENTS/CN=HDUNTON>
    X-To: Allen, Phillip K. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Pallen>
    X-cc:
    X-bcc:
    X-Folder: \\Phillip_Allen_Jan2002_1\\Allen, Phillip K.\\Inbox
    X-Origin: Allen-P
    X-FileName: pallen (Non-Privileged).pst

    Please let me know if you still need Curve Shift.

    Thanks,
    Heather

    -----Original Message-----
    From: Allen, Phillip K.
    Sent: Friday, December 07, 2001 5:14 AM
    To: Dunton, Heather
    Subject: RE: West Position

    Heather,

    Did you attach the file to this email?

    -----Original Message-----
    From: Dunton, Heather
    Sent: Wednesday, December 05, 2001 1:43 PM
    To: Allen, Phillip K.; Belden, Tim
    Subject: FW: West Position

    Attached is the Delta position for 1/16, 1/30, 6/19, 7/13, 9/21

    -----Original Message-----
    From: Allen, Phillip K.
    Sent: Wednesday, December 05, 2001 6:41 AM
    To: Dunton, Heather
    Subject: RE: West Position

    Heather,

    This is exactly what we need. Would it possible to add the prior day for each of the dates below to the pivot table. In order to validate the curve shift on the dates below we also need the prior days ending positions.

    Thank you,

    Phillip Allen

    -----Original Message-----
    From: Dunton, Heather
    Sent: Tuesday, December 04, 2001 3:12 PM
    To: Belden, Tim; Allen, Phillip K.
    Cc: Driscoll, Michael M.
    Subject: West Position

    Attached is the Delta position for 1/18, 1/31, 6/20, 7/16, 9/24

    << File: west_delta_pos.xls >>

    Let me know if you have any questions.

    Heather

Please provide an example file, the most complex one if possible. The job would be to programmatically open every file, parse its content, and extract the e-mail bodies. Where do you want to store them afterwards? Which OS are you running?
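The job described in the comment above (open every file, parse it, extract the bodies) could be sketched in plain Java roughly as follows. This is only a sketch under assumptions: that the header block ends at the first blank line, that each quoted message in a thread is introduced by a `-----Original Message-----` line as in the sample file, and that each quoted message starts with single-line From:/Sent:/To:/Cc:/Subject: fields. The class and method names (`EnronBodyExtractor`, `threadBodies`, `extractAll`) are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper class; all names here are illustrative only.
class EnronBodyExtractor {

    // Assumed marker line that introduces each quoted older message.
    private static final Pattern ORIGINAL_MESSAGE =
            Pattern.compile("(?m)^[ \\t]*-----Original Message-----[ \\t]*$");

    // Assumed single-line header fields repeated at the top of a quoted message.
    private static final Pattern QUOTED_HEADER =
            Pattern.compile("^(From|Sent|To|Cc|Bcc|Subject):.*", Pattern.CASE_INSENSITIVE);

    // Everything after the first blank line, which ends the header block.
    static String topLevelBody(String rawMessage) {
        String text = rawMessage.replace("\r\n", "\n");
        int headerEnd = text.indexOf("\n\n");
        return headerEnd < 0 ? "" : text.substring(headerEnd + 2);
    }

    // Drops leading blank lines and quoted header lines from one thread part.
    static String stripQuotedHeaders(String part) {
        String[] lines = part.split("\n");
        int i = 0;
        while (i < lines.length
                && (lines[i].trim().isEmpty()
                    || QUOTED_HEADER.matcher(lines[i].trim()).matches())) {
            i++;
        }
        return String.join("\n", Arrays.asList(lines).subList(i, lines.length));
    }

    // One entry per message in the thread, in file order (newest first).
    static List<String> threadBodies(String rawMessage) {
        List<String> bodies = new ArrayList<>();
        String[] parts = ORIGINAL_MESSAGE.split(topLevelBody(rawMessage));
        for (int i = 0; i < parts.length; i++) {
            String part = (i == 0) ? parts[i] : stripQuotedHeaders(parts[i]);
            if (!part.trim().isEmpty()) {
                bodies.add(part.trim());
            }
        }
        return bodies;
    }

    // Walks the folder recursively and extracts the bodies of every regular file.
    static Map<Path, List<String>> extractAll(Path folder) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(folder)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        Map<Path, List<String>> result = new LinkedHashMap<>();
        for (Path file : files) {
            // ISO-8859-1 maps every byte to a character, so decoding cannot fail.
            String raw = new String(Files.readAllBytes(file), StandardCharsets.ISO_8859_1);
            result.put(file, threadBodies(raw));
        }
        return result;
    }
}
```

Calling `EnronBodyExtractor.extractAll(Paths.get("maildir"))` (the folder name is a placeholder) would return a map from each file to its list of bodies. A body whose first line happens to look like one of the quoted header fields would be trimmed by mistake, so treat this as a starting point rather than a complete parser.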

If each file is a single message in MIME format, you can use the JavaMail MimeMessage constructor that takes an InputStream. You can then use the JavaMail APIs to extract the contents of the message. See the JavaMail FAQ, javadocs, web site, specification, etc.
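A minimal sketch of that approach might look like the following. It assumes the classic `javax.mail` package is on the classpath (newer releases use `jakarta.mail` instead) and that the message is a single `text/plain` part, as in the sample file; the class and method names (`MimeBodyReader`, `readBody`) are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;
import javax.mail.MessagingException;
import javax.mail.Session;
import javax.mail.internet.MimeMessage;

// Hypothetical helper class; the name is illustrative only.
class MimeBodyReader {

    // Parses one file as a MIME message and returns its text body.
    static String readBody(Path file) throws IOException, MessagingException {
        // No mail server is involved, so an empty Properties object is enough.
        Session session = Session.getInstance(new Properties());
        try (InputStream in = Files.newInputStream(file)) {
            MimeMessage message = new MimeMessage(session, in);
            Object content = message.getContent();
            // A text/plain message yields a String; multipart handling is omitted here.
            return content instanceof String ? (String) content : "";
        }
    }
}
```

Note that the quoted messages of a thread are not separate MIME parts, so the returned string still contains every `-----Original Message-----` section as plain text and would need to be split separately.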


 