
Text mining of emails

I have a set of emails in a text file. I want to extract the body from each of them. A sample document is shown below.

Email: 1
 ===============


  MIME-Version: 1.0
  Received: by 10.68.8.6 with HTTP; Sat, 7 Apr 2012 01:04:45 -0700 (PDT)
  Date: Sat, 7 Apr 2012 13:34:45 +0530
  Delivered-To: twistyprincess22@gmail.com
  Message-ID: <CAGibXq7_Gjqmp=jOCu2X8+Xngb5QuoqqMQ_ZKbu9jHCoJnFYgA@mail.gmail.com>
  Subject: hello
  From: twisty princess <twistyprincess22@gmail.com>
  To: twisty princess <twistyprincess22@gmail.com>
   Content-Type: multipart/alternative; boundary=047d7b33d826e6762004bd1239b5
  --047d7b33d826e6762004bd1239b5            
  Content-Type: text/plain; charset=ISO-8859-1

   hey How are you doing?

   --047d7b33d826e6762004bd1239b5       
    Content-Type: text/html; charset=ISO-8859-1

     <br><br>hey How are you doing?<br>

     --047d7b33d826e6762004bd1239b5--

So from this text, I just want "hey How are you doing?". I want this done using regular expressions and C#. Thanks.

Use the regex boundary=([^\s]+) to find the boundary name:

// _boundaryRegex is new Regex(@"boundary=([^\s]+)"); group 1 holds the boundary value
var bname = _boundaryRegex.Match(text).Groups[1].Value;

Then build the text-capturing regex from bname:

var textCapturer = new Regex(
    string.Format("--{0}(?<text>.*?)(?=--)", Regex.Escape(bname)),
    RegexOptions.Singleline); // Singleline lets '.' match across line breaks
foreach (Match match in textCapturer.Matches(text))
{
    Console.WriteLine(match.Groups["text"].Value);
}

It finds the value of the boundary parameter and then tries to match the text between the --BOUNDARY lines.
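Putting the two steps together, a minimal sketch might look like the following. The class and method names (EmailBodyExtractor, ExtractPlainText) are hypothetical, and the sketch assumes the input string holds a single email laid out like the sample in the question. Selecting the text/plain part and cutting its header off at the first blank line are additions made for illustration; they are not part of the steps described above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmailBodyExtractor
{
    // Matches the boundary parameter of the multipart Content-Type header.
    private static readonly Regex BoundaryRegex = new Regex(@"boundary=([^\s]+)");

    // Returns the body of the text/plain part, or null if none is found.
    // Assumption: 'text' contains exactly one multipart email.
    public static string ExtractPlainText(string text)
    {
        Match boundary = BoundaryRegex.Match(text);
        if (!boundary.Success) return null;

        // The boundary value may be wrapped in double quotes; strip them.
        string bname = boundary.Groups[1].Value.Trim('"');

        // Singleline lets '.' span line breaks; Regex.Escape guards against
        // regex metacharacters inside the boundary value.
        var textCapturer = new Regex(
            string.Format("--{0}(?<text>.*?)(?=--)", Regex.Escape(bname)),
            RegexOptions.Singleline);

        foreach (Match match in textCapturer.Matches(text))
        {
            string part = match.Groups["text"].Value;
            if (!part.Contains("text/plain")) continue;

            // The part's headers end at the first blank line; the body follows.
            Match blank = Regex.Match(part, @"\r?\n[ \t]*\r?\n");
            if (blank.Success)
            {
                return part.Substring(blank.Index + blank.Length).Trim();
            }
        }
        return null;
    }
}
```

Calling EmailBodyExtractor.ExtractPlainText on the contents of one email from the file should then yield only the plain-text body rather than the whole part with its Content-Type line.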

That said, I don't recommend doing this kind of parsing with regex.
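One limitation of the pattern above is that the (?=--) lookahead stops at the first "--" it meets, so a body that itself contains two consecutive hyphens would be cut short. The answer does not say which alternative it has in mind; as one possibility, the sketch below swaps the regex for plain string operations and splits on the full boundary delimiter. The names EmailBodySplitter and ExtractPlainText are hypothetical, and the input is again assumed to be a single email shaped like the sample.

```csharp
using System;

public static class EmailBodySplitter
{
    // Returns the body of the text/plain part, or null if none is found.
    // Assumption: 'text' contains exactly one multipart email.
    public static string ExtractPlainText(string text)
    {
        const string marker = "boundary=";
        int start = text.IndexOf(marker, StringComparison.OrdinalIgnoreCase);
        if (start < 0) return null;
        start += marker.Length;

        // The boundary value runs up to the next whitespace character or ';'.
        int end = text.IndexOfAny(new[] { ' ', '\t', '\r', '\n', ';' }, start);
        if (end < 0) end = text.Length;
        string delimiter = "--" + text.Substring(start, end - start).Trim('"');

        string[] parts = text.Split(new[] { delimiter }, StringSplitOptions.None);

        // parts[0] is everything before the first boundary line; skip it.
        for (int p = 1; p < parts.Length; p++)
        {
            string part = parts[p].Replace("\r\n", "\n");
            if (part.IndexOf("text/plain", StringComparison.OrdinalIgnoreCase) < 0) continue;

            string[] lines = part.Split('\n');
            int idx = 0;
            // Skip what is left of the boundary line (whitespace only).
            while (idx < lines.Length && lines[idx].Trim().Length == 0) idx++;
            // Skip the part's header lines, up to the first blank line.
            while (idx < lines.Length && lines[idx].Trim().Length != 0) idx++;

            // Everything after the blank line is the body.
            return string.Join("\n", lines, idx, lines.Length - idx).Trim();
        }
        return null;
    }
}
```

For anything beyond a quick extraction (encoded bodies, nested multiparts, folded headers), a dedicated MIME parser is likely the safer choice.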


 