Extracting data from an HTML file using a C# script

What I need to do: Extract the From, To, Cc and Subject information from an HTML file and remove it from the file, without using any third-party library (HtmlAgilityPack, etc.).

What I am having trouble with: What approach should I take to get those fields (From, To, Cc, Subject) out of the HTML tags?

Steps I tried: I tried to get the index of <p class=MsoNormal> and the last index of the email @sampleemail.com, but I think that is a bad approach, since some HTML files will contain many <p class=MsoNormal> tags. To remove the From, To, Cc and Subject, I just used string.Remove(indexOf, count), where count is the number of characters from indexOf to lastIndexOf, and it worked.

Sample tag containing the From information:

<p class=MsoNormal style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>1234@sampleemail.com<o:p></o:p></span></p>                                     

HTML file output:

[Image: HTML file output]

HtmlAgilityPack is your friend. Simply use an XPath expression like //p[@class ='MsoNormal'] to get the content of the matching tags in the HTML.

using System;
using HtmlAgilityPack; // third-party NuGet package

public static class Program
{
    public static void Main()
    {
        var html =
        @"<p class=MsoNormal style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>1234@sampleemail.com<o:p></o:p></span></p>                                     ";

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        // SelectNodes returns null when no node matches the XPath.
        var nodes = htmlDoc.DocumentNode.SelectNodes("//p[@class ='MsoNormal']");
        if (nodes == null)
            return;

        // InnerText is the paragraph's text with all nested tags dropped.
        foreach (var node in nodes)
            Console.WriteLine(node.InnerText);
    }
}

Result:

From:1234@sampleemail.com
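
To turn a line like the one above into a field name and a value, one option is to split it at the first colon. This is only a minimal sketch using the base class library; the HeaderText class and the TrySplit method are hypothetical names, not part of HtmlAgilityPack. Splitting at the first colon (rather than the last) keeps subjects such as "RE: status" intact.

```csharp
using System;

public static class HeaderText
{
    // Splits text such as "From:1234@sampleemail.com" at the FIRST colon.
    // Returns false when there is no colon or nothing before it.
    public static bool TrySplit(string innerText, out string label, out string value)
    {
        label = string.Empty;
        value = string.Empty;
        if (innerText == null)
            return false;

        int colon = innerText.IndexOf(':');
        if (colon <= 0)
            return false;

        label = innerText.Substring(0, colon).Trim();
        value = innerText.Substring(colon + 1).Trim();
        return label.Length > 0;
    }

    public static void Main()
    {
        string label, value;
        if (TrySplit("From:1234@sampleemail.com", out label, out value))
            Console.WriteLine(label + " = " + value);
    }
}
```

In the foreach loop, node.InnerText would be the argument passed to TrySplit.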

Update

We can use Regex to write this simple parser. But remember that it cannot handle every case in a complicated HTML document.

    using System;
    using System.Text.RegularExpressions;

    public static void MainFunc()
    {
        string str = @"<p class='MsoNormal' style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>1234@sampleemail.com<o:p></o:p></span></p>                                     ";
        // Strip every tag, allowing quoted attribute values, so that only the text remains.
        var result = Regex.Replace(str, "<(?:\"[^\"]*\"['\"]*|'[^']*'['\"]*|[^'\">])+>", "");
        Console.WriteLine(result);
    }
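
The same idea might be stretched to cover the whole requirement from the question: read From, To, Cc and Subject and remove those paragraphs, using only the base class library. The following is a rough sketch under stated assumptions, not a robust HTML parser. It assumes each header sits in its own <p class=MsoNormal> paragraph as in the sample, and that no body paragraph starts with one of those four labels. MailHeaderExtractor and Extract are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class MailHeaderExtractor
{
    // One <p class=MsoNormal ...>...</p> block; the class value may be bare or quoted.
    private static readonly Regex Paragraph = new Regex(
        @"<p\s+class=['""]?MsoNormal\b['""]?[^>]*>(?<body>.*?)</p>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Any tag left inside the paragraph body.
    private static readonly Regex Tag = new Regex(@"<[^>]+>");

    // A header label at the start of the paragraph text, then its value.
    private static readonly Regex Header = new Regex(
        @"^\s*(?<name>From|To|Cc|Subject)\s*:\s*(?<value>.*?)\s*$",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Stores each header found in 'fields' and returns the HTML
    // with those header paragraphs removed.
    public static string Extract(string html, IDictionary<string, string> fields)
    {
        return Paragraph.Replace(html, m =>
        {
            // Reduce the paragraph to plain text and decode entities such as &nbsp;.
            string text = WebUtility.HtmlDecode(Tag.Replace(m.Groups["body"].Value, ""));
            Match h = Header.Match(text);
            if (!h.Success)
                return m.Value;          // not a header paragraph: keep it unchanged
            fields[h.Groups["name"].Value] = h.Groups["value"].Value;
            return string.Empty;         // header paragraph: drop it from the output
        });
    }

    public static void Main()
    {
        string html =
            "<p class=MsoNormal><b>From:</b><span>1234@sampleemail.com</span></p>" +
            "<p class=MsoNormal><b>Subject:</b><span>RE: status</span></p>" +
            "<p class=MsoNormal>Hello, this is the body.</p>";

        var fields = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        string cleaned = Extract(html, fields);

        Console.WriteLine(fields["From"]);
        Console.WriteLine(fields["Subject"]);
        Console.WriteLine(cleaned);
    }
}
```

Matching whole paragraphs and checking their text avoids counting characters between IndexOf and LastIndexOf, so it keeps working when the file contains many MsoNormal paragraphs. Like any regex over HTML, it can still fail on input shaped differently from the sample.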
