Extracting data from an HTML file using a C# script

Question

What I need to do: Extract the From, To, Cc and Subject information and remove it from the HTML file, without using any third-party library (HtmlAgilityPack, etc.).

What I am having trouble with: What approach should I take to get the From, To, Cc and Subject values out of the HTML tags?

Steps I tried: I tried to get the index of  and the last index of the email @sampleemail.com, but I think that is a bad approach, since some HTML files will contain a lot of "  ". To remove the From, To, Cc and Subject, I just used string.Remove(indexOf, count), where count is the number of characters from indexOf to lastIndexOf, and it worked.

Sample tag containing the From information:

<p class=MsoNormal style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>1234@sampleemail.com<o:p></o:p></span></p>

HTML file output:

Answer 1

HtmlAgilityPack is your friend. Simply use an XPath expression such as //p[@class ='MsoNormal'] to get the content of the matching tags in the HTML.

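using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class Program
{
    public static void Main()
    {
        var html =
            @"<p class=MsoNormal style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>1234@sampleemail.com<o:p></o:p></span></p>";

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = htmlDoc.DocumentNode.SelectNodes("//p[@class ='MsoNormal']");
        if (nodes == null)
            return;

        // InnerText is the text of the node and its descendants, without the markup.
        foreach (var node in nodes)
            Console.WriteLine(node.InnerText);
    }
}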

Result:

From:1234@sampleemail.com

Update

If a third-party library is not an option, we can use Regex to write this simple parser. But remember that it cannot handle every case in a complicated HTML document.

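    // Requires: using System; using System.Text.RegularExpressions;
    public static void MainFunc()
    {
        string str = @"<p class='MsoNormal' style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>1234@sampleemail.com<o:p></o:p></span></p>";
        // Remove every tag. Quoted attribute values are matched as a unit,
        // so a '>' inside quotes does not end the tag early.
        var result = Regex.Replace(str, "<(?:\"[^\"]*\"['\"]*|'[^']*'['\"]*|[^'\">])+>", "");
        // Prints the remaining text with all markup stripped.
        Console.WriteLine(result);
    }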
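
Stripping the tags gives you the text, but the question also asks for the separate fields and for the header paragraphs to be removed from the file. Here is a minimal sketch of one way to do both with only the base class library. The class name HeaderExtractor, the method Extract and the shortened sample HTML are made up for illustration. It also assumes each header sits in its own <p class=MsoNormal> paragraph, as in the sample tag, and that no body paragraph begins with "From:", "To:", "Cc:" or "Subject:". Like any regex approach, it will not cope with every HTML file.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class HeaderExtractor
{
    // One <p class=MsoNormal ...>...</p> paragraph (assumed layout, see above).
    private static readonly Regex Paragraph = new Regex(
        @"<p\b[^>]*\bclass\s*=\s*['""]?MsoNormal['""]?[^>]*>(?<body>.*?)</p>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Any tag; simpler than the pattern above, so it can be fooled by '>' inside quotes.
    private static readonly Regex Tag = new Regex(@"<[^>]+>");

    // Paragraph text of the form "Label: value" for the four labels we want.
    private static readonly Regex Label = new Regex(
        @"^\s*(?<name>From|To|Cc|Subject):\s*(?<value>.*?)\s*$",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Fills 'headers' with the values found and returns the HTML
    // with those header paragraphs removed.
    public static string Extract(string html, IDictionary<string, string> headers)
    {
        return Paragraph.Replace(html, m =>
        {
            // Drop the markup inside the paragraph and decode entities such as &nbsp;.
            string text = WebUtility.HtmlDecode(Tag.Replace(m.Groups["body"].Value, ""));
            Match label = Label.Match(text);
            if (!label.Success)
                return m.Value; // not a header paragraph: leave it untouched

            headers[label.Groups["name"].Value] = label.Groups["value"].Value;
            return string.Empty; // header paragraph: remove it from the output
        });
    }

    public static void Main()
    {
        // Shortened sample input, made up for this example.
        string html =
            "<p class=MsoNormal><b>From:</b><span>1234@sampleemail.com</span></p>" +
            "<p class=MsoNormal><b>Subject:</b><span>Hello</span></p>" +
            "<p class=MsoNormal>Body text</p>";

        var headers = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        string remaining = Extract(html, headers);

        Console.WriteLine(headers["From"]);    // 1234@sampleemail.com
        Console.WriteLine(headers["Subject"]); // Hello
        Console.WriteLine(remaining);          // <p class=MsoNormal>Body text</p>
    }
}
```

Doing the extraction and the removal in a single Regex.Replace callback avoids the indexOf/lastIndexOf bookkeeping from the question, because each header paragraph is matched as a whole and either kept or replaced by an empty string.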
