Parsing an HTML document: regular expressions or LINQ?

Question

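I am trying to parse an HTML document and extract some elements (any links to text files).

The current strategy is to load the HTML document into a string and then find all instances of links to text files. It could be any file type, but for this question it is a text file.

The end goal is an IEnumerable list of string objects. That part is easy; parsing the data is the problem.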

<html>
<head><title>Blah</title>
</head>
<body>
<br/>
<div>Here is your first text file: <a href="http://myServer.com/blah.txt"></div>
<span>Here is your second text file: <a href="http://myServer.com/blarg2.txt"></span>
<div>Here is your third text file: <a href="http://myServer.com/bat.txt"></div>
<div>Here is your fourth text file: <a href="http://myServer.com/somefile.txt"></div>
<div>Thanks for visiting!</div>
</body>
</html>

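The initial approaches considered were:

1. Load the string into an XML document and attack it in a LINQ to XML fashion.
2. Create a regular expression that looks for a string starting with href= and ending with .txt.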

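The questions are:

1. What would that regular expression look like? I am a regex newbie, and this is part of my regex learning.
2. Which method would you use to extract the list of tags?
3. Which would be the most performant way?
4. Which method would be the most readable/maintainable?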

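Update: Kudos to Matthew for the HTML Agility Pack suggestion. It worked great! The XPath suggestion works as well. I wish I could mark both answers as "the answer", but I obviously cannot. Both are valid solutions to the problem.

Here is a C# console application that uses the regular expression Jeff suggested. It reads the string fine and does not include any href that does not end with .txt. With the given sample, it correctly excludes the .txt.snarg file from the results (as provided in the HTML string function).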

    using System;
    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    namespace ParsePageLinks
    {
        class Program
        {
            static void Main(string[] args)
            {
                // Print every captured .txt link.
                foreach (string link in GetAllLinksFromStringByRegex())
                {
                    Console.WriteLine(link);
                }
            }

            static List<string> GetAllLinksFromStringByRegex()
            {
                string myHtmlString = BuildHtmlString();
                // Unescaped, the pattern is: href="([^"]*\.txt)"
                string txtFileExp = "href=\"([^\"]*\\.txt)\"";
                List<string> foundTextFiles = new List<string>();

                MatchCollection textFileLinkMatches =
                    Regex.Matches(myHtmlString, txtFileExp, RegexOptions.IgnoreCase);
                foreach (Match m in textFileLinkMatches)
                {
                    foundTextFiles.Add(m.Groups[1].Value); // this is your captured group
                }
                return foundTextFiles;
            }

            static string BuildHtmlString()
            {
                // The third link ends in .txt.snarg and must not be matched.
                return @"<html><head><title>Blah</title></head><body><br/>
                    <div>Here is your first text file: <a href=""http://myServer.com/blah.txt""></div>
                    <span>Here is your second text file: <a href=""http://myServer.com/blarg2.txt""></span>
                    <div>Here is your third text file: <a href=""http://myServer.com/bat.txt.snarg""></div>
                    <div>Here is your fourth text file: <a href=""http://myServer.com/somefile.txt""></div>
                    <div>Thanks for visiting!</div></body></html>";
            }
        }
    }

Answer 1

Neither. Load it into an (X/HT)MLDocument and use XPath, which is a standard and very powerful way of manipulating XML. The functions to look at are SelectNodes and SelectSingleNode.

Since you are apparently using HTML (not XHTML), you should use the HTML Agility Pack. Most of its methods and properties match the related XML classes.
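
To make the HTML versus XHTML distinction concrete, here is a minimal sketch that uses only the standard library's LINQ to XML. The helper name TryGetTxtLinks and the self-closed variant of the sample line are assumptions made for illustration; they are not part of the question or of any answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

static class LinqToXmlSketch
{
    // Returns href values ending in ".txt", or null when the markup is not well-formed XML.
    public static List<string> TryGetTxtLinks(string markup)
    {
        try
        {
            XDocument doc = XDocument.Parse(markup);
            return doc.Descendants("a")
                      .Select(a => (string)a.Attribute("href"))
                      .Where(href => href != null &&
                                     href.EndsWith(".txt", StringComparison.OrdinalIgnoreCase))
                      .ToList();
        }
        catch (XmlException)
        {
            return null; // for example, an <a> tag that is never closed
        }
    }

    static void Main()
    {
        // Same shape as the question's sample: the <a> tag is never closed.
        string html = "<html><body><div>First: <a href=\"http://myServer.com/blah.txt\"></div></body></html>";
        // Hypothetical well-formed variant of the same line.
        string xhtml = "<html><body><div>First: <a href=\"http://myServer.com/blah.txt\" /></div></body></html>";

        Console.WriteLine(TryGetTxtLinks(html) == null); // True: the XML parser rejects it
        Console.WriteLine(TryGetTxtLinks(xhtml).Count);  // 1
    }
}
```

The sample page from the question falls into the first case, which is why a strict XML parser is a poor fit here and an HTML-tolerant parser is suggested instead.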

Sample implementation using XPath:

    using System.Collections.Generic;
    using System.IO;
    using HtmlAgilityPack;

    HtmlDocument doc = new HtmlDocument();
    doc.Load(new StringReader(@"<html>
    <head><title>Blah</title>
    </head>
    <body>
    <br/>
    <div>Here is your first text file: <a href=""http://myServer.com/blah.txt""></div>
    <span>Here is your second text file: <a href=""http://myServer.com/blarg2.txt""></span>
    <div>Here is your third text file: <a href=""http://myServer.com/bat.txt""></div>
    <div>Here is your fourth text file: <a href=""http://myServer.com/somefile.txt""></div>
    <div>Thanks for visiting!</div>
    </body>
    </html>"));
    HtmlNode root = doc.DocumentNode;
    // Select <a> elements whose href ends in ".txt" by comparing its last four characters.
    // 3 = ".txt".Length - 1.  See http://stackoverflow.com/questions/402211/how-to-use-xpath-function-in-a-xpathexpression-instance-programatically
    HtmlNodeCollection links = root.SelectNodes("//a[@href['.txt' = substring(., string-length(.) - 3)]]");
    IList<string> fileStrings;
    if (links != null)
    {
        fileStrings = new List<string>(links.Count);
        foreach (HtmlNode link in links)
        {
            fileStrings.Add(link.GetAttributeValue("href", null));
        }
    }
    else
    {
        // SelectNodes returned null, so there is nothing to collect.
        fileStrings = new List<string>(0);
    }
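
The substring trick in that XPath expression exists because XPath 1.0 has starts-with() and contains() but no ends-with(). The sketch below runs the same expression through the standard System.Xml classes so it can be tried without installing anything. It assumes well-formed input (the anchors are self-closed, unlike the sample page), and the class and method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

static class XPathEndsWithSketch
{
    // Emulates ends-with(@href, '.txt'): take the last four characters and compare.
    const string TxtLinks = "//a[@href['.txt' = substring(., string-length(.) - 3)]]";

    public static List<string> GetTxtLinks(string wellFormedXml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(wellFormedXml); // throws XmlException on malformed markup

        var result = new List<string>();
        foreach (XmlNode link in doc.SelectNodes(TxtLinks))
        {
            result.Add(link.Attributes["href"].Value);
        }
        return result;
    }

    static void Main()
    {
        // Hypothetical well-formed input; XmlDocument would reject the unclosed <a> tags of the sample.
        string xml = "<body>" +
                     "<a href=\"http://myServer.com/blah.txt\" />" +
                     "<a href=\"http://myServer.com/bat.txt.snarg\" />" +
                     "</body>";

        foreach (string href in GetTxtLinks(xml))
        {
            Console.WriteLine(href); // http://myServer.com/blah.txt
        }
    }
}
```

Note that the comparison is case-sensitive, so a link ending in .TXT would not be selected by this expression.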

Answer 2

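I would recommend regex. Why?

- Flexible (case-insensitive, easy to add new file extensions, elements to check, etc.)
- Fast to write
- Fast to run

Regex expressions will not be hard to read, as long as you can write regexes.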

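Use this as the regex:

    href="([^"]*\.txt)"

Explanation:

- It has parentheses around the filename, which results in a "captured group" that you can access after each match is found.
- It has to escape the "." by using the regex escape character, a backslash.
- It has to match any character EXCEPT a double quote, [^"], until it finds ".txt".

It translates into an escaped string like this:

    string txtExp = "href=\"([^\"]*\\.txt)\"";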

Then you can iterate over your matches:

    MatchCollection txtMatches = Regex.Matches(input, txtExp, RegexOptions.IgnoreCase);
    foreach (Match m in txtMatches)
    {
        string filename = m.Groups[1].Value; // this is your captured group
    }
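
The flexibility mentioned at the top of this answer (case-insensitivity, easy to add file extensions) could be sketched as follows. BuildLinkRegex and GetLinks are hypothetical helper names, not something from the answer; they only wrap the same pattern with an alternation of extensions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class ExtensionRegexSketch
{
    // Builds href="([^"]*\.(?:ext1|ext2|...))" from the given extensions.
    public static Regex BuildLinkRegex(params string[] extensions)
    {
        string alternatives = string.Join("|", extensions.Select(e => Regex.Escape(e)));
        return new Regex("href=\"([^\"]*\\.(?:" + alternatives + "))\"", RegexOptions.IgnoreCase);
    }

    // Returns the captured href values for links ending in one of the extensions.
    public static List<string> GetLinks(string html, params string[] extensions)
    {
        return BuildLinkRegex(extensions)
            .Matches(html)
            .Cast<Match>()
            .Select(m => m.Groups[1].Value)
            .ToList();
    }

    static void Main()
    {
        string html = "<a href=\"http://myServer.com/BLAH.TXT\">" +
                      "<a href=\"http://myServer.com/data.csv\">" +
                      "<a href=\"http://myServer.com/bat.txt.snarg\">";

        // Matches the upper-case .TXT link and the .csv link, but not .txt.snarg.
        foreach (string link in GetLinks(html, "txt", "csv"))
        {
            Console.WriteLine(link);
        }
    }
}
```

Like the original pattern, this only recognizes double-quoted href values, so single-quoted or unquoted attributes would be missed.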

Answer 3

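Regex is not fast; in fact, it is slower than native string parsing in .NET. Don't believe me? See for yourself.

None of the examples above is faster than going straight to the DOM.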

    // Assumes wb is a System.Windows.Forms.WebBrowser control that has finished loading the page.
    HtmlDocument doc = wb.Document;
    var links = doc.Links;
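
The "native string parsing" that this answer compares against regex is not shown, so the following is only one possible reading of it: a plain IndexOf loop over the HTML string. The class and method names are invented for the sketch, and it makes no claim about relative speed; anyone who wants to "see for yourself" would need to time both versions (for example with System.Diagnostics.Stopwatch) on their own input.

```csharp
using System;
using System.Collections.Generic;

static class IndexOfSketch
{
    // Collects double-quoted href values that end in ".txt", using only string methods.
    public static List<string> GetTxtLinks(string html)
    {
        const string marker = "href=\"";
        var links = new List<string>();
        int pos = 0;

        while ((pos = html.IndexOf(marker, pos, StringComparison.OrdinalIgnoreCase)) >= 0)
        {
            int start = pos + marker.Length;
            int end = html.IndexOf('"', start);
            if (end < 0)
            {
                break; // unterminated attribute value, stop scanning
            }

            string value = html.Substring(start, end - start);
            if (value.EndsWith(".txt", StringComparison.OrdinalIgnoreCase))
            {
                links.Add(value);
            }
            pos = end + 1; // continue after the closing quote
        }
        return links;
    }

    static void Main()
    {
        string html = "<div><a href=\"http://myServer.com/blah.txt\"></div>" +
                      "<div><a href=\"http://myServer.com/bat.txt.snarg\"></div>";

        foreach (string link in GetTxtLinks(html))
        {
            Console.WriteLine(link); // http://myServer.com/blah.txt
        }
    }
}
```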

Answer 4

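As an alternative to Matthew Flaschen's suggestion, there is the DOM (for example, if you suffer from outbreaks of X?L allergy).

It sometimes gets a bad rap, I guess because implementations can be quirky and the native COM interfaces are a bit unwieldy without some (minor) smart helpers, but I have found it a robust, stable, and intuitive/explorable way to parse and manipulate HTML.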

4 answers

Answer 1: 13, 2009-05-25 18:00:57
Answer 2: 1, accepted, 2009-05-25 18:25:26
Answer 3: 0, 2011-03-01 19:52:06
Answer 4: 0, 2009-05-25 18:28:13
