
How can I extract text visible on a page from its HTML source?

I tried HtmlAgilityPack with the following code, but it does not pick up the text inside HTML lists:

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlStr);
HtmlNode node = doc.DocumentNode;
return node.InnerText;

Here is the HTML it fails on:

<as html>
<p>This line is picked up <b>correctly</b>.  List items hasn't...</p>
<p><ul>
<li>List Item 1</li>
<li>List Item 2</li>
<li>List Item 3</li> 
<li>List Item 4</li>
</ul></p>
</as html>

That happens because you need to walk the tree and concatenate the InnerText of every node in some way.

The following code works for me:

string StripHTML(string htmlStr)
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(htmlStr);
    var root = doc.DocumentNode;
    string s = "";
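    // Walk every node in the tree, including the root itself.
    foreach (var node in root.DescendantNodesAndSelf())
    {
        // Only leaf nodes are read, so a parent's InnerText never duplicates its children's text.
        if (!node.HasChildNodes)
        {
            string text = node.InnerText;
            // Skip whitespace-only nodes so the result has no runs of extra spaces.
            if (!string.IsNullOrWhiteSpace(text))
            {
                s += text.Trim() + " ";
            }
        }
    }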
    return s.Trim();
}
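
Since the question asks for text that is visible on the page, a possible refinement of the tree walk above is to skip text that a browser would not render and to decode HTML entities. The following is only a sketch and is not part of the original answer: the class name VisibleText, the method Extract, and the list of hidden element names are assumptions chosen for illustration, and the list is not exhaustive (for example, it ignores elements hidden by CSS).

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack;

static class VisibleText
{
    // Hypothetical, non-exhaustive list of elements whose text is not rendered.
    static readonly HashSet<string> Hidden =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "script", "style", "head", "noscript" };

    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var sb = new StringBuilder();
        foreach (HtmlNode node in doc.DocumentNode.DescendantsAndSelf())
        {
            // Keep text nodes only; this drops elements and comments.
            if (node.NodeType != HtmlNodeType.Text) continue;
            if (HasHiddenAncestor(node)) continue;

            // InnerText is raw markup text, so decode entities such as &amp;.
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (text.Length > 0) sb.Append(text).Append(' ');
        }
        return sb.ToString().TrimEnd();
    }

    // True when the node sits anywhere inside one of the hidden elements.
    static bool HasHiddenAncestor(HtmlNode node)
    {
        for (HtmlNode p = node.ParentNode; p != null; p = p.ParentNode)
        {
            if (Hidden.Contains(p.Name)) return true;
        }
        return false;
    }
}
```

Calling VisibleText.Extract(htmlStr) in place of StripHTML(htmlStr) should return the same kind of space-separated string, while leaving out script and style content and using a StringBuilder instead of repeated string concatenation.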

