HTML Agility Pack - remove unwanted tags without removing content?

Question

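I have seen some related questions here, but none of them covers exactly the problem I am facing.

I want to use the HTML Agility Pack to remove unwanted tags from my HTML without losing the content inside those tags.

For example, in my scenario I would like to keep the tags "b", "i" and "u".

Given input like this: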

<p>my paragraph <div>and my <b>div</b></div> are <i>italic</i> and <b>bold</b></p>

The resulting HTML should be:

my paragraph and my <b>div</b> are <i>italic</i> and <b>bold</b>

I tried using HtmlNode's Remove method, but it removes my content as well. Any suggestions?

Answer 1

I wrote an algorithm based on Oded's suggestion. Here it is. It works like a charm.

It removes every tag except strong, em, u and raw text nodes.

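// Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
internal static string RemoveUnwantedTags(string data)
{
    if (string.IsNullOrEmpty(data)) return string.Empty;

    var document = new HtmlDocument();
    document.LoadHtml(data);

    var acceptableTags = new[] { "strong", "em", "u" };

    // SelectNodes returns null (not an empty collection) when nothing matches.
    var topLevelNodes = document.DocumentNode.SelectNodes("./*|./text()");
    if (topLevelNodes == null) return document.DocumentNode.InnerHtml;

    var nodes = new Queue<HtmlNode>(topLevelNodes);
    while (nodes.Count > 0)
    {
        var node = nodes.Dequeue();
        var parentNode = node.ParentNode;

        if (!acceptableTags.Contains(node.Name) && node.Name != "#text")
        {
            var childNodes = node.SelectNodes("./*|./text()");

            if (childNodes != null)
            {
                foreach (var child in childNodes)
                {
                    // Queue the child for inspection and lift it up next to the unwanted node.
                    nodes.Enqueue(child);
                    parentNode.InsertBefore(child, node);
                }
            }

            // The children now live in the parent, so the unwanted node can go.
            parentNode.RemoveChild(node);
        }
        // Note: children of accepted tags are never queued, so unwanted tags
        // nested inside strong, em or u are left in place.
    }

    return document.DocumentNode.InnerHtml;
}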
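The queue above only descends into nodes it removes, so an unwanted tag nested inside an accepted one survives. Below is a minimal sketch of an allow-list variant that visits every descendant instead. It assumes the HtmlAgilityPack NuGet package is referenced; the names TagFilter and KeepOnly are hypothetical and not part of the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TagFilter // hypothetical helper, not part of HtmlAgilityPack
{
    public static string KeepOnly(string html, params string[] allowedTags)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        var allowed = new HashSet<string>(allowedTags, StringComparer.OrdinalIgnoreCase);
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // Snapshot the element nodes to unwrap before the tree is modified.
        List<HtmlNode> unwanted = document.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Element && !allowed.Contains(n.Name))
            .ToList();

        foreach (HtmlNode node in unwanted)
        {
            HtmlNode parent = node.ParentNode;

            // Move each child in front of the node, preserving order,
            // then drop the now redundant wrapper.
            foreach (HtmlNode child in node.ChildNodes.ToList())
                parent.InsertBefore(child, node);

            parent.RemoveChild(node);
        }

        return document.DocumentNode.InnerHtml;
    }
}
```

A call such as TagFilter.KeepOnly(html, "b", "i", "u") would match the scenario in the question. Comments and text nodes are left untouched because only element nodes are selected.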

Answer 2

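How to recursively remove a list of unwanted HTML tags from an HTML string

I took @mathias's answer and improved his extension method so that you can supply the tags to exclude as a List<string> (e.g. {"a","p","hr"}). I also fixed the logic so that it works recursively: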

// Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
public static string RemoveUnwantedHtmlTags(this string html, List<string> unwantedTags)
{
    if (String.IsNullOrEmpty(html))
    {
        return html;
    }

    var document = new HtmlDocument();
    document.LoadHtml(html);

    // SelectNodes returns null when nothing matches.
    HtmlNodeCollection tryGetNodes = document.DocumentNode.SelectNodes("./*|./text()");

    if (tryGetNodes == null || !tryGetNodes.Any())
    {
        return html;
    }

    var nodes = new Queue<HtmlNode>(tryGetNodes);

    while (nodes.Count > 0)
    {
        var node = nodes.Dequeue();
        var parentNode = node.ParentNode;

        var childNodes = node.SelectNodes("./*|./text()");

        // Always queue the children, so tags nested inside kept tags are visited too.
        if (childNodes != null)
        {
            foreach (var child in childNodes)
            {
                nodes.Enqueue(child);
            }
        }

        if (unwantedTags.Any(tag => tag == node.Name))
        {
            // Lift the children into the parent, then remove the unwanted wrapper.
            if (childNodes != null)
            {
                foreach (var child in childNodes)
                {
                    parentNode.InsertBefore(child, node);
                }
            }

            parentNode.RemoveChild(node);
        }
    }

    return document.DocumentNode.InnerHtml;
}

Answer 3

Try the following; you may find it neater than the other proposed solutions:

public static int RemoveNodesButKeepChildren(this HtmlNode rootNode, string xPath)
{
    HtmlNodeCollection nodes = rootNode.SelectNodes(xPath);
    if (nodes == null)
        return 0;
    foreach (HtmlNode node in nodes)
        node.RemoveButKeepChildren();
    return nodes.Count;
}

public static void RemoveButKeepChildren(this HtmlNode node)
{
    foreach (HtmlNode child in node.ChildNodes)
        node.ParentNode.InsertBefore(child, node);
    node.Remove();
}

public static bool TestYourSpecificExample()
{
    string html = "<p>my paragraph <div>and my <b>div</b></div> are <i>italic</i> and <b>bold</b></p>";
    HtmlDocument document = new HtmlDocument();
    document.LoadHtml(html);
    document.DocumentNode.RemoveNodesButKeepChildren("//div");
    document.DocumentNode.RemoveNodesButKeepChildren("//p");
    return document.DocumentNode.InnerHtml == "my paragraph and my <b>div</b> are <i>italic</i> and <b>bold</b>";
}
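As an alternative to a hand-written RemoveButKeepChildren, HtmlNode also has a RemoveChild overload that takes a flag for keeping the grandchildren. The following is a minimal sketch, assuming a reasonably recent HtmlAgilityPack version; the class and method names are hypothetical, and the resulting child order is worth checking against the version you use.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class UnwrapExample // hypothetical name, for illustration only
{
    public static string Unwrap(string html, string tagName)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // Materialize the matches first: the tree is modified inside the loop.
        List<HtmlNode> targets = document.DocumentNode.Descendants(tagName).ToList();

        foreach (HtmlNode node in targets)
        {
            // Second argument (keepGrandChildren) = true: the node's children
            // are re-parented onto its parent before the node is removed.
            node.ParentNode.RemoveChild(node, true);
        }

        return document.DocumentNode.InnerHtml;
    }
}
```

Calling Unwrap(html, "div") and then Unwrap on that result with "p" would mirror the two RemoveNodesButKeepChildren calls in the test above.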

Answer 4

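Before removing the node, get its parent node and the node's InnerText, then replace the node with a text node that holds that text. Any markup nested inside the node is lost, because InnerText returns text only.

var parent = node.ParentNode;
// Use the node's own text: parent.InnerText would also include the siblings' text and duplicate it.
var innerText = node.InnerText;
// Swap the node for a text node at the same position (doc is the owning HtmlDocument).
parent.ReplaceChild(doc.CreateTextNode(innerText), node);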

Answer 5

If you don't want to use the Html Agility Pack and still want to remove unwanted HTML tags, you can do the following.

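// Requires: using System.Text.RegularExpressions; using System.Web;
public static string RemoveHtmlTags(string strHtml)
{
    // Strip every tag, including tags that span several lines.
    string strText = Regex.Replace(strHtml, "<(.|\n)*?>", String.Empty);
    // Turn entities such as &amp; back into plain characters.
    strText = HttpUtility.HtmlDecode(strText);
    // Collapse runs of whitespace into a single space.
    strText = Regex.Replace(strText, @"\s+", " ");
    return strText;
}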
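The method above strips every tag, whereas the question asks to keep b, i and u. Below is a minimal regex-only sketch with an allow-list, under the assumption that the input is simple, well-formed markup. The class and method names are hypothetical. A regex is not an HTML parser, so attribute values containing ">", comments and script blocks are not handled.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexTagStripper // hypothetical name, for illustration only
{
    // Matches any opening or closing tag whose name is not exactly b, i or u.
    // The negative lookahead rejects those three names; \b stops "br" from
    // being treated as "b".
    private static readonly Regex UnwantedTag =
        new Regex(@"</?(?!(?:b|i|u)\b)[a-zA-Z][^>]*>", RegexOptions.IgnoreCase);

    public static string StripExcept(string html)
    {
        return string.IsNullOrEmpty(html)
            ? string.Empty
            : UnwantedTag.Replace(html, string.Empty);
    }

    public static void Main()
    {
        string html = "<p>my paragraph <div>and my <b>div</b></div> are <i>italic</i> and <b>bold</b></p>";
        Console.WriteLine(StripExcept(html));
        // prints: my paragraph and my <b>div</b> are <i>italic</i> and <b>bold</b>
    }
}
```

Unlike RemoveHtmlTags, this sketch neither decodes entities nor collapses whitespace, so the text between the tags is returned unchanged.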

5 answers

Answer 1: 55, accepted, 2012-10-11 10:00:24
Answer 2: 13, 2015-02-03 12:23:47
Answer 3: 8, 2014-07-23 14:29:53
Answer 4: 4, 2012-10-08 18:34:37
Answer 5: 3, 2015-05-04 08:54:14
