
How do I remove some (or all) HTML elements and/or attributes using HTML Agility Pack?

Original post: 2010-02-28 17:56:09 | Tags: c#, .net, html-parsing

Using the HTML Agility Pack, how can I remove all HTML attributes, elements, and so on from a blob of HTML, so that the result looks as if I had pasted it into Notepad?

Additionally, I need to remove all formatting but keep the UL/LI and B tags.

1 answer

Load the HTML into an HtmlDocument instance, get the HtmlNode returned by the DocumentNode property, and read the InnerText property of that document node. It will give you all the text stripped of HTML tags.
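As a minimal sketch of that first case, assuming the HtmlAgilityPack NuGet package is referenced (the sample markup and the `Program` class are made up for illustration):

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical input; any HTML fragment works here.
        string html = "<div class=\"note\"><p>Hello <b>world</b> &amp; friends</p></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // InnerText concatenates the text of every descendant node,
        // dropping all tags and attributes.
        string text = doc.DocumentNode.InnerText;

        // InnerText leaves entities such as &amp; encoded;
        // HtmlEntity.DeEntitize decodes them into plain characters.
        Console.WriteLine(HtmlEntity.DeEntitize(text));
    }
}
```

Whether you need the DeEntitize step depends on what you do with the text afterwards; it is included here because the goal is output that reads like text pasted into Notepad.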

If you want to keep only a particular subset of nodes when filtering, it's going to be a little more difficult.

First, load the content into an HtmlDocument instance and get the HtmlNode instance returned by the DocumentNode property (I'll refer to this node as the root node).

At the same time, create a second HtmlDocument instance to hold the new document you are building.

On the first document, iterate through the root node recursively (it doesn't have to be an actual recursive method, but semantically the behavior is recursive), analyzing the node and all of its child nodes.

If the node itself is one of the nodes you approve of, begin constructing a new instance of that node.

However, if it is not, still process the element's child nodes, taking the content of each text node (text is itself a node) and appending it to whatever node is currently on the stack (if there is one).
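The steps above could be sketched roughly as follows. This is one possible reading of the approach, not a definitive implementation: the `HtmlWhitelist` class, the `Filter` and `CopyChildren` names, and the allowed-tag set are all hypothetical, and the "stack" is simply the call stack of the recursive method. When no approved node is open, text is appended directly to the root of the new document.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class HtmlWhitelist
{
    // Tags to keep, taken from the question (UL/LI and B).
    static readonly HashSet<string> Allowed =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "ul", "li", "b" };

    public static string Filter(string html)
    {
        // First document: the content being filtered.
        var source = new HtmlDocument();
        source.LoadHtml(html);

        // Second document: the new document being built.
        var target = new HtmlDocument();
        CopyChildren(source.DocumentNode, target.DocumentNode, target);
        return target.DocumentNode.OuterHtml;
    }

    // Walks the children of 'from' and appends what survives to 'to'.
    static void CopyChildren(HtmlNode from, HtmlNode to, HtmlDocument target)
    {
        foreach (HtmlNode child in from.ChildNodes)
        {
            if (child.NodeType == HtmlNodeType.Text)
            {
                // Text is itself a node: append it to the current output node.
                string text = ((HtmlTextNode)child).Text;
                to.AppendChild(target.CreateTextNode(text));
            }
            else if (child.NodeType == HtmlNodeType.Element)
            {
                if (Allowed.Contains(child.Name))
                {
                    // Approved: create a fresh element, so no attributes carry over.
                    HtmlNode copy = target.CreateElement(child.Name);
                    to.AppendChild(copy);
                    CopyChildren(child, copy, target);
                }
                else
                {
                    // Not approved: drop the tag but keep processing its children.
                    CopyChildren(child, to, target);
                }
            }
            // Comments and other node types are skipped.
        }
    }
}
```

A call such as `HtmlWhitelist.Filter(someHtml)` would then return markup containing only the allowed tags and the text. Note that this sketch treats the contents of elements like script and style as ordinary text, so you may want to skip those elements entirely before recursing.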