Getting the tags around text in an HTML document using C#

Question

I would like to search an HTML file for a certain string and then extract the tags around it. Given:

<div_outer><div_inner>Happy birthday<div><div>

I would like to search the HTML for "Happy birthday" and then have a function return some sort of tag structure: this is the innermost tag, this is the tag outside that one, and so on. So, <div_inner></div>, then <div_outer></div>.

Any ideas? I am thinking of HtmlAgilityPack, but I haven't been able to figure out how to do it.

Thanks as always, guys.

Answer 1

HAP (Html Agility Pack) is indeed a good fit for this.

You can use the OuterHtml and ParentNode properties of an HtmlNode to get the enclosing elements and their markup.
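As a rough illustration of that idea, here is a minimal sketch that finds the first text node containing the search string and then walks up through ParentNode, collecting element names innermost first. The class name TagFinder and the method EnclosingTags are hypothetical names chosen for this example, and it assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class TagFinder
{
    // Hypothetical helper (not part of HAP): returns the names of the
    // elements enclosing the first text node that contains `text`,
    // ordered from innermost to outermost.
    public static List<string> EnclosingTags(string html, string text)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Look for the first text node whose content contains the search string.
        var textNode = doc.DocumentNode
            .Descendants()
            .FirstOrDefault(n => n.NodeType == HtmlNodeType.Text
                              && n.InnerText.Contains(text));

        var tags = new List<string>();
        if (textNode == null)
            return tags; // nothing found, return an empty list

        // Climb the tree until we leave the element nodes (i.e. reach the document root).
        for (var node = textNode.ParentNode;
             node != null && node.NodeType == HtmlNodeType.Element;
             node = node.ParentNode)
        {
            tags.Add(node.Name);
        }
        return tags;
    }
}

class Program
{
    static void Main()
    {
        var tags = TagFinder.EnclosingTags(
            "<div_outer><div_inner>Happy birthday<div><div>", "Happy birthday");
        Console.WriteLine(string.Join(", ", tags));
    }
}
```

If you need the markup rather than just the names, you could collect node.OuterHtml instead of node.Name inside the loop.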

Answer 2

You could use XPath for this. I use the expression //*[text()='Happy birthday'][1]/ancestor-or-self::*, which finds the first node (for simplicity) whose text content is Happy birthday, and then returns that node together with all of its ancestors (parent, grandparent, etc.). Strictly speaking, [1] applies per parent element; (//*[text()='Happy birthday'])[1] would pick the first match in the whole document.

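using System.Linq;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<div_outer><div_inner>Happy birthday<div><div>");

// Select the matching element together with all of its ancestors.
var ancestors = doc.DocumentNode
    .SelectNodes("//*[text()='Happy birthday'][1]/ancestor-or-self::*")
    .Reverse()   // results come back outermost first; reverse to get innermost first
    .ToList();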

It seems that the nodes are returned in document order, so I used the Enumerable.Reverse method to reverse them.

This will return 2 nodes: div_inner and div_outer.
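One thing to watch for: SelectNodes returns null, not an empty collection, when nothing matches, so calling Reverse directly on the result would throw if the text is absent. A minimal sketch of a guarded version that prints the tag names, assuming the HtmlAgilityPack package is referenced and reusing the same sample markup:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div_outer><div_inner>Happy birthday<div><div>");

        // SelectNodes yields null when the XPath expression matches nothing.
        var nodes = doc.DocumentNode
            .SelectNodes("//*[text()='Happy birthday'][1]/ancestor-or-self::*");

        if (nodes == null)
        {
            Console.WriteLine("No match found.");
            return;
        }

        // Reverse the result so the innermost tag is printed first.
        foreach (var node in nodes.Reverse())
        {
            Console.WriteLine(node.Name);
        }
    }
}
```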

Answer 1: score 2, accepted, 2012-04-04 19:46:27

Answer 2: score 1, 2012-04-04 21:52:18
