
Get tags around text in HTML document using C#

I would like to search an HTML file for a certain string and then extract the tags around it. Given:

<div_outer><div_inner>Happy birthday<div><div>

I would like to search the HTML for "Happy birthday" and then have a function return some sort of tag structure: this is the innermost tag, this is the tag outside that one, and so on. So, <div_inner></div> then <div_outer></div>.

Any ideas? I am thinking of HtmlAgilityPack, but I haven't been able to figure out how to do it.

Thanks as always, guys.

The HAP (HtmlAgilityPack) is indeed a good fit for this.

You can use the OuterHtml and ParentNode properties of an HtmlNode to get the enclosing elements and their markup.
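The parent-walking idea can be sketched roughly as follows. This is only an illustration, not the answerer's code: the class name TagFinder and the method EnclosingElements are hypothetical, and it assumes the search string sits inside a single text node.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TagFinder
{
    // Hypothetical helper: returns the elements that enclose the first text
    // node containing `needle`, innermost element first.
    public static List<HtmlNode> EnclosingElements(string html, string needle)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Find the first text node whose text contains the search string.
        var textNode = doc.DocumentNode
            .Descendants()
            .FirstOrDefault(n => n.NodeType == HtmlNodeType.Text
                              && n.InnerText.Contains(needle));

        var chain = new List<HtmlNode>();

        // Walk upwards via ParentNode, keeping element nodes only
        // (this skips the document root). If nothing matched, the loop
        // body never runs and an empty list is returned.
        for (var node = textNode?.ParentNode; node != null; node = node.ParentNode)
        {
            if (node.NodeType == HtmlNodeType.Element)
                chain.Add(node);
        }

        return chain;
    }
}
```

Each returned node exposes Name for the tag name and OuterHtml for the full markup of that element, so the caller can print whichever representation is needed.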

You could use XPath for this. I use the expression //*[text()='Happy birthday'][1]/ancestor-or-self::*, which finds the first (for simplicity) node whose text content is Happy birthday, and then returns that node together with all of its ancestors (parent, grandparent, etc.):

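using System.Linq;      // Enumerable.Reverse, ToList
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<div_outer><div_inner>Happy birthday<div><div>");

// Select the matching element plus all of its ancestors, then put the innermost first.
var ancestors = doc.DocumentNode
    .SelectNodes("//*[text()='Happy birthday'][1]/ancestor-or-self::*")
    .Reverse()
    .ToList();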

It seems that the nodes are returned in document order (outermost first), so I used the Enumerable.Reverse method to reverse them.

This will return 2 nodes: div_inner and div_outer.
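Two details may be worth guarding against when reusing the XPath approach. In HtmlAgilityPack, SelectNodes returns null rather than an empty collection when nothing matches, and in XPath 1.0 a trailing [1] on //*[...] is applied per parent, so it can match more than one element in a larger document. A hedged sketch that wraps the whole path in parentheses and sorts by depth instead of relying on the returned order (the class and method names are hypothetical):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class XPathTagFinder
{
    // Hypothetical helper: returns the first element whose text node equals
    // `text`, followed by its ancestors, innermost first.
    public static List<HtmlNode> AncestorsOrSelf(string html, string text)
    {
        // Assumption of this sketch: the text is embedded in a quoted XPath
        // literal, so single quotes are rejected rather than escaped.
        if (text.Contains("'"))
            throw new ArgumentException("Single quotes are not handled in this sketch.", nameof(text));

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Parentheses make [1] pick the first match in the whole document.
        var xpath = $"(//*[text()='{text}'])[1]/ancestor-or-self::*";
        var nodes = doc.DocumentNode.SelectNodes(xpath);

        // SelectNodes yields null when there is no match.
        if (nodes == null)
            return new List<HtmlNode>();

        // Deeper nodes have more ancestors, so this puts the innermost first.
        return nodes
            .OrderByDescending(n => n.Ancestors().Count())
            .ToList();
    }
}
```

Sorting by depth is a design choice made here for robustness; reversing the collection, as in the answer above, is shorter when the document-order behaviour can be relied on.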
