Get tags around text in HTML document using C#
I would like to search an HTML file for a certain string and then extract the tags that enclose it. Given:
<div_outer><div_inner>Happy birthday<div><div>
I would like to search the HTML for "Happy birthday" and have a function return some sort of tag structure: this is the innermost tag, this is the tag outside that one, and so on. So, <div_inner></div> and then <div_outer></div>.
Any ideas? I am thinking of HtmlAgilityPack, but I haven't been able to figure out how to do it.
Thanks as always, guys.
The HAP (HtmlAgilityPack) is indeed a good fit for this.
You can use the OuterHtml and ParentNode properties of an HtmlNode to get the enclosing elements and their markup.
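To make the ParentNode approach concrete, here is a minimal sketch. The class name EnclosingTags and the method Find are hypothetical names made up for illustration, and the sample input in the usage note uses properly closed tags, which is an assumption rather than the exact markup from the question. It locates the first text node containing the search string and climbs the ParentNode chain until it leaves the element tree.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class EnclosingTags
{
    // Hypothetical helper: returns the elements that enclose the first text
    // node containing `needle`, innermost element first.
    public static List<HtmlNode> Find(string html, string needle)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Look through all descendants for the first text node with the string.
        var textNode = doc.DocumentNode
            .Descendants()
            .FirstOrDefault(n => n.NodeType == HtmlNodeType.Text
                              && n.InnerText.Contains(needle));

        var result = new List<HtmlNode>();
        if (textNode == null)
        {
            return result; // nothing found: empty list
        }

        // Climb upwards; the loop stops at the document node, which is not an element.
        for (var node = textNode.ParentNode;
             node != null && node.NodeType == HtmlNodeType.Element;
             node = node.ParentNode)
        {
            result.Add(node);
        }
        return result;
    }
}
```

A call such as EnclosingTags.Find("<div_outer><div_inner>Happy birthday</div_inner></div_outer>", "Happy birthday") gives back the enclosing nodes, and each node's Name and OuterHtml can then be read to get the tag name and its markup.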
You could use XPath for this. I use the expression
//*[text()='Happy birthday'][1]/ancestor-or-self::*
which finds an element whose text content is Happy birthday and then returns that element together with all of its ancestors (parent, grandparent, etc.). Note that the [1] predicate here keeps the first match under each parent rather than the first match in the whole document; wrapping the path as (//*[text()='Happy birthday'])[1] gives the latter. For this sample the two are equivalent:
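using System.Linq;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<div_outer><div_inner>Happy birthday<div><div>");
// SelectNodes returns null when nothing matches, so guard before chaining in real code.
var ancestors = doc.DocumentNode
    .SelectNodes("//*[text()='Happy birthday'][1]/ancestor-or-self::*")
    .Reverse()   // innermost element first
    .ToList();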
The nodes appear to come back in document order (outermost first), so I used the Enumerable.Reverse method to put the innermost element first.
This will return 2 nodes: div_inner and div_outer.
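If this is going into a reusable function, a null check is worth adding, because SelectNodes returns null rather than an empty collection when nothing matches (at least with HtmlAgilityPack's default settings). The following is only a sketch: AncestorLookup and TagNames are hypothetical names, the parenthesised XPath form is used so that [1] means the first match in the whole document, and the search text is assumed to contain no single quote since it is spliced directly into the expression.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class AncestorLookup
{
    // Hypothetical helper: returns the tag names around the first element whose
    // text equals `text`, innermost first, or an empty list if there is no match.
    public static List<string> TagNames(string html, string text)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumption: `text` contains no single quote, otherwise the XPath breaks.
        var xpath = $"(//*[text()='{text}'])[1]/ancestor-or-self::*";

        var nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return new List<string>(); // no match
        }

        // Relies on the document-order result observed above; Reverse puts the innermost first.
        return nodes.Reverse().Select(n => n.Name).ToList();
    }
}
```

Returning names instead of nodes keeps the sketch short; returning the HtmlNode objects themselves works the same way if the OuterHtml is needed.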