HtmlAgility - extract and replace plain text part (outside any tags) from HTML
I use the HtmlAgilityPack and I want to extract and replace each plain-text part (text that is not inside a tag) of an HTML document.
<html><body>bla bla 1<br />bla bla 2<br />bla bla 3<img src="img.jpg" /></body></html>
The output should be a list containing: bla bla 1; bla bla 2; bla bla 3.
node.InnerText does not apply here.
I used:
// loop over innerhtml and process
var thenode = document.DocumentNode.Descendants().Where(n => n.Name == "body").FirstOrDefault();
if (thenode != null)
{
    // InnerHtml replaces <br /> with <br>
    String[] strings = thenode.InnerHtml.Split(new string[] { "<br>" }, StringSplitOptions.RemoveEmptyEntries);
    foreach (String str in strings)
    {
        String lstr = str.Trim();
        if (lstr != String.Empty && !lstr.StartsWith("<"))
        {
            // do processing
            String loutput = Processing(lstr);
            thenode.InnerHtml = thenode.InnerHtml.Replace(lstr, loutput);
        }
    }
}
One possible way to replace all text nodes within the <body> tag with some new text:
// select all text nodes that are direct children of <body> and not empty
var textNodes = doc.DocumentNode.SelectNodes("//body/text()[normalize-space()]");
// SelectNodes returns null by default when nothing matches, so guard the loop
if (textNodes != null)
{
    foreach (HtmlNode textNode in textNodes)
    {
        // replace each text node with "new text" for the sake of demo
        textNode.ParentNode.ReplaceChild(HtmlNode.CreateNode("new text"), textNode);
    }
}
Side note: I don't see the text nodes as being outside any tag, because they are inside the <body> tag. I see them as direct children of the <body> tag.
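The distinction matters for the XPath: //body/text() matches only direct children of <body>, while //body//text() also matches text nested inside child elements. A small sketch of the difference, where the class name TextNodeScope, the helper SelectTexts, and the sample markup are assumptions made for illustration:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class TextNodeScope
{
    // Returns the text of every node matched by the given XPath expression.
    public static string[] SelectTexts(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? new string[0] : nodes.Select(n => n.InnerText).ToArray();
    }

    public static void Main()
    {
        const string html = "<html><body>one<b>two</b>three</body></html>";

        // Direct children of <body> only: the text inside <b> is skipped.
        Console.WriteLine(string.Join("; ", SelectTexts(html, "//body/text()[normalize-space()]"))); // one; three

        // Any descendant of <body>: the text inside <b> is included.
        Console.WriteLine(string.Join("; ", SelectTexts(html, "//body//text()[normalize-space()]"))); // one; two; three
    }
}
```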