AngleSharp text element parsing

I am developing limited browser functionality with AngleSharp. The way it parses HTML confused me a little. For example, the content of the following "div" is parsed as one BR child element and a TextContent property with the text "te st". So it is impossible to find the position of the BR element in the text.

 <div>te<br />st</div> 

I think it would be better if the DIV had 3 child components: first a text element with content "te", then a BR element, followed by another text element with content "st".

Is there any alternative solution for this?

Actually, it will yield the expected result. AngleSharp's DOM (and HTML5-compliant parser) works according to the W3C specification. As such, there should be few surprises compared to evergreen browsers.

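using System;
using AngleSharp;

var text = "<div>te<br/>st</div>";
var context = BrowsingContext.New();
// Blocking on .Result keeps the sample short; prefer await in real code.
var document = context.OpenAsync(m => m.Content(text)).Result;
var div = document.Body.QuerySelector("div");

// ChildNodes includes text nodes; Children would list only elements.
Console.WriteLine(div.ChildNodes.Length);

foreach (var child in div.ChildNodes)
{
    Console.WriteLine(child.NodeName);
    Console.WriteLine(child.TextContent);
}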

The output is:

3
#text
te
BR

#text
st

Hence we have (text node, BR element, text node). Hope this helps!
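Since the question was about locating the BR within the text, here is a minimal sketch of one way to do that with the same child-node list. It assumes the BR is a direct child of the div and uses top-level statements (C# 9 or later); the `offset` variable and the message format are illustrative, not part of AngleSharp.

```csharp
using System;
using AngleSharp;
using AngleSharp.Dom;

var context = BrowsingContext.New();
var document = await context.OpenAsync(m => m.Content("<div>te<br/>st</div>"));
var div = document.QuerySelector("div");

// Number of text characters seen before the current child (illustrative name).
var offset = 0;

foreach (var child in div.ChildNodes)
{
    if (child.NodeType == NodeType.Text)
    {
        // Text nodes advance the running position.
        offset += child.TextContent.Length;
    }
    else if (child.NodeName == "BR")
    {
        // A BR contributes no text, so it sits at the current offset.
        Console.WriteLine($"BR at text offset {offset}");
    }
}
```

For the sample markup this should report offset 2, the length of "te". Nested markup would need a recursive walk over each element's own ChildNodes rather than this single loop.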
