
AngleSharp text element parsing

I am building limited browser functionality with AngleSharp, and the way it parses HTML confuses me a little. For example, the content of the following "div" is parsed as one BR child element plus a TextContent property with the text "te st", so it seems impossible to find the position of the BR element within the text.

 <div>te<br />st</div> 

I think it would be better if the DIV had three child components: first a text element with content "te", then a BR element, followed by another text element with content "st".

Is there any alternative solution for this?

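Actually, AngleSharp already yields the result you expect. Its DOM (and its HTML5-compliant parser) works according to the W3C specification, so there should be few surprises compared to evergreen browsers. Note that text nodes show up in ChildNodes; the Children collection lists only elements, which is probably why you saw a single BR child.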

using System;
using AngleSharp;

var text = "<div>te<br/>st</div>";
var context = BrowsingContext.New();
// .Result blocks until parsing finishes; prefer await in async code.
var document = context.OpenAsync(m => m.Content(text)).Result;
var div = document.Body.QuerySelector("div");

// ChildNodes includes text nodes as well as elements.
Console.WriteLine(div.ChildNodes.Length);

foreach (var child in div.ChildNodes)
{
    Console.WriteLine(child.NodeName);
    Console.WriteLine(child.TextContent);
}

The output is

3
#text
te
BR

#text
st

Hence we have (text node, BR element, text node). The blank line after BR is the BR element's empty TextContent. Hope this helps!
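To get the position of the BR within the text, which is what the question was after, one option is to walk ChildNodes in order and keep a running count of the text seen so far. The following is only a minimal sketch under some assumptions: the class name BrOffsetDemo is made up for illustration, only direct children of the div are inspected (nested elements are skipped), and the offset is counted in characters of the concatenated text nodes.

```csharp
using System;
using System.Text;
using AngleSharp;
using AngleSharp.Dom;

// Hypothetical demo class, not part of AngleSharp.
class BrOffsetDemo
{
    static void Main()
    {
        var html = "<div>te<br/>st</div>";
        var context = BrowsingContext.New();
        // Blocking on .Result keeps the sketch short; use await in real code.
        var document = context.OpenAsync(req => req.Content(html)).Result;
        var div = document.QuerySelector("div");

        var seen = new StringBuilder();
        foreach (var child in div.ChildNodes)
        {
            if (child.NodeType == NodeType.Text)
            {
                // Accumulate the text that precedes any later BR.
                seen.Append(child.TextContent);
            }
            else if (child.NodeName == "BR")
            {
                // The BR sits after everything collected so far.
                Console.WriteLine($"BR at offset {seen.Length}");
            }
        }

        Console.WriteLine(seen.ToString());
    }
}
```

For the div above, this should report the BR at offset 2, that is, between "te" and "st". For markup with nested elements, the same idea would need a recursive walk over the descendants.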
