Why doesn't AngleSharp generate TextNodes for interleaved text?
I'm trying to parse some HTML using the AngleSharp library, which has been great so far. I've now stumbled upon a scenario where I'd like to parse the following piece of HTML:
<a name="someLink" href="#someLink">Link 1</a>
Some text that happens to be in between elements...
<b>Some stuff in bold</b>
Some more text
<br>
Of course, this piece of HTML has enclosing parent elements etc., but the resulting list of parsed elements for it contains only the elements themselves, effectively skipping the text in between them. How do I obtain this text? I would have thought AngleSharp would generate TextNodes for these parts.
Note that fetching the parent's complete TextContent isn't what I want to do, since I still need the structure of the elements to know what is what.
This behavior is actually what's expected by the DOM spec. You may not realize this, but you've answered your own question :)
Here's what you seem to get not quite right: Element != Node. You asked for the elements, but you're looking for the nodes.
Tags like <a> etc. end up as elements, whereas text nodes are... well... nodes, not elements. And you're asking the API to give you the elements. In other words, you're telling the API you don't want the text nodes to be returned.
Let's do a simple demo.
var parser = new HtmlParser();
var doc = parser.Parse(@"<div id=""content"">
<a name=""someLink"" href=""#someLink"">Link 1</a>
Some text that happens to be in between elements...
<b>Some stuff in bold</b>
Some more text
<br>
</div>");
var content = doc.GetElementById("content");
Now, here's essentially what you've been doing:
foreach (var element in content.Children)
    Console.WriteLine(element.GetType().Name);
This outputs:
HtmlAnchorElement
HtmlBoldElement
HtmlBreakRowElement
Here's what you want instead:
foreach (var element in content.ChildNodes)
    Console.WriteLine(element.GetType().Name);
Now the output is:
TextNode
HtmlAnchorElement
TextNode
HtmlBoldElement
TextNode
HtmlBreakRowElement
TextNode
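Since you said you need both the text and the structure, here is a rough sketch of how you might walk ChildNodes and tell the two kinds apart via NodeType. It assumes the same older AngleSharp API as the demo above (parser.Parse, with HtmlParser living in the AngleSharp.Parser.Html namespace); newer releases moved and renamed some of this, so adjust the using line and the parse call for your version. The trimming and the skipping of whitespace-only text nodes are my own additions, not something the library does for you.

```csharp
using System;
using AngleSharp.Dom;
using AngleSharp.Parser.Html; // assumption: older AngleSharp release, matching parser.Parse above

class Program
{
    static void Main()
    {
        var parser = new HtmlParser();
        var doc = parser.Parse(@"<div id=""content"">
<a name=""someLink"" href=""#someLink"">Link 1</a>
Some text that happens to be in between elements...
<b>Some stuff in bold</b>
Some more text
<br>
</div>");
        var content = doc.GetElementById("content");

        // ChildNodes includes text nodes; Children would only yield elements.
        foreach (var node in content.ChildNodes)
        {
            if (node.NodeType == NodeType.Text)
            {
                // Line breaks between tags also become text nodes, so skip the empty ones.
                var text = node.TextContent.Trim();
                if (text.Length > 0)
                    Console.WriteLine("text:    " + text);
            }
            else if (node.NodeType == NodeType.Element)
            {
                Console.WriteLine("element: " + node.NodeName + " -> " + node.TextContent);
            }
        }
    }
}
```

The first and last TextNode entries in the output above are exactly such whitespace-only nodes, which is why the sketch filters them out before printing.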