
Why doesn't AngleSharp generate TextNodes for interleaved text?

I'm trying to parse some HTML using the AngleSharp library, which has been great so far. I have now stumbled upon a scenario where I'd like to parse the following piece of HTML:

<a name="someLink" href="#someLink">Link 1</a>
Some text that happens to be in between elements...
<b>Some stuff in bold</b>
Some more text
<br>

Of course, this piece of HTML has enclosing parent elements and so on, but the resulting list of parsed elements for it is:

  • HtmlAnchorElement
  • HtmlBoldElement
  • HtmlBreakRowElement

This effectively skips the text in between elements. How do I obtain this text? I would have thought AngleSharp would generate TextNodes for these parts.

Note that fetching the parent's complete TextContent isn't what I want to do, since I still need the structure of the elements to know what is what.

This behavior is actually what the DOM spec prescribes. You may not realize this, but you've answered your own question :)

Here's what you seem to have gotten not quite right: Element != Node. You asked for the elements, but you're looking for the nodes.

Tags like <a> etc. end up as elements, whereas text nodes are... well... nodes, not elements. And you're asking the API to give you the elements. In other words, you're telling the API you don't want the text nodes to be returned.

Let's do a simple demo.

var parser = new HtmlParser();
var doc = parser.Parse(@"<div id=""content"">
        <a name=""someLink"" href=""#someLink"">Link 1</a>
        Some text that happens to be in between elements...
        <b>Some stuff in bold</b>
        Some more text
        <br>
    </div>");
var content = doc.GetElementById("content");

Now, here's essentially what you've been doing:

foreach (var element in content.Children)
    Console.WriteLine(element.GetType().Name);

This outputs:

HtmlAnchorElement
HtmlBoldElement
HtmlBreakRowElement

Here's what you want instead:

// ChildNodes yields every child node (text nodes included), not just elements.
foreach (var node in content.ChildNodes)
    Console.WriteLine(node.GetType().Name);

Now the output is:

TextNode
HtmlAnchorElement
TextNode
HtmlBoldElement
TextNode
HtmlBreakRowElement
TextNode
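
Note that the first and last TextNode in that list are presumably just the whitespace around the tags in the demo markup. If you want the text itself while keeping the order of the surrounding elements, something along these lines might work. It's only a sketch: NodeDump and PrintChildren are names I made up for illustration, and trimming and skipping whitespace-only text is my assumption about what you'd want.

```csharp
using System;
using AngleSharp.Dom;

// Hypothetical helper, not part of AngleSharp.
static class NodeDump
{
    // Walks the direct children of a parent in document order and prints
    // text nodes and elements separately, so the structure is preserved.
    public static void PrintChildren(IElement parent)
    {
        foreach (var node in parent.ChildNodes)
        {
            if (node.NodeType == NodeType.Text)
            {
                // Assumption: whitespace-only text between tags is not interesting.
                var text = node.TextContent.Trim();
                if (text.Length > 0)
                    Console.WriteLine("text: " + text);
            }
            else if (node is IElement element)
            {
                // LocalName is the tag name, e.g. "a", "b" or "br".
                Console.WriteLine("element <" + element.LocalName + ">: " + element.TextContent);
            }
        }
    }
}
```

With the demo above you'd call NodeDump.PrintChildren(content); and should see the two pieces of interleaved text reported in between the a, b and br elements.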
