Why doesn't AngleSharp generate TextNodes for interleaved text?

Question

I'm trying to parse some HTML using the AngleSharp library, which has been great so far. I've now run into a scenario where I'd like to parse the following piece of HTML:

<a name="someLink" href="#someLink">Link 1</a>
Some text that happens to be in between elements...
<b>Some stuff in bold</b>
Some more text
<br>

Of course, this piece of HTML has enclosing parent elements and so on, but the resulting list of parsed elements for it is:

HtmlAnchorElement
HtmlBoldElement
HtmlBreakRowElement

This effectively skips the text in between the elements. How do I obtain this text? I would have thought AngleSharp would generate TextNodes for these parts.

Note that fetching the parent's complete TextContent isn't what I want, since I still need the structure of the elements to know what is what.

Answer 1

This behavior is actually what the DOM spec expects. You may not realize it, but you've answered your own question :)

Here's what you seem to have gotten not quite right: Element != Node. You asked for the elements, but you're looking for the nodes.

Tags like <a> end up as elements, whereas text nodes are... well... nodes, not elements. You're asking the API to give you the elements. In other words, you're telling the API you don't want the text nodes to be returned.

Let's do a simple demo.

var parser = new HtmlParser();
var doc = parser.Parse(@"<div id=""content"">
        <a name=""someLink"" href=""#someLink"">Link 1</a>
        Some text that happens to be in between elements...
        <b>Some stuff in bold</b>
        Some more text
        <br>
    </div>");
var content = doc.GetElementById("content");

Now, here's essentially what you've been doing:

foreach (var element in content.Children)
    Console.WriteLine(element.GetType().Name);

This outputs:

HtmlAnchorElement
HtmlBoldElement
HtmlBreakRowElement

Here's what you want instead:

// ChildNodes yields every child node (text nodes included), not just elements
foreach (var node in content.ChildNodes)
    Console.WriteLine(node.GetType().Name);

Now the output is:

TextNode
HtmlAnchorElement
TextNode
HtmlBoldElement
TextNode
HtmlBreakRowElement
TextNode
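
To actually read the in-between text while keeping the element structure, one option is to branch on each node's NodeType while walking ChildNodes. The following is only a sketch under a few assumptions: it targets the AngleSharp 0.9.x API used in the demo (newer releases move HtmlParser to the AngleSharp.Html.Parser namespace and rename Parse to ParseDocument), and the decision to skip whitespace-only text nodes and the "text:" label are illustrative choices, not something the library prescribes.

```csharp
using System;
using AngleSharp.Dom;
using AngleSharp.Parser.Html; // assumption: AngleSharp 0.9.x namespace layout

class Program
{
    static void Main()
    {
        var parser = new HtmlParser();
        // Same markup as in the demo; Parse(...) is the 0.9.x method name.
        var doc = parser.Parse(@"<div id=""content"">
                <a name=""someLink"" href=""#someLink"">Link 1</a>
                Some text that happens to be in between elements...
                <b>Some stuff in bold</b>
                Some more text
                <br>
            </div>");
        var content = doc.GetElementById("content");

        foreach (var node in content.ChildNodes)
        {
            if (node.NodeType == NodeType.Text)
            {
                // Text between tags; the whitespace-only nodes produced by
                // indentation and line breaks are skipped here.
                var text = node.TextContent.Trim();
                if (text.Length > 0)
                    Console.WriteLine("text: " + text);
            }
            else if (node.NodeType == NodeType.Element)
            {
                // Elements keep their identity, so the structure stays visible.
                Console.WriteLine(node.NodeName + ": " + node.TextContent.Trim());
            }
        }
    }
}
```

Because the loop visits the children in document order, each piece of text stays positioned between the elements that surround it, which is what the question needs.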

Answer 1: score 6, accepted, 2016-01-30 17:26:04
