
AngleSharp: normalize / fix HTML

I have this piece of HTML:

<div>
  Outside paragraph
  <p>In paragraph</p>
</div>

As you can see, the text "Outside paragraph" sits directly inside the div rather than inside a paragraph, which is not what I want.

Is there an AngleSharp method (or, failing that, one in any other library) that would let me normalize / fix this piece of HTML so that it looks like this:

<div>
  <p>Outside paragraph</p>
  <p>In paragraph</p>
</div>

In other words, I am looking for a piece of code that wraps "Outside paragraph" in a paragraph.

AngleSharp does not provide this kind of custom logic out of the box, but it gives you the means to roll your own normalization scheme.

In the following example I use a TreeWalker so that the iteration only visits text nodes.

For each text node, the code checks the conditions below and, when they hold, wraps the node in a new paragraph.

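// Namespaces used below (the snippet was written as a LINQPad-style script):
// using AngleSharp;
// using AngleSharp.Dom;

var context = BrowsingContext.New();
var document = await context.OpenAsync(res => res.Content("foo<div>Outside<p>Inside</p></div>bar"));
// Walk only the text nodes below <body>
var walker = document.CreateTreeWalker(document.Body, AngleSharp.Dom.FilterSettings.Text);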

while (walker.ToNext() != null)
{
    var current = walker.Current;

    // if just whitespace, e.g., formatting line breaks, or in p anyway - skip
    if (
        (current.TextContent.Trim().Length == 0) ||
        (current.ParentElement.LocalName == "p"))
    {
        continue;
    }
    // if next to paragraph perform the normalization
    else if (
        (current.PreviousSibling is IElement previous && previous.LocalName == "p") ||
        (current.NextSibling is IElement next && next.LocalName == "p"))
    {
        var newNode = document.CreateElement("p");
        current.ReplaceWith(newNode);
        newNode.Append(current);
    }
}

// Dump() is LINQPad's output helper; elsewhere use Console.WriteLine(document.Body.ToHtml())
document.Body.ToHtml().Dump();

The dumped result looks as follows:

<body>foo<div><p>Outside</p><p>Inside</p></div>bar</body>

This may not cover everything you need, but it should point you in the right direction.

Note: You can also write your own (recursive) iteration or use, e.g., a custom IMarkupFormatter to perform the normalization during serialization. There are multiple ways. The approach shown here changes the DOM, so further operations (not just serialization) remain possible afterwards.

Hope that helps!
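To illustrate the recursive alternative mentioned in the note, here is a minimal sketch that applies the same conditions as the TreeWalker version (non-whitespace text that is not already inside a p and sits directly next to a p sibling). The names ParagraphNormalizer and WrapLooseText are hypothetical helpers made up for this example, not part of AngleSharp, and the code assumes a recent AngleSharp version in which HtmlParser.ParseDocument is available.

```csharp
using System;
using System.Linq;
using AngleSharp;             // ToHtml() extension method
using AngleSharp.Dom;
using AngleSharp.Html.Parser;

static class ParagraphNormalizer
{
    // Hypothetical helper (not an AngleSharp API): wraps loose text that sits
    // directly next to a <p> sibling in a <p> of its own.
    public static void WrapLooseText(INode parent)
    {
        // Snapshot the children first, because the loop changes the child list.
        foreach (var child in parent.ChildNodes.ToArray())
        {
            if (child is IText text)
            {
                var besideParagraph =
                    (text.PreviousSibling is IElement before && before.LocalName == "p") ||
                    (text.NextSibling is IElement after && after.LocalName == "p");

                // Skip whitespace-only text such as formatting line breaks.
                if (text.Data.Trim().Length > 0 && besideParagraph)
                {
                    var paragraph = text.Owner.CreateElement("p");
                    parent.ReplaceChild(paragraph, text);
                    paragraph.AppendChild(text);
                }
            }
            // Do not descend into paragraphs; their text is already wrapped.
            else if (child is IElement element && element.LocalName != "p")
            {
                WrapLooseText(element);
            }
        }
    }

    static void Main()
    {
        var document = new HtmlParser().ParseDocument("foo<div>Outside<p>Inside</p></div>bar");
        WrapLooseText(document.Body);
        Console.WriteLine(document.Body.ToHtml());
    }
}
```

For the sample input this is meant to give the same markup as the TreeWalker version. The recursion makes it easy to skip whole subtrees (here, everything inside a p), at the cost of copying each child list before modifying it.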
