AngleSharp extracting formatted text

I'm wondering if it's possible to extract formatted text from an HTMLDocument using AngleSharp. I'm using the following code to extract the text. The problem is that the extracted text runs together: there is no break between the elements.

var parser = new HtmlParser();
var document = parser.Parse("<script>var x = 1;</script> <h1>Some example source</h1><p>This is a paragraph element</p>");
var text = document.Body.Text();

This returns the following text:

Some example sourceThis is a paragraph element

Ideally, I would like it to return "Some example source This is a paragraph element", with some separation between the text values of the nodes.

I know I am late to the party, but better late than never (and I hope someone else benefits from this answer).

The comments on the question are both right. On the one hand, the W3C specification and the document's source tell us that the (official) serialization won't contain any spaces. On the other hand, it is quite common to want to insert some spaces where applicable (or even newlines, e.g., when a <br> element is seen).

That said, the library does not know your specific use case (i.e., when you want to insert spaces). However, it can help you reach your desired result more easily.

Serialization from the DOM to a string is done via an instance of a class that implements IMarkupFormatter. The ToHtml() method of any DOM node accepts such an object and returns a string. For example:

var myFormatter = new MyMarkupFormatter();
var text = document.Body.ToHtml(myFormatter);

Now the question is reduced to an implementation of MyMarkupFormatter that works for us. This formatter essentially yields only the text nodes, but treats certain tags differently (i.e., it returns some text such as spaces or line breaks for them).

public class MyMarkupFormatter : IMarkupFormatter
{
    String IMarkupFormatter.Comment(IComment comment)
    {
        return String.Empty;
    }

    String IMarkupFormatter.Doctype(IDocumentType doctype)
    {
        return String.Empty;
    }

    String IMarkupFormatter.Processing(IProcessingInstruction processing)
    {
        return String.Empty;
    }

    String IMarkupFormatter.Text(ICharacterData text)
    {
        return text.Data;
    }

    String IMarkupFormatter.OpenTag(IElement element, Boolean selfClosing)
    {
        switch (element.LocalName)
        {
            case "p":
                return "\n\n";
            case "br":
                return "\n";
            case "span":
                return " ";
        }

        return String.Empty;
    }

    String IMarkupFormatter.CloseTag(IElement element, Boolean selfClosing)
    {
        return String.Empty;
    }

    String IMarkupFormatter.Attribute(IAttr attr)
    {
        return String.Empty;
    }
}

If stripping all non-text info is not what you need, AngleSharp also offers the PrettyMarkupFormatter out of the box. Maybe this is already quite close to what you wanted (a "prettier" markup formatter).

Hope this helps!
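For comparison, here is a minimal sketch of using the built-in PrettyMarkupFormatter. It assumes AngleSharp 0.10 or later, where the parser method is ParseDocument (the question's code uses the older Parse). Note that this formatter keeps the tags and only re-indents the markup, so it does not produce plain text.

```csharp
using System;
using AngleSharp;              // ToHtml(IMarkupFormatter) extension method
using AngleSharp.Html;         // PrettyMarkupFormatter
using AngleSharp.Html.Parser;  // HtmlParser

var parser = new HtmlParser();
var document = parser.ParseDocument(
    "<h1>Some example source</h1><p>This is a paragraph element</p>");

// Indentation and NewLine control how the formatter lays out nested elements.
var formatter = new PrettyMarkupFormatter
{
    Indentation = "  ",
    NewLine = "\n"
};

// Serializes the whole tree with tags intact, one element per indented line.
Console.WriteLine(document.DocumentElement.ToHtml(formatter));
```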

Here's my implementation of IMarkupFormatter. It improves upon Florian's example because it adds line breaks for any block-level element, not just paragraphs. It puts a line break before and after each block-level element, to ensure that text from block elements does not end up on the same line as text from other nodes. Just like the accepted answer, my implementation uses only one line break for <br> elements. Lastly, it does not add spaces for <span> elements or other inline elements. Instead, it preserves the whitespace that was already present in the original HTML string.

using AngleSharp;
using AngleSharp.Dom;

public class TextMarkupFormatter : IMarkupFormatter
{
    public string Text(ICharacterData text)
    {
        return text.Data;
    }

    public string LiteralText(ICharacterData text)
    {
        return "";
    }

    public string Comment(IComment comment)
    {
        return "";
    }

    public string Processing(IProcessingInstruction processing)
    {
        return "";
    }

    public string Doctype(IDocumentType doctype)
    {
        return "";
    }

    public string OpenTag(IElement element, bool selfClosing)
    {
        if (IsBlockLevelElement(element))
            return "\n";

        return "";
    }

    public string CloseTag(IElement element, bool selfClosing)
    {
        if (IsBlockLevelElement(element) || element.TagName == "BR")
            return "\n";

        return "";
    }

    private bool IsBlockLevelElement(IElement element)
    {
        switch (element.TagName)
        {
            case "ADDRESS":
            case "ARTICLE":
            case "ASIDE":
            case "BLOCKQUOTE":
            case "DETAILS":
            case "DIALOG":
            case "DD":
            case "DIV":
            case "DL":
            case "FIELDSET":
            case "FIGCAPTION":
            case "FIGURE":
            case "FOOTER":
            case "FORM":
            case "H1":
            case "H2":
            case "H3":
            case "H4":
            case "H5":
            case "H6":
            case "HEADER":
            case "HGROUP":
            case "HR":
            case "LI":
            case "MAIN":
            case "NAV":
            case "OL":
            case "P":
            case "PRE":
            case "SECTION":
            case "TABLE":
            case "UL":
                return true;

            default:
                return false;
        }
    }
}

If you're working with an HTML string rather than a full HTML document, you can parse and format it like so:

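using System;
using System.IO;
using AngleSharp.Html.Parser;

using var writer = new StringWriter();

// Parse the fragment (no context element) and serialize it with the text-only formatter
new HtmlParser().ParseFragment("Hello<div>World</div>", null).ToHtml(writer, new TextMarkupFormatter());

var text = writer.ToString().Trim();

Console.WriteLine(text); // Writes "Hello\nWorld"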
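Because a formatter that emits a line break before and after every block-level element produces runs of empty lines between adjacent blocks, you may want to tidy the result afterwards. The following is only a sketch of one way to do that with the standard library; TextCleanup and CollapseBlankLines are hypothetical names, not part of AngleSharp.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of AngleSharp) for post-processing extracted text.
public static class TextCleanup
{
    // Collapses a line break followed by one or more empty or
    // whitespace-only lines into a single "\n", then trims both ends.
    public static string CollapseBlankLines(string text)
    {
        return Regex.Replace(text, @"\n(?:[ \t]*\n)+", "\n").Trim();
    }
}
```

For example, TextCleanup.CollapseBlankLines("\nSome example source\n\nThis is a paragraph element\n") returns "Some example source\nThis is a paragraph element". If you would rather keep paragraph spacing, replace runs of three or more line breaks with "\n\n" instead.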
