
AngleSharp extracting formatted text

I'm wondering if it's possible to extract formatted text from an HTMLDocument using AngleSharp. I'm using the following code to extract the text. The problem is that the extracted text runs together: there is no break between the text of one element and the next.

var parser = new HtmlParser();
var document = parser.Parse("<script>var x = 1;</script> <h1>Some example source</h1><p>This is a paragraph element</p>");
var text = document.Body.Text();

This returns the following text:

Some example sourceThis is a paragraph element

Ideally I would like it to return "Some example source This is a paragraph element", with some separation between the text values of the individual nodes.

I know I am late to the party, but better late than never (also I hope someone else benefits from this answer).

The comments on the question are both right. On the one hand, the W3C specification and the document's source tell us that there won't be any space in the (official) serialization. On the other hand, it is quite a common requirement to "integrate" some spaces where applicable (or maybe even newlines, e.g., when a <br> element is seen).

That said, the library does not know your specific use case (i.e., where you want spaces inserted). However, it can help you reach the desired result more easily.

Serialization from the DOM to a string is done via an instance of a class that implements IMarkupFormatter. The ToHtml() method of any DOM node accepts such an object and returns a string. For example:

var myFormatter = new MyMarkupFormatter();
var text = document.Body.ToHtml(myFormatter);

Now the question is reduced to an implementation of MyMarkupFormatter that works for us. This formatter essentially yields only the text nodes, but treats certain tags differently (i.e., it returns some text such as a space or a newline for them).

public class MyMarkupFormatter : IMarkupFormatter
{
    String IMarkupFormatter.Comment(IComment comment)
    {
        return String.Empty;
    }

    String IMarkupFormatter.Doctype(IDocumentType doctype)
    {
        return String.Empty;
    }

    String IMarkupFormatter.Processing(IProcessingInstruction processing)
    {
        return String.Empty;
    }

    String IMarkupFormatter.Text(ICharacterData text)
    {
        return text.Data;
    }

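    String IMarkupFormatter.OpenTag(IElement element, Boolean selfClosing)
    {
        // Emit a separator instead of markup for selected tags.
        switch (element.LocalName)
        {
            case "p":
                // Paragraphs are set apart by a blank line.
                return "\n\n";
            case "br":
                // An explicit line break becomes a single newline.
                return "\n";
            case "span":
                // Inline spans are separated by a single space.
                return " ";
        }

        // Every other tag contributes nothing to the output.
        return String.Empty;
    }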

    String IMarkupFormatter.CloseTag(IElement element, Boolean selfClosing)
    {
        return String.Empty;
    }

    String IMarkupFormatter.Attribute(IAttr attr)
    {
        return String.Empty;
    }
}

If stripping all non-text info is not what you need then AngleSharp also offers the PrettyMarkupFormatter out of the box - maybe this is already quite close to what you wanted (a "prettier" markup formatter).
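As a rough illustration of that alternative, the sketch below runs the question's markup through PrettyMarkupFormatter. It assumes AngleSharp 0.10 or later, where the parser is in AngleSharp.Html.Parser and exposes ParseDocument (the question's Parse call is the older equivalent). The class and method names PrettyPrintExample and Prettify are made up for this example. Note that the result is still markup, only indented, so it does not strip the tags.

```csharp
using System;
using AngleSharp;             // ToHtml(IMarkupFormatter) extension method
using AngleSharp.Html;        // PrettyMarkupFormatter
using AngleSharp.Html.Parser; // HtmlParser (assumes AngleSharp 0.10 or later)

// Hypothetical wrapper class, only here to make the sketch self-contained.
public static class PrettyPrintExample
{
    // Parses the markup and serializes the body with indentation and line breaks.
    public static string Prettify(string html)
    {
        var document = new HtmlParser().ParseDocument(html);
        return document.Body.ToHtml(new PrettyMarkupFormatter());
    }

    public static void Main()
    {
        var html = "<h1>Some example source</h1><p>This is a paragraph element</p>";
        Console.WriteLine(Prettify(html));
    }
}
```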

Hope this helps!

Here's my implementation of IMarkupFormatter. It improves upon Florian's example because it adds line breaks for any block-level element, not just paragraphs. It puts a line break before and after each block-level element, to ensure text from block elements is not put on the same line as text from other nodes. Just like the accepted answer, my implementation uses only one line break for <br> elements. Lastly, it does not add spaces to <span> elements or other inline elements. Instead, it preserves whitespace that was already present in the original HTML string.

using AngleSharp;
using AngleSharp.Dom;

public class TextMarkupFormatter : IMarkupFormatter
{
    public string Text(ICharacterData text)
    {
        return text.Data;
    }

    public string LiteralText(ICharacterData text)
    {
        return "";
    }

    public string Comment(IComment comment)
    {
        return "";
    }

    public string Processing(IProcessingInstruction processing)
    {
        return "";
    }

    public string Doctype(IDocumentType doctype)
    {
        return "";
    }

    public string OpenTag(IElement element, bool selfClosing)
    {
        if (IsBlockLevelElement(element))
            return "\n";

        return "";
    }

    public string CloseTag(IElement element, bool selfClosing)
    {
        if (IsBlockLevelElement(element) || element.TagName == "BR")
            return "\n";

        return "";
    }

    private bool IsBlockLevelElement(IElement element)
    {
        switch (element.TagName)
        {
            case "ADDRESS":
            case "ARTICLE":
            case "ASIDE":
            case "BLOCKQUOTE":
            case "DETAILS":
            case "DIALOG":
            case "DD":
            case "DIV":
            case "DL":
            case "FIELDSET":
            case "FIGCAPTION":
            case "FIGURE":
            case "FOOTER":
            case "FORM":
            case "H1":
            case "H2":
            case "H3":
            case "H4":
            case "H5":
            case "H6":
            case "HEADER":
            case "HGROUP":
            case "HR":
            case "LI":
            case "MAIN":
            case "NAV":
            case "OL":
            case "P":
            case "PRE":
            case "SECTION":
            case "TABLE":
            case "UL":
                return true;

            default:
                return false;
        }
    }
}

If you're working with HTML in a string, and not a full HTML document, you can parse and format it like so:

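using System;
using System.IO;               // StringWriter
using AngleSharp;              // ToHtml
using AngleSharp.Html.Parser;  // HtmlParser (AngleSharp 0.10 and later)

using var writer = new StringWriter();

// Serialize the parsed fragment through the text-only formatter.
new HtmlParser().ParseFragment("Hello<div>World</div>", null).ToHtml(writer, new TextMarkupFormatter());

// Trim the line breaks emitted around the outermost block elements.
var text = writer.ToString().Trim();

Console.WriteLine(text); // Writes "Hello\nWorld"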
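Because TextMarkupFormatter writes a line break both before and after every block-level element, adjacent or nested blocks produce runs of consecutive line breaks in the raw output. If that is unwanted, the text can be tidied afterwards. The following is only a sketch of one possible cleanup step: TextCleanup and CollapseBlankLines are hypothetical names, not part of AngleSharp, and the choice to collapse every run to a single line break is an assumption about the desired output.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of AngleSharp.
public static class TextCleanup
{
    // Replaces every run of two or more line breaks (including any spaces or
    // tabs directly before each of them) with a single "\n", then trims
    // leading and trailing whitespace.
    public static string CollapseBlankLines(string text)
    {
        return Regex.Replace(text, @"[ \t]*\n(?:[ \t]*\n)+", "\n").Trim();
    }

    public static void Main()
    {
        // Shaped like formatter output for an <h1> followed directly by a <p>.
        var raw = "\nSome example source\n\nThis is a paragraph element\n";
        Console.WriteLine(CollapseBlankLines(raw));
        // Prints the two sentences on two consecutive lines.
    }
}
```

To keep one blank line between blocks instead, the replacement string could be "\n\n".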
