
Jsoup.parse().text() adds unwanted whitespace

I'm trying to clean a String by removing all HTML tags from it. This is my code:

System.out.println("Result:" + Jsoup.parse("Dani<div></div>el").text());    

the result is

Result:Dani el

but it should be Result:Daniel

Stepping through the Jsoup code, I see that the "problem" is in this method of org.jsoup.nodes.Element:

public String text() {
    final StringBuilder accum = new StringBuilder();
    new NodeTraversor(new NodeVisitor() {
        public void head(Node node, int depth) {
            if (node instanceof TextNode) {
                TextNode textNode = (TextNode) node;
                appendNormalisedText(accum, textNode);
            } else if (node instanceof Element) {
                Element element = (Element) node;
                if (accum.length() > 0 &&
                    (element.isBlock() || element.tag.getName().equals("br")) &&
                    !TextNode.lastCharIsWhitespace(accum))
                    accum.append(" ");
            }
        }

        public void tail(Node node, int depth) {
        }
    }).traverse(this);
    return accum.toString().trim();
}

The space comes from the accum.append(" "); call, which runs whenever a block element or a br is reached and the accumulated text does not already end in whitespace. Clearly, in some circumstances it is convenient for a block-level HTML tag to add a space to the text version, but in other cases it is not. In my case the result is wrong.

I think it would be good for the text() method to take a boolean parameter, preserveWhiteSpaces, that enables or disables the accum.append(" "); line. I hope a Jsoup developer will consider this request; I have seen that other people have this problem with whitespace too.
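To make the proposal concrete, here is a minimal sketch of how such a flag could behave. It is a toy model in plain Java, not jsoup code: the TextSketch and ToyNode classes and the separateBlocks parameter are hypothetical names invented for this illustration, and the br special case and the whitespace normalisation of text nodes are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of the traversal in Element.text(); none of these names exist in jsoup.
final class TextSketch {

    // Hypothetical stand-in for a DOM node: either a text node or an element.
    static final class ToyNode {
        final String text;      // non-null only for text nodes
        final boolean block;    // true for block-level elements such as <div>
        final List<ToyNode> children = new ArrayList<>();

        private ToyNode(String text, boolean block) {
            this.text = text;
            this.block = block;
        }

        static ToyNode text(String text) {
            return new ToyNode(text, false);
        }

        static ToyNode element(boolean block, ToyNode... children) {
            ToyNode node = new ToyNode(null, block);
            node.children.addAll(Arrays.asList(children));
            return node;
        }
    }

    // separateBlocks = true mimics the space insertion at block boundaries;
    // false skips the accum.append(' ') step entirely.
    static String text(ToyNode root, boolean separateBlocks) {
        StringBuilder accum = new StringBuilder();
        collect(root, separateBlocks, accum);
        return accum.toString().trim();
    }

    // Depth-first walk: each node is handled before its children.
    private static void collect(ToyNode node, boolean separateBlocks, StringBuilder accum) {
        if (node.text != null) {
            accum.append(node.text);
        } else if (separateBlocks
                && node.block
                && accum.length() > 0
                && !Character.isWhitespace(accum.charAt(accum.length() - 1))) {
            accum.append(' ');
        }
        for (ToyNode child : node.children) {
            collect(child, separateBlocks, accum);
        }
    }

    public static void main(String[] args) {
        // Models Dani<div></div>el inside a block-level body element.
        ToyNode body = ToyNode.element(true,
                ToyNode.text("Dani"), ToyNode.element(true), ToyNode.text("el"));
        System.out.println("Result:" + text(body, true));   // prints Result:Dani el
        System.out.println("Result:" + text(body, false));  // prints Result:Daniel
    }
}
```

With the flag on, the empty div triggers the space because the accumulated text "Dani" does not end in whitespace; with the flag off, the two text nodes are simply concatenated.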

Any good ideas for solving this without changing the Jsoup sources are welcome.

I'm trying to clean a String by removing all HTML tags from it,

You want to use the clean() method.

SAMPLE CODE

System.out.println("Result:" + Jsoup.clean("Dani<div></div>el", Whitelist.none()));

OUTPUT

Result:Daniel

In my case, however, the result is wrong. Do you have any suggestions for solving this problem?

You can instantiate a NodeTraversor with a custom NodeVisitor.

Just to give you an idea:

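// Imports needed by this snippet
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

// Concatenates the raw text of every TextNode under the element,
// without inserting a space at block-level element boundaries.
private static String toText(Element element) {
    final StringBuilder accum = new StringBuilder();
    // On jsoup versions where this constructor is not available, the static
    // NodeTraversor.traverse(visitor, element) performs the same traversal.
    new NodeTraversor(new NodeVisitor() {
        public void head(Node node, int depth) {
            if (node instanceof TextNode) {
                TextNode textNode = (TextNode) node;
                // getWholeText() returns the text as-is, without whitespace normalisation
                accum.append(textNode.getWholeText());
            } else if (node instanceof Element) {
                // Do nothing: this is where text() would append a space
            }
        }

        public void tail(Node node, int depth) {
        }
    }).traverse(element);

    return accum.toString().trim();
}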

SAMPLE CODE

public static void main(String[] args) {
    System.out.println("Result:" + toText(Jsoup.parse("Dani<div></div>el")));
}

OUTPUT

Result:Daniel

