Jsoup.parse().text() adds unwanted spaces

Question

I'm trying to clean a String by removing all HTML tags from it. This is my code:

System.out.println("Result:" + Jsoup.parse("Dani<div></div>el").text());

The result is:

Result:Dani el

but it should be Result:Daniel.

Following the Jsoup code, I see that the "problem" is in this method of org.jsoup.nodes.Element:

public String text() {
    final StringBuilder accum = new StringBuilder();
    new NodeTraversor(new NodeVisitor() {
        public void head(Node node, int depth) {
            if (node instanceof TextNode) {
                TextNode textNode = (TextNode) node;
                appendNormalisedText(accum, textNode);
            } else if (node instanceof Element) {
                Element element = (Element) node;
                if (accum.length() > 0 &&
                    (element.isBlock() || element.tag.getName().equals("br")) &&
                    !TextNode.lastCharIsWhitespace(accum))
                    accum.append(" ");
            }
        }

        public void tail(Node node, int depth) {
        }
    }).traverse(this);
    return accum.toString().trim();
}

At some point it calls accum.append(" ");. Clearly, in some circumstances it is convenient for a block-level HTML tag to add a space in the corresponding text version, but in some cases this is not what you want. In my case, in fact, the result is wrong.

I think it would be good for the text() method to take a boolean parameter, preserveWhiteSpaces, that enables or disables the execution of the line accum.append(" ");. I hope a Jsoup developer will consider this request: I have seen that other people also have this problem with whitespace.

If someone has a good idea for solving the problem without changing the Jsoup sources, it is welcome.
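
To make the behaviour described in the question concrete, here is a minimal sketch of the same traversal logic using only the JDK. It is a toy model, not jsoup code: the Node class, the text(nodes, separateBlocks) method, and the separateBlocks flag are hypothetical names invented for illustration, with the flag playing the role of the boolean parameter the question proposes.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for the traversal in Element.text(); this is NOT jsoup code. */
class BlockSpacingSketch {

    /** Hypothetical flattened node: either a text run or an element. */
    static final class Node {
        final String text;    // non-null for text nodes, null for elements
        final boolean block;  // true for block-level elements such as <div>

        private Node(String text, boolean block) {
            this.text = text;
            this.block = block;
        }

        static Node text(String value) { return new Node(value, false); }

        static Node element(boolean isBlock) { return new Node(null, isBlock); }
    }

    /**
     * Concatenates text nodes in document order. When separateBlocks is true,
     * one space is appended on reaching a block element, provided the buffer
     * is non-empty and does not already end in whitespace. This mirrors the
     * accum.append(" ") branch quoted in the question.
     */
    static String text(List<Node> nodes, boolean separateBlocks) {
        StringBuilder accum = new StringBuilder();
        for (Node node : nodes) {
            if (node.text != null) {
                accum.append(node.text);
            } else if (separateBlocks
                    && node.block
                    && accum.length() > 0
                    && !Character.isWhitespace(accum.charAt(accum.length() - 1))) {
                accum.append(' ');
            }
        }
        return accum.toString().trim();
    }

    public static void main(String[] args) {
        // Models "Dani<div></div>el": text, empty block element, text.
        List<Node> nodes = new ArrayList<>();
        nodes.add(Node.text("Dani"));
        nodes.add(Node.element(true));
        nodes.add(Node.text("el"));

        System.out.println("Result:" + text(nodes, true));   // Result:Dani el
        System.out.println("Result:" + text(nodes, false));  // Result:Daniel
    }
}
```

With the flag on, the empty div between the two text runs triggers the space; with it off, the text runs are simply joined.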

Answer 1

I'm trying to clean a String removing all html tags from it,

You want to use the clean() method.

SAMPLE CODE

System.out.println("Result:" + Jsoup.clean("Dani<div></div>el", Whitelist.none()));

OUTPUT

Result:Daniel

In my case, however, the result is wrong. Do you have any suggestions to solve this problem?

You can instantiate a NodeTraversor with a custom NodeVisitor.

Just to give you an idea:

private static String toText(Element element) {
    final StringBuilder accum = new StringBuilder();
    new NodeTraversor(new NodeVisitor() {
        public void head(Node node, int depth) {
            if (node instanceof TextNode) {
                TextNode textNode = (TextNode) node;
                accum.append(textNode.getWholeText());
            } else if (node instanceof Element) {
                // Do nothing ...
            }
        }

        public void tail(Node node, int depth) {
        }
    }).traverse(element);

    return accum.toString().trim();
}

SAMPLE CODE

public static void main(String[] args) {
    System.out.println("Result:" + toText(Jsoup.parse("Dani<div></div>el")));
}

OUTPUT

Result:Daniel
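
One caveat with the toText idea: TextNode.getWholeText() returns the text without whitespace normalisation, so line breaks and runs of spaces in the source HTML are kept as they are, whereas text() collapses them. If that matters for your input, a post-processing step along these lines might help. This is only a sketch using plain java.lang.String methods; collapseWhitespace is a hypothetical helper name chosen for illustration.

```java
/** Sketch of a whitespace clean-up step for text collected via getWholeText(). */
class WhitespaceSketch {

    /**
     * Collapses every run of whitespace to a single space and trims both ends.
     * Java's \s matches space, tab, newline, vertical tab, form feed and
     * carriage return by default; it does not match U+00A0 (non-breaking space).
     */
    static String collapseWhitespace(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String raw = "  Hello\n    wide\t\tworld ";
        System.out.println("[" + collapseWhitespace(raw) + "]"); // [Hello wide world]
    }
}
```

You would call it on the string returned by toText, for example collapseWhitespace(toText(element)).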

Answer 1 status: accepted (2016-03-31 21:34:55)
