
Jsoup.parse().text() adds undesired whitespace

I'm trying to clean a String by removing all HTML tags from it. This is my code:

System.out.println("Result:" + Jsoup.parse("Dani<div></div>el").text());    

The result is:

Result:Dani el

but it should be Result:Daniel

Following the jsoup source code, I see that the "problem" is in this method of org.jsoup.nodes.Element:

public String text() {
    final StringBuilder accum = new StringBuilder();
    new NodeTraversor(new NodeVisitor() {
        public void head(Node node, int depth) {
            if (node instanceof TextNode) {
                TextNode textNode = (TextNode) node;
                appendNormalisedText(accum, textNode);
            } else if (node instanceof Element) {
                Element element = (Element) node;
                if (accum.length() > 0 &&
                    (element.isBlock() || element.tag.getName().equals("br")) &&
                    !TextNode.lastCharIsWhitespace(accum))
                    accum.append(" ");
            }
        }

        public void tail(Node node, int depth) {
        }
    }).traverse(this);
    return accum.toString().trim();
}

At some point this method calls accum.append(" ");. Clearly, in some circumstances it is convenient for a block-level HTML tag to add a space to the corresponding text version, but in other cases it is not. In my case, the result is in fact wrong.

I think it would be good for the text() method to take a boolean parameter, preserveWhiteSpaces, that enables or disables the execution of the line accum.append(" ");. I hope a jsoup developer can consider this request: I have seen that other people also have this problem with whitespace.
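To make the behavior and the proposed switch concrete, here is a minimal sketch in plain Java. It does not use jsoup: the Node class, the text(root, separateBlocks) method, and the separateBlocks flag are hypothetical names invented for illustration. The flag only gates the space that text() inserts before block elements and br. Unlike jsoup, the sketch does not normalise whitespace inside text nodes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical mini node model for illustration; these are NOT jsoup classes.
class TextSketch {

    /** Either a text node (tag == null) or an element with children. */
    static final class Node {
        final String tag;     // null for a text node
        final String text;    // null for an element
        final boolean block;  // true for block-level elements such as div
        final List<Node> children = new ArrayList<>();

        private Node(String tag, String text, boolean block) {
            this.tag = tag;
            this.text = text;
            this.block = block;
        }

        static Node text(String text) {
            return new Node(null, text, false);
        }

        static Node element(String tag, boolean block, Node... kids) {
            Node node = new Node(tag, null, block);
            node.children.addAll(Arrays.asList(kids));
            return node;
        }
    }

    /** Concatenates all text; a space before block elements is added only if separateBlocks is true. */
    static String text(Node root, boolean separateBlocks) {
        StringBuilder accum = new StringBuilder();
        collect(root, separateBlocks, accum);
        return accum.toString().trim();
    }

    private static void collect(Node node, boolean separateBlocks, StringBuilder accum) {
        if (node.tag == null) {
            accum.append(node.text);
            return;
        }
        boolean breaks = node.block || node.tag.equals("br");
        // Same condition as in the quoted method, gated by the assumed flag.
        if (separateBlocks && breaks && accum.length() > 0
                && !Character.isWhitespace(accum.charAt(accum.length() - 1))) {
            accum.append(' ');
        }
        for (Node child : node.children) {
            collect(child, separateBlocks, accum);
        }
    }

    public static void main(String[] args) {
        // Models the parsed body of "Dani<div></div>el".
        Node body = Node.element("body", true,
                Node.text("Dani"), Node.element("div", true), Node.text("el"));
        System.out.println("Result:" + text(body, true));   // Result:Dani el
        System.out.println("Result:" + text(body, false));  // Result:Daniel
    }
}
```

With the flag on, the empty div triggers the separator exactly as in the original example. With it off, adjacent text nodes are joined directly.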

If someone has a good idea for solving the problem without changing the jsoup sources, it is welcome.

"I'm trying to clean a String removing all html tags from it,"

You want to use the clean() method.

SAMPLE CODE

System.out.println("Result:" + Jsoup.clean("Dani<div></div>el", Whitelist.none()));

OUTPUT

Result:Daniel

"In my case, however, the result is wrong. Do you have any suggestions for solving this problem?"

You can instantiate a NodeTraversor with a custom NodeVisitor.

Just to give you an idea:

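import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

// Collects the raw text of every TextNode under the element. Elements contribute
// nothing, so no separator is inserted for block tags such as <div>.
private static String toText(Element element) {
    final StringBuilder accum = new StringBuilder();
    // Note: later jsoup releases deprecate this constructor in favor of the
    // static NodeTraversor.traverse(visitor, root).
    new NodeTraversor(new NodeVisitor() {
        public void head(Node node, int depth) {
            if (node instanceof TextNode) {
                TextNode textNode = (TextNode) node;
                // getWholeText() returns the text without whitespace normalisation.
                accum.append(textNode.getWholeText());
            } else if (node instanceof Element) {
                // Do nothing ...
            }
        }

        public void tail(Node node, int depth) {
        }
    }).traverse(element);

    return accum.toString().trim();
}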

SAMPLE CODE

public static void main(String[] args) {
    System.out.println("Result:" + toText(Jsoup.parse("Dani<div></div>el")));
}

OUTPUT

Result:Daniel
