How do I preserve line breaks when using jsoup to convert html to plain text?

I have the following code:

 import org.jsoup.Jsoup;

 public class NewClass {
     public String noTags(String str) {
         return Jsoup.parse(str).text();
     }

     public static void main(String[] args) {
         String strings="<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN \">" +
         "<HTML> <HEAD> <TITLE></TITLE> <style>body{ font-size: 12px;font-family: verdana, arial, helvetica, sans-serif;}</style> </HEAD> <BODY><p><b>hello world</b></p><p><br><b>yo</b> <a href=\"http://google.com\">googlez</a></p></BODY> </HTML> ";

         NewClass text = new NewClass();
         System.out.println(text.noTags(strings));
     }
 }

And I get this result:

hello world yo googlez

But I want the line break preserved:

hello world
yo googlez

I have looked at jsoup's TextNode#getWholeText() but I can't figure out how to use it.

If there's a <br> in the markup I parse, how can I get a line break in my resulting output?

The real solution that preserves line breaks should be like this:

public static String br2nl(String html) {
    if(html==null)
        return html;
    Document document = Jsoup.parse(html);
    document.outputSettings(new Document.OutputSettings().prettyPrint(false));//makes html() preserve linebreaks and spacing
    document.select("br").append("\\n");
    document.select("p").prepend("\\n\\n");
    String s = document.html().replaceAll("\\\\n", "\n");
    return Jsoup.clean(s, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
}

It satisfies the following requirements:

  1. if the original html contains a newline (\n), it is preserved
  2. if the original html contains br or p tags, they get translated to newlines (\n).

Jsoup.clean(unsafeString, "", Whitelist.none(), new OutputSettings().prettyPrint(false));

We're using this method here:

public static String clean(String bodyHtml,
                       String baseUri,
                       Whitelist whitelist,
                       Document.OutputSettings outputSettings)

By passing it Whitelist.none() we make sure that all HTML is removed.

By passing new OutputSettings().prettyPrint(false) we make sure that the output is not reformatted and line breaks are preserved.

With

Jsoup.parse("A\nB").text();

you get the output

"A B"

and not

A

B

For this I'm using:

String descrizione = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2n")).text();
String text = descrizione.replaceAll("br2n", "\n");
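
A side note on the regular expression in this snippet: `<br[^>]*>` matches any tag whose name merely starts with "br", not only `<br>`. The sketch below compares it with a stricter variant that adds a word boundary after the tag name. The class name, the `STRICT` pattern, and the made-up `<brand>` tag are assumptions for illustration only; they are not part of the answer above.

```java
import java.util.regex.Matcher;

class BrPatternDemo {
    // Pattern from the snippet above: matches any tag whose name starts with "br".
    static final String LOOSE = "(?i)<br[^>]*>";
    // Assumed stricter variant: \b requires a non-word character right after "br".
    static final String STRICT = "(?i)<br\\b[^>]*>";

    // Replace every match of the pattern with a literal replacement string.
    static String replaceBr(String html, String pattern, String replacement) {
        return html.replaceAll(pattern, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String html = "a<br>b<BR/>c<br />d<brand>e";
        System.out.println(replaceBr(html, LOOSE, "|"));  // a|b|c|d|e
        System.out.println(replaceBr(html, STRICT, "|")); // a|b|c|d<brand>e
    }
}
```

For ordinary HTML the two patterns behave the same, so the stricter one only matters if the input can contain unusual tag names.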

Try this by using jsoup:

public static String cleanPreserveLineBreaks(String bodyHtml) {

    // get pretty printed html with preserved br and p tags
    String prettyPrintedBodyFragment = Jsoup.clean(bodyHtml, "", Whitelist.none().addTags("br", "p"), new OutputSettings().prettyPrint(true));
    // get plain text with preserved line breaks by disabled prettyPrint
    return Jsoup.clean(prettyPrintedBodyFragment, "", Whitelist.none(), new OutputSettings().prettyPrint(false));
}

As of jsoup v1.11.2, we can use Element.wholeText().

Example code:

String cleanString = Jsoup.parse(htmlString).wholeText();

user121196's answer still works. But wholeText() preserves the alignment of texts.

You can traverse a given element:

public String convertNodeToText(Element element)
{
    final StringBuilder buffer = new StringBuilder();

    new NodeTraversor(new NodeVisitor() {
        boolean isNewline = true;

        @Override
        public void head(Node node, int depth) {
            if (node instanceof TextNode) {
                TextNode textNode = (TextNode) node;
                String text = textNode.text().replace('\u00A0', ' ').trim();                    
                if(!text.isEmpty())
                {                        
                    buffer.append(text);
                    isNewline = false;
                }
            } else if (node instanceof Element) {
                Element element = (Element) node;
                if (!isNewline)
                {
                    if((element.isBlock() || element.tagName().equals("br")))
                    {
                        buffer.append("\n");
                        isNewline = true;
                    }
                }
            }                
        }

        @Override
        public void tail(Node node, int depth) {                
        }                        
    }).traverse(element);        

    return buffer.toString();               
}

And for your code:

String result = convertNodeToText(Jsoup.parse(html));

text = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2n")).text();
text = text.replaceAll("br2n", "\n");

works if the html itself doesn't contain "br2n".

So,

text = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "<pre>\n</pre>")).text();

works more reliably and is easier.

Try this by using jsoup:

    doc.outputSettings(new OutputSettings().prettyPrint(false));

    //select all <br> tags and append \n after that
    doc.select("br").after("\\n");

    //select all <p> tags and prepend \n before that
    doc.select("p").before("\\n");

    //get the HTML from the document, and retaining original new lines
    String str = doc.html().replaceAll("\\\\n", "\n");

Use textNodes() to get a list of the text nodes. Then concatenate them with \n as separator. Here's some Scala code I use for this; a Java port should be easy:

val rawTxt = doc.body().getElementsByTag("div").first.textNodes()
                    .asScala.mkString("<br />\n")

This is my version of translating html to text (actually a modified version of user121196's answer).

This doesn't just preserve line breaks; it also formats the text, removes excessive line breaks, and unescapes HTML entities, so you get a much better result from your HTML (in my case I'm receiving it from mail).

It's originally written in Scala, but you can change it to Java easily.

def html2text( rawHtml : String ) : String = {

    val htmlDoc = Jsoup.parseBodyFragment( rawHtml, "/" )
    htmlDoc.select("br").append("\\nl")
    htmlDoc.select("div").prepend("\\nl").append("\\nl")
    htmlDoc.select("p").prepend("\\nl\\nl").append("\\nl\\nl")

    org.jsoup.parser.Parser.unescapeEntities(
        Jsoup.clean(
          htmlDoc.html(),
          "",
          Whitelist.none(),
          new org.jsoup.nodes.Document.OutputSettings().prettyPrint(true)
        ),false
    ).
    replaceAll("\\\\nl", "\n").
    replaceAll("\r","").
    replaceAll("\n\\s+\n","\n").
    replaceAll("\n\n+","\n\n").     
    trim()      
}
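
As a rough illustration of the "change it to Java" remark, the newline clean-up at the end of the Scala method can be ported with plain `String` calls. This is only a sketch of that final `replaceAll`/`trim` chain; the class and method names are made up here, and the jsoup part of the method is left out.

```java
class NewlineCleanup {
    // Java port of the final replaceAll/trim chain from the Scala method above.
    static String normalize(String text) {
        return text
                .replaceAll("\r", "")          // drop carriage returns
                .replaceAll("\n\\s+\n", "\n")  // collapse gaps made only of whitespace
                .replaceAll("\n\n+", "\n\n")   // cap any remaining newline run at two
                .trim();                        // strip leading and trailing whitespace
    }

    public static void main(String[] args) {
        System.out.println(normalize("first\r\n \t\r\nsecond\n\nthird  "));
    }
}
```

One thing to be aware of: because `\s` also matches newlines, the second pattern already collapses a run of three or more newlines to a single newline, so only an exact double newline survives as a paragraph break.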

Based on the other answers and the comments on this question, it seems that most people coming here are really looking for a general solution that will provide a nicely formatted plain text representation of an HTML document. I know I was.

Fortunately jsoup already provides a pretty comprehensive example of how to achieve this: HtmlToPlainText.java

The example FormattingVisitor can easily be tweaked to your preference and deals with most block elements and line wrapping.

To avoid link rot, here is Jonathan Hedley's solution in full:

package org.jsoup.examples;

import org.jsoup.Jsoup;
import org.jsoup.helper.StringUtil;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

import java.io.IOException;

/**
 * HTML to plain-text. This example program demonstrates the use of jsoup to convert HTML input to lightly-formatted
 * plain-text. That is divergent from the general goal of jsoup's .text() methods, which is to get clean data from a
 * scrape.
 * <p>
 * Note that this is a fairly simplistic formatter -- for real world use you'll want to embrace and extend.
 * </p>
 * <p>
 * To invoke from the command line, assuming you've downloaded the jsoup jar to your current directory:</p>
 * <p><code>java -cp jsoup.jar org.jsoup.examples.HtmlToPlainText url [selector]</code></p>
 * where <i>url</i> is the URL to fetch, and <i>selector</i> is an optional CSS selector.
 * 
 * @author Jonathan Hedley, jonathan@hedley.net
 */
public class HtmlToPlainText {
    private static final String userAgent = "Mozilla/5.0 (jsoup)";
    private static final int timeout = 5 * 1000;

    public static void main(String... args) throws IOException {
        Validate.isTrue(args.length == 1 || args.length == 2, "usage: java -cp jsoup.jar org.jsoup.examples.HtmlToPlainText url [selector]");
        final String url = args[0];
        final String selector = args.length == 2 ? args[1] : null;

        // fetch the specified URL and parse to a HTML DOM
        Document doc = Jsoup.connect(url).userAgent(userAgent).timeout(timeout).get();

        HtmlToPlainText formatter = new HtmlToPlainText();

        if (selector != null) {
            Elements elements = doc.select(selector); // get each element that matches the CSS selector
            for (Element element : elements) {
                String plainText = formatter.getPlainText(element); // format that element to plain text
                System.out.println(plainText);
            }
        } else { // format the whole doc
            String plainText = formatter.getPlainText(doc);
            System.out.println(plainText);
        }
    }

    /**
     * Format an Element to plain-text
     * @param element the root element to format
     * @return formatted text
     */
    public String getPlainText(Element element) {
        FormattingVisitor formatter = new FormattingVisitor();
        NodeTraversor traversor = new NodeTraversor(formatter);
        traversor.traverse(element); // walk the DOM, and call .head() and .tail() for each node

        return formatter.toString();
    }

    // the formatting rules, implemented in a depth-first DOM traverse
    private class FormattingVisitor implements NodeVisitor {
        private static final int maxWidth = 80;
        private int width = 0;
        private StringBuilder accum = new StringBuilder(); // holds the accumulated text

        // hit when the node is first seen
        public void head(Node node, int depth) {
            String name = node.nodeName();
            if (node instanceof TextNode)
                append(((TextNode) node).text()); // TextNodes carry all user-readable text in the DOM.
            else if (name.equals("li"))
                append("\n * ");
            else if (name.equals("dt"))
                append("  ");
            else if (StringUtil.in(name, "p", "h1", "h2", "h3", "h4", "h5", "tr"))
                append("\n");
        }

        // hit when all of the node's children (if any) have been visited
        public void tail(Node node, int depth) {
            String name = node.nodeName();
            if (StringUtil.in(name, "br", "dd", "dt", "p", "h1", "h2", "h3", "h4", "h5"))
                append("\n");
            else if (name.equals("a"))
                append(String.format(" <%s>", node.absUrl("href")));
        }

        // appends text to the string builder with a simple word wrap method
        private void append(String text) {
            if (text.startsWith("\n"))
                width = 0; // reset counter if starts with a newline. only from formats above, not in natural text
            if (text.equals(" ") &&
                    (accum.length() == 0 || StringUtil.in(accum.substring(accum.length() - 1), " ", "\n")))
                return; // don't accumulate long runs of empty spaces

            if (text.length() + width > maxWidth) { // won't fit, needs to wrap
                String words[] = text.split("\\s+");
                for (int i = 0; i < words.length; i++) {
                    String word = words[i];
                    boolean last = i == words.length - 1;
                    if (!last) // insert a space if not the last word
                        word = word + " ";
                    if (word.length() + width > maxWidth) { // wrap and reset counter
                        accum.append("\n").append(word);
                        width = word.length();
                    } else {
                        accum.append(word);
                        width += word.length();
                    }
                }
            } else { // fits as is, without need to wrap text
                accum.append(text);
                width += text.length();
            }
        }

        @Override
        public String toString() {
            return accum.toString();
        }
    }
}

For more complex HTML none of the above solutions worked quite right; I was able to do the conversion successfully while preserving line breaks with:

Document document = Jsoup.parse(myHtml);
String text = new HtmlToPlainText().getPlainText(document);

(version 1.10.3)

Try this:

public String noTags(String str){
    Document d = Jsoup.parse(str);
    TextNode tn = new TextNode(d.body().html(), "");
    return tn.getWholeText();
}

/**
 * Recursive method to replace html br with java \n. The recursion ensures that the linebreaker placeholder never already exists in the text being replaced.
 * @param html the html in which to replace br tags
 * @param linebreakerString temporary placeholder that stands in for a line break
 * @return the text of the html as a String with proper java newlines instead of br
 */
public static String replaceBrWithNewLine(String html, String linebreakerString){
    String result = "";
    if(html.contains(linebreakerString)){
        result = replaceBrWithNewLine(html, linebreakerString+"1");
    } else {
        result = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", linebreakerString)).text(); // replace any html line breaks with the placeholder
        result = result.replace(linebreakerString, "\n"); // literal replacement, so regex metacharacters in the placeholder are harmless
    }
    return result;
}

Call it with the html in question (containing the br tags), along with whatever string you wish to use as the temporary newline placeholder. For example:

replaceBrWithNewLine(element.html(), "br2n")

The recursion ensures that the string you use as the line break placeholder never actually occurs in the source html, because it keeps appending a "1" until the placeholder string is not found in the html. It won't have the formatting issue that the Jsoup.clean methods seem to encounter with special characters.
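
The same placeholder search can also be written as a loop. The sketch below shows only the placeholder handling with the standard library; the jsoup text extraction is left out, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;

class PlaceholderDemo {
    // Hypothetical helper: extend the seed with "1" until it no longer occurs in the input.
    static String uniquePlaceholder(String html, String seed) {
        String placeholder = seed;
        while (html.contains(placeholder)) {
            placeholder += "1";
        }
        return placeholder;
    }

    // Swap every placeholder back to a newline, treating the placeholder as a literal.
    static String restoreNewlines(String text, String placeholder) {
        return text.replace(placeholder, "\n");
    }

    public static void main(String[] args) {
        String html = "uses br2n already<br>second line";
        String placeholder = uniquePlaceholder(html, "br2n"); // "br2n1" for this input
        String marked = html.replaceAll("(?i)<br[^>]*>", Matcher.quoteReplacement(placeholder));
        System.out.println(restoreNewlines(marked, placeholder));
    }
}
```

Using `String.replace` for the final swap keeps the placeholder from being interpreted as a regular expression, which matters if someone picks a placeholder containing characters such as `.` or `$`.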

Based on user121196's and Green Beret's answers with the selects and <pre>s, the only solution which works for me is:

org.jsoup.nodes.Element elementWithHtml = ....
elementWithHtml.select("br").append("<pre>\n</pre>");
elementWithHtml.select("p").prepend("<pre>\n\n</pre>");
elementWithHtml.text();
