
How to keep line breaks when using Jsoup.parse?

This is not a duplicate. There was a similar question, but none of its answers can deal with a real HTML file. You can save any HTML page, even this one, and try any of the solutions from that question; none of them solves the problem completely.


The question is

I have a saved .htm file on my desktop. I need to extract the plain text from it. However, I need to keep the line breaks so that the text does not end up on just one or two lines.

I tried the following and all methods from here

        FileInputStream in = new FileInputStream("C:\\...myfile.htm");
        String htmlText = IOUtils.toString(in);
        for (String line : htmlText.split("\n")) {
            String stripped = Jsoup.parse(line).text();
            System.out.println(stripped);
        }

This preserves only the line breaks of the HTML source file itself. The text is still messed up, because tags such as <br> and <p> get removed. How can I parse the file so that the text keeps all its natural line breaks?

This is a difference I've noticed between jsoup and, say, Selenium: when extracting text, Selenium keeps the line breaks and jsoup does not. With that said, I think the best route is to get the inner HTML of the node you are trying to extract text from, then do a replaceAll on that inner HTML to replace <br> and <p> with line breaks.
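One catch with the replaceAll idea: if the result is passed back through jsoup's text(), whitespace gets normalized and the inserted line breaks collapse into spaces again. A minimal sketch of one way around that is to replace the tags with a placeholder token first and turn the token into real newlines only after the text has been extracted. The class name BreakMarkers, the token [[BR]], and the sample HTML are hypothetical, and the crude regex tag strip in main is only a stand-in so the sketch runs without jsoup on the classpath; in real use that step would be Jsoup.parse(marked).text().

```java
import java.util.regex.Pattern;

class BreakMarkers {
    // Hypothetical placeholder; pick something that cannot occur in the real text.
    static final String MARK = "[[BR]]";

    // Replace <br>, <br/>, </br> and closing </p> tags with the placeholder.
    static String mark(String innerHtml) {
        return innerHtml.replaceAll("(?i)</?br\\s*/?>|</p\\s*>", MARK);
    }

    // After text extraction, turn each placeholder (plus surrounding
    // whitespace) into a single newline and trim the ends.
    static String restore(String text) {
        return text.replaceAll("\\s*" + Pattern.quote(MARK) + "\\s*", "\n").trim();
    }

    public static void main(String[] args) {
        String innerHtml = "<p>First line<br>second line</p><p>Third line</p>";
        String marked = mark(innerHtml);

        // Stand-in for Jsoup.parse(marked).text(): a naive tag strip,
        // used here only to keep the example dependency-free.
        String text = marked.replaceAll("<[^>]+>", " ");

        System.out.println(restore(text));
    }
}
```

The placeholder survives text extraction because it is ordinary text, not markup. Add more tags to the pattern in mark (for example closing div or li tags) if your pages need them.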

As a more complete solution: instead of reading the file line by line, you can traverse the parsed HTML tree directly. Your best bet is to walk the tree with a recursive function. When you hit a TextNode, append its text to the stripped variable from your example. When you hit a <p> or <br> element, append a line feed as needed.

Something like:

Document doc = Jsoup.parse(htmlText);

Then pass that in a recursive function for each child node:

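// needs: import org.jsoup.nodes.Element; import org.jsoup.nodes.Node; import org.jsoup.nodes.TextNode;
String getText(Element parentElement) {
     StringBuilder working = new StringBuilder();
     for (Node child : parentElement.childNodes()) {
          if (child instanceof TextNode) {
              // cast so that TextNode.text() is available
              working.append(((TextNode) child).text());
          } else if (child instanceof Element) {
              Element childElement = (Element) child;
              // do more of these for p or other tags you want a new line for
              if (childElement.tag().getName().equalsIgnoreCase("br")) {
                   working.append("\n");
              }
              // recurse into the element's own children
              working.append(getText(childElement));
          }
     }

     return working.toString();
 }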

Then you can just call the function to strip the text.

 String strippedText = getText(doc);

Not the simplest solution, but it's the one I can think of that should work if you want to extract all the text from an HTML document. I haven't run this code, I just wrote it now, so I apologize if I missed something. It should give you the general idea.
