Parsing HTML into formatted plaintext using jsoup

I am working on a Maven project that parses HTML from a website. I was able to parse it with the code below:

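// Needs org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element,
// java.io.IOException, java.util.logging.Level and java.util.logging.Logger.
public void parseData() {
    String url = "http://stackoverflow.com/help/on-topic";
    try {
        // Fetch the page and parse it into a DOM.
        Document doc = Jsoup.connect(url).get();
        // first() returns null when nothing matches the selector.
        Element essay = doc.select("div.col-section").first();
        if (essay != null) {
            // text() flattens all nested markup into a single string.
            jTextAreaAdem.setText(essay.text());
        }
    } catch (IOException ex) {
        Logger.getLogger(formAdem.class.getName()).log(Level.SEVERE, null, ex);
    }
}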

So far I have no problems: I can parse the HTML. I am using jsoup's select method with "div.col-section", which matches a div element whose class is col-section, and I want to show the result in a text area. The result I get is one huge paragraph, even though the content on the website is split into several paragraphs. How can I parse the data so that it keeps the same layout as on the website?

The text is not formatted because the formatting lives in the HTML itself, in tags such as <p> and <ol>. Calling .text() on a block element drops those tags and returns only the combined text, so the paragraph and list breaks are lost.

Jsoup ships an example HTML to plain text converter (org.jsoup.examples.HtmlToPlainText) which you can adapt to your needs by passing it the div element instead of the whole document.
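
To give an idea of what such a converter does, here is a much-reduced sketch in the same spirit. It is not jsoup's own example class: the class name VisitorPlainText, the method render, and the choice of which tags produce line breaks are all my own assumptions for illustration. It walks the selected element with a NodeVisitor, copies text nodes, and emits line breaks when block-level tags close.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

// Hypothetical helper, not part of jsoup.
public class VisitorPlainText {

    // Renders the first element matching selector as plain text.
    // Returns an empty string when nothing matches.
    static String render(String html, String selector) {
        Element root = Jsoup.parse(html).select(selector).first();
        if (root == null) {
            return "";
        }
        StringBuilder out = new StringBuilder();
        root.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    // Copy the text itself (whitespace-normalized by jsoup).
                    out.append(((TextNode) node).text());
                } else if (node.nodeName().equals("li")) {
                    // Assumed list style: one dash per item.
                    out.append("- ");
                }
            }

            @Override
            public void tail(Node node, int depth) {
                String name = node.nodeName();
                if (name.equals("p")) {
                    out.append("\n\n");   // blank line after a paragraph
                } else if (name.equals("li") || name.equals("br")
                        || name.equals("ol") || name.equals("ul")) {
                    out.append("\n");     // single break after these tags
                }
            }
        });
        return out.toString().trim();
    }
}
```

With the question's code you would call something like jTextAreaAdem.setText(VisitorPlainText.render(doc.html(), "div.col-section")). Headings, tables, and links are not handled here; the real HtmlToPlainText example covers more cases.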

Alternatively, you could just select "div.col-section > *", iterate through each resulting Element, and print its text followed by a newline.
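
A minimal sketch of that second approach, assuming the container holds mostly <p>, <ol> and <ul> children. The class name BlockTextExample, the method toPlainText, and the "- " list prefix are hypothetical names and choices of mine, not anything from jsoup or from the question.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper, not part of jsoup.
public class BlockTextExample {

    // Joins the text of each direct child of the matched container,
    // separating blocks with a blank line. List items get one line each.
    static String toPlainText(Document doc, String containerSelector) {
        List<String> blocks = new ArrayList<>();
        for (Element child : doc.select(containerSelector + " > *")) {
            String tag = child.tagName();
            if (tag.equals("ol") || tag.equals("ul")) {
                List<String> items = new ArrayList<>();
                for (Element li : child.children()) {
                    items.add("- " + li.text());
                }
                if (!items.isEmpty()) {
                    blocks.add(String.join("\n", items));
                }
            } else if (child.hasText()) {
                // Any other child is treated as one paragraph.
                blocks.add(child.text());
            }
        }
        return String.join("\n\n", blocks);
    }

    // Convenience overload for HTML that is already in memory.
    static String toPlainText(String html, String containerSelector) {
        return toPlainText(Jsoup.parse(html), containerSelector);
    }
}
```

In the question's parseData method, the setText call would then become jTextAreaAdem.setText(BlockTextExample.toPlainText(doc, "div.col-section")). This only looks one level deep, so nested divs still come out as a single block each.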
