
How to preserve the meaning of tags like <br>, <ul>, <li>, <p> etc. when reading them in Java using the jsoup library?

I am writing a program that extracts certain information from local HTML files. That information is then shown in a Java JFrame and exported to an Excel file. (I am using the jsoup 1.9.2 library for the HTML parsing.)

The issue I am running into is that whenever I extract anything from an HTML file, jsoup does not take tags such as <br>, <p> and <li> into account, so all the information comes back as one big chunk of text without any newlines or formatting.

To show you an example, if this is the data that I want to read:

Title

Line 1

Line 2

    Unordered List
  • element 1
  • element 2

The data comes back as:

Title Line 1 Line 2 Unordered List element 1 element 2 (i.e., all the HTML tags are ignored)

This is the piece of code that I am using to read it in:

private String getTitle(Document doc) { // doc is the parsed local HTML file
    Elements title = doc.select(".title");
    for (Element id : title) {
        return id.text(); // returns the first match; text() drops all the markup
    }
    return "No Title Available ";
}

Can anyone suggest a way to preserve the meaning of the HTML tags, so that I can both display the data in the JFrame and export it to Excel in a more readable format?

Thanks.

Just to give everyone an update: I was able to find a solution (more of a workaround) to the formatting issue. I now extract the complete HTML with id.html() and store it in a String. Then I call the String method replaceAll() with a regular expression that gets rid of all the HTML tags without pushing everything onto a single line. The call looks like replaceAll("\\<[^>]*>",""). My whole processHTML() function looks like this:

private String processHTML(String initial) { // initial is the String with all the HTML tags
    String modified = initial;
    modified = modified.replaceAll("\\<[^>]*>", ""); // regular expression that strips every tag
    modified = modified.trim(); // get rid of any unwanted space before and after the needed data
    // The replace() calls below decode the HTML entities that might be left in the
    // data extracted from the HTML. replace() matches literal text, so no regex is needed.
    modified = modified.replace("&nbsp;", " ");
    modified = modified.replace("&lt;", "<");
    modified = modified.replace("&gt;", ">");
    modified = modified.replace("&quot;", "\"");
    modified = modified.replace("&apos;", "'");
    modified = modified.replace("&cent;", "¢");
    modified = modified.replace("&copy;", "©");
    modified = modified.replace("&reg;", "®");
    // "&amp;" is decoded last so that text such as "&amp;quot;" is not decoded twice
    modified = modified.replace("&amp;", "&");
    return modified;
}
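One limitation of this workaround is that it only keeps whatever line breaks happen to be in the HTML string, so a <br> or <li> sitting in the middle of a line is lost. A possible refinement, as a minimal sketch only, is to turn those tags into newlines and bullets before stripping the rest. The class name TagAwareText, the method toPlainText() and the list of tags treated as line-ending are all assumptions made for illustration. It uses plain java.util.regex, not any jsoup API, and leaves entity decoding to processHTML() above.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of jsoup): maps a few tags to plain-text
// layout, then strips whatever markup is left.
public class TagAwareText {
    private static final Pattern BREAK = Pattern.compile("(?i)<br\\s*/?>");
    private static final Pattern LIST_ITEM = Pattern.compile("(?i)<li\\b[^>]*>");
    // Assumed set of closing tags that should end a line
    private static final Pattern BLOCK_END =
            Pattern.compile("(?i)</(p|div|li|ul|ol|h[1-6])\\s*>");
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    public static String toPlainText(String html) {
        // Whitespace in the HTML source carries no layout meaning, so collapse it first
        String text = html.replaceAll("\\s+", " ");
        text = BREAK.matcher(text).replaceAll("\n");            // <br> becomes a newline
        text = LIST_ITEM.matcher(text).replaceAll("\n\u2022 "); // <li> becomes a bullet on a new line
        text = BLOCK_END.matcher(text).replaceAll("\n");        // closing block tags end the line
        text = ANY_TAG.matcher(text).replaceAll("");            // drop every remaining tag
        text = text.replaceAll(" *\n *", "\n");                 // trim spaces around line breaks
        text = text.replaceAll("\n{2,}", "\n");                 // collapse blank lines
        return text.trim();
    }

    public static void main(String[] args) {
        String html = "<h1>Title</h1><p>Line 1<br>Line 2</p>"
                + "<ul><li>element 1</li><li>element 2</li></ul>";
        System.out.println(toPlainText(html));
    }
}
```

For the sample input in main(), this prints Title, Line 1, Line 2 and the two bulleted elements on separate lines. The result can then go through processHTML() to decode the entities. Regular expressions will not cope with every kind of markup (comments or script blocks, for example), so this is only meant for simple, well-formed fragments like the ones in my files.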

Thank you all again for helping me with this.

Cheers.

