I am writing a program that extracts certain information from local HTML files. That information is then shown in a Java JFrame and exported to an Excel file. (I am using the JSoup 1.9.2 library for the HTML parsing.)
I am running into an issue: whenever I extract anything from an HTML file, JSoup does not take HTML tags such as break tags, line tags, etc. into account, so all the information is extracted as one big chunk of data without any newlines or formatting.
For example, if this is the data that I want to read:
Title
Line 1
Line 2
Unordered List
- element 1
- element 2
The data comes back as:
Title Line 1 Line 2 Unordered List element 1 element 2 (i.e. all the HTML tags are ignored)
This is the piece of code that I am using to read it in:
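private String getTitle(Document doc) { // doc is the parsed local HTML file
    Elements title = doc.select(".title");
    for (Element id : title) {
        // text() returns the combined, whitespace-normalized text, so the tags are lost here
        return id.text(); // only the first match is used
    }
    return "No Title Available ";
}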
Can anyone suggest a way to preserve the meaning of the HTML tags, so that I can both display the data in the JFrame and export it to Excel in a more readable format?
Thanks.
Just to give everyone an update: I was able to find a solution (more like a workaround) to the formatting issue. What I am doing now is extracting the complete HTML using id.html(), which I store in a String object. Then I use the String function replaceAll() with a regular expression to get rid of all the HTML tags without pushing everything onto a single line. The replaceAll() call looks something like replaceAll("<[^>]*>", ""). My whole processHTML() function looks something like:
private String processHTML(String initial) { // initial is the String with all the HTML tags
    String modified = initial;
    modified = modified.replaceAll("<[^>]*>", ""); // regular expression that strips every tag
    modified = modified.trim(); // get rid of any unwanted space before and after the needed data
    // The replaceAll() calls below decode any HTML entities that might be left in the data extracted from the HTML.
    // &amp; is decoded last so that text such as "&amp;quot;" is not decoded twice.
    modified = modified.replaceAll("&nbsp;", " ");
    modified = modified.replaceAll("&lt;", "<");
    modified = modified.replaceAll("&gt;", ">");
    modified = modified.replaceAll("&quot;", "\"");
    modified = modified.replaceAll("&apos;", "'");
    modified = modified.replaceAll("&cent;", "¢");
    modified = modified.replaceAll("&copy;", "©");
    modified = modified.replaceAll("&reg;", "®");
    modified = modified.replaceAll("&amp;", "&");
    return modified;
}
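One limitation of stripping every tag is that tags that carry layout, such as <br> or <li>, disappear along with the rest, so any line breaks that survive depend on how the HTML string happened to be formatted. As a minimal sketch, not a definitive implementation, the variant below first turns a few structural tags into newlines or list dashes, then strips the remaining tags, then decodes entities in a single pass. It uses only java.util.regex. The class name HtmlToText, the method names toPlainText and decodeEntities, the choice of tags and the small entity table are all assumptions made for illustration. Regular expressions are not a full HTML parser, so unusual markup may need more rules.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; not part of JSoup or of the original program.
class HtmlToText {

    // Named entities this sketch knows about; anything else is left untouched.
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("nbsp", " ");
        ENTITIES.put("lt", "<");
        ENTITIES.put("gt", ">");
        ENTITIES.put("amp", "&");
        ENTITIES.put("quot", "\"");
        ENTITIES.put("apos", "'");
        ENTITIES.put("cent", "\u00A2");
        ENTITIES.put("copy", "\u00A9");
        ENTITIES.put("reg", "\u00AE");
    }

    private static final Pattern ENTITY = Pattern.compile("&([a-zA-Z]+);");

    // Converts an HTML fragment to plain text while keeping line structure.
    static String toPlainText(String html) {
        String text = html;
        // Structural tags become newlines (or "- " for list items) before stripping.
        text = text.replaceAll("(?i)<br\\s*/?>", "\n");
        text = text.replaceAll("(?i)<li(\\s[^>]*)?>", "\n- ");
        text = text.replaceAll("(?i)</(p|div|h[1-6]|ul|ol|tr)\\s*>", "\n");
        // Every remaining tag is removed, as in processHTML().
        text = text.replaceAll("<[^>]*>", "");
        // Entities are decoded after stripping, so "&lt;b&gt;" survives as "<b>".
        text = decodeEntities(text);
        // Collapse blank lines and drop spaces around line breaks.
        text = text.replaceAll("[ \\t]*\\n\\s*", "\n");
        return text.trim();
    }

    // Decodes known named entities in one pass, so "&amp;lt;" becomes "&lt;", not "<".
    static String decodeEntities(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = ENTITIES.get(m.group(1));
            if (replacement == null) {
                replacement = m.group(); // unknown entity: keep it as written
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<h1>Title</h1><p>Line 1<br>Line 2</p>"
                + "<h2>Unordered List</h2><ul><li>element 1</li><li>element 2</li></ul>";
        System.out.println(toPlainText(html));
    }
}
```

Called on markup equivalent to the example in the question, toPlainText should give one item per line with "- " in front of each list element. The returned String can be used wherever the result of processHTML() is used.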
Thank you all again for helping me with this.
Cheers.