I am using JSoup to parse an HTML file and remove elements that aren't valid in XML, because I need to apply XSLT to the file. The issue I am running into is the "&nbsp;" entities that exist in my document. I need to change them to the numeric character reference '&#160;' so that I can run the XSLT on the file.
So I want:
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
to be:
<p>&#160;</p>
<p>&#160;</p>
<p>&#160;</p>
<p>&#160;</p>
I tried using a text replace but it didn't work:
Elements els = doc.body().getAllElements();
for (Element e : els) {
    List<TextNode> tnList = e.textNodes();
    for (TextNode tn : tnList) {
        String orig = tn.text();
        tn.text(orig.replaceAll("&nbsp;", "&#160;"));
    }
}
Code that performs the parsing:
File f = new File ("C:/Users/jrothst/Desktop/Test File.htm");
Document doc = Jsoup.parse(f, "UTF-8");
doc.outputSettings().syntax( Document.OutputSettings.Syntax.xml );
System.out.println("Starting parse..");
performConversion(doc);
String html = doc.toString();
System.out.println(html);
FileUtils.writeStringToFile(f, doc.outerHtml(), "UTF-8");
How can I make those changes happen using the JSoup libraries?
The following worked for me. You don't need to do any manual search and replace:
File f = new File ("C:/Users/seanbright/Desktop/Test File.htm");
Document doc = Jsoup.parse(f, "UTF-8");
doc.outputSettings()
.syntax(Document.OutputSettings.Syntax.xml)
.escapeMode(Entities.EscapeMode.xhtml);
System.out.println(doc.toString());
Input:
<html><head></head><body>&nbsp;</body></html>
Output:
<html><head></head><body>&#xa0;</body></html>
(&#xa0; is the same thing as &#160;, only in hexadecimal instead of decimal.)
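To illustrate the note above, here is a minimal sketch in plain Java (no JSoup required) that decodes both spellings of the reference and checks that they name the same code point, U+00A0 (the non-breaking space). The class name NbspReferences and the helper decodeNumericReference are hypothetical and exist only for this illustration; they are not part of JSoup.

```java
public class NbspReferences {

    // Hypothetical helper: decodes a single numeric character reference,
    // either decimal ("&#160;") or hexadecimal ("&#xa0;"), to its code point.
    public static int decodeNumericReference(String ref) {
        if (!ref.startsWith("&#") || !ref.endsWith(";")) {
            throw new IllegalArgumentException("Not a numeric character reference: " + ref);
        }
        // Strip the leading "&#" and the trailing ";".
        String digits = ref.substring(2, ref.length() - 1);
        if (digits.startsWith("x") || digits.startsWith("X")) {
            // Hexadecimal form: parse what follows the "x" in base 16.
            return Integer.parseInt(digits.substring(1), 16);
        }
        // Decimal form.
        return Integer.parseInt(digits, 10);
    }

    public static void main(String[] args) {
        int fromDecimal = decodeNumericReference("&#160;");
        int fromHex = decodeNumericReference("&#xa0;");

        // Both references resolve to the same code point.
        System.out.println(fromDecimal == fromHex);
        // And that code point is the non-breaking space.
        System.out.println(fromHex == '\u00A0');
    }
}
```

Since an XML parser resolves both forms to the same character, the XSLT step should see no difference between the hexadecimal output shown above and the decimal form asked for in the question.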