How to change '&nbsp;' to '&#160;' in HTML using JSoup

Question

I am using JSoup to parse an HTML file and remove elements that aren't valid in XML, because I need to apply XSLT to the file. The issue I am running into is the "&nbsp;" entities that exist in my document. I need to change them to the numeric character reference '&#160;' so that I can run the XSLT on the file.

So I want:

<p> &nbsp; </p> 
<p> &nbsp; </p> 
<p> &nbsp; </p> 
<p> &nbsp; </p>

To Be:

<p> &#160; </p> 
<p> &#160; </p> 
<p> &#160; </p> 
<p> &#160; </p>

I tried using a text replace but it didn't work:

Elements els = doc.body().getAllElements();
for (Element e : els) {
    List<TextNode> tnList = e.textNodes();
    for (TextNode tn : tnList){
        String orig = tn.text();
        tn.text(orig.replaceAll("&nbsp;","&#160;")); 
    }
}

Code that performs the parsing:

File f = new File ("C:/Users/jrothst/Desktop/Test File.htm");

Document doc = Jsoup.parse(f, "UTF-8");
doc.outputSettings().syntax( Document.OutputSettings.Syntax.xml );  
System.out.println("Starting parse..");
performConversion(doc);

String html = doc.toString();
System.out.println(html);
FileUtils.writeStringToFile(f, doc.outerHtml(), "UTF-8");

How can I make those changes happen using the JSoup libraries?

Answer 1

The following worked for me. You don't need to do any manual search and replace:

File f = new File ("C:/Users/seanbright/Desktop/Test File.htm");

Document doc = Jsoup.parse(f, "UTF-8");
doc.outputSettings()
    .syntax(Document.OutputSettings.Syntax.xml)
    .escapeMode(Entities.EscapeMode.xhtml);

System.out.println(doc.toString());

Input:

<html><head></head><body>&nbsp;</body></html>

Output:

<html><head></head><body>&#xa0;</body></html>

(&#xa0; is the same thing as &#160;, only in hexadecimal instead of decimal)
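An XML parser treats the hexadecimal and decimal forms identically, so the XSLT step should not need anything further. If the decimal form is still wanted in the written file, one option is to post-process the serialized string. The following is a minimal sketch using only the Java standard library; the class name NumericEntityDemo and the helper hexRefsToDecimal are hypothetical and not part of JSoup, and the regex is applied to the whole string, so it would also rewrite matching text inside comments or CDATA sections.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of JSoup.
public class NumericEntityDemo {

    // Matches hexadecimal character references such as &#xa0; or &#XA0;
    private static final Pattern HEX_REF = Pattern.compile("&#[xX]([0-9a-fA-F]+);");

    /** Rewrites every hexadecimal character reference as its decimal equivalent. */
    static String hexRefsToDecimal(String markup) {
        Matcher m = HEX_REF.matcher(markup);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Parse the hex digits, then emit the same code point in decimal.
            int codePoint = Integer.parseInt(m.group(1), 16);
            m.appendReplacement(out, "&#" + codePoint + ";");
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Both references name the same code point, U+00A0 (no-break space).
        System.out.println(Integer.parseInt("a0", 16) == 160); // prints true

        // The output string from the answer above, converted to decimal form.
        System.out.println(hexRefsToDecimal("<html><head></head><body>&#xa0;</body></html>"));
        // prints <html><head></head><body>&#160;</body></html>
    }
}
```

In the parsing code from the question, this would be applied to the string returned by doc.outerHtml() before it is written back to the file.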

Answer 1: 3 ACCEPTED 2016-07-26 18:06:00
