Formatting issue with HTML while using JSoup for Java

Question

I'm trying to scrape "text" off of a website with JSoup. I can get the text cleanly (with no formatting at all, just the text), or with all the formatting still attached (i.e. along with <p> and <br>).

However, I can't seem to get the formatted version to include <br/> to any extent, and that's the only thing that was specifically requested to go along with the text.

For example, I can get this:

<p><br>Worldwide database</p>

and this:

Worldwide database

but I can't get this, which is my desired result:

Worldwide database<br/>

I don't see any <br/>'s while looking at the HTML code via the Firebug plugin on Firefox, so I'm wondering if that might be the issue? Or maybe there's an issue with the methods I'm using in my code to pull the text?

Anyways, here's my code:

Elements descriptionHTML = doc.select("div[jsname]"); // <-- Get access to the text w/ JSoup
String descText = descriptionHTML.text();             // <-- Get the text w/o any formatting at all

// This prints out the desired text with the <p><br> and </p>, but no <br/>
for (Element link : descriptionHTML) {
    String jsname = link.attr("jsname");
    if (jsname.equals("C4s9Ed")) {
        System.out.println(link);
        break;
    }
}

I'd really appreciate any help with this issue.

Thanks, Jack

Answer 1

HTML does not define a closing tag for <br> elements. XHTML, however, requires that the tag be marked as empty: <br />. Jsoup parses both, but will print out only normal HTML (<br>).

If you use the XML parser in Jsoup, the <br> tags are not closed, so Jsoup tries to guess where to place matching closing tags (</br>), which are neither HTML nor XHTML compliant.

If you want to keep the line break info and strip out all other tags, I think you need to program that part outside of Jsoup. You could, for example, replace all <br> and <br /> strings with some other unique string, say "_brSplitPos_", then parse the document with Jsoup, print out the text only, and replace "_brSplitPos_" with <br />:

String html = "<div>This<br>is<br />a<br>test</div>";
html = html.replaceAll("<br\\s*/?>", "_brSplitPos_"); // matches <br>, <br/> and <br />
Document docH = Jsoup.parse(html);
String onlyText = docH.text();
onlyText = onlyText.replace("_brSplitPos_", "<br />");
System.out.println(onlyText);
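
The placeholder round trip can also be tried without Jsoup on the classpath. The sketch below is only an illustration under stated assumptions: the class name BrPlaceholder and its helper methods are hypothetical, and the ANY_TAG regex is a crude stand-in for Jsoup's text() call, not a replacement for a real parser. It mainly shows a line-break pattern that also covers <br/> and upper-case <BR>.

```java
import java.util.regex.Pattern;

class BrPlaceholder {
    // Hypothetical marker; any string that cannot occur in the page text works.
    static final String MARKER = "_brSplitPos_";

    // Matches <br>, <br/>, <br /> and upper-case variants such as <BR>.
    private static final Pattern BR =
            Pattern.compile("<br\\s*/?>", Pattern.CASE_INSENSITIVE);

    // Assumption: crude stand-in for Jsoup's text(); drops anything that looks like a tag.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    // Step 1: hide every line break behind the marker so tag stripping cannot remove it.
    static String protect(String html) {
        return BR.matcher(html).replaceAll(MARKER);
    }

    // Step 3: turn the markers back into line-break tags.
    static String restore(String text) {
        return text.replace(MARKER, "<br />");
    }

    // Steps 1 to 3 combined; step 2 (tag stripping) is where Jsoup would normally be used.
    static String keepOnlyBreaks(String html) {
        String marked = protect(html);
        String textOnly = ANY_TAG.matcher(marked).replaceAll("");
        return restore(textOnly);
    }

    public static void main(String[] args) {
        // prints: Worldwide database<br />
        System.out.println(keepOnlyBreaks("<p>Worldwide database<br></p>"));
    }
}
```

In real code the middle step would stay with Jsoup (parse the marked string and call text()), since a regex cannot handle entities, comments, or script content reliably.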

solution1
1 ACCEPTED 2015-12-05 11:13:59
