Formatting issue with HTML while using JSoup for Java

I'm trying to scrape text from a website with JSoup. I can get the text cleanly (no formatting at all, just the text), or with all the formatting still attached (i.e. <br> along with <p> and </p>).

However, I can't get the formatted version to include <br/> at all, and that's the only thing that was specifically requested to go along with the text.

For example, I can get this:

<p><br>Worldwide database</p>

and this:

Worldwide database

but I can't get this, which is my desired result:

Worldwide database<br/>

I don't see any <br/> tags when inspecting the HTML with the Firebug plugin for Firefox, so I'm wondering if that might be the issue. Or maybe the problem is with the methods I'm using in my code to pull the text?

Anyways, here's my code:

import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// doc is the already-parsed org.jsoup.nodes.Document for the page
Elements descriptionHTML = doc.select("div[jsname]"); // select every div that has a jsname attribute
String descText = descriptionHTML.text(); // the text without any formatting at all

// This prints the desired text with the <p><br> and </p>, but no <br/>
for (Element link : descriptionHTML) {
    String jsname = link.attr("jsname");
    if (jsname.equals("C4s9Ed")) {
        System.out.println(link);
        break;
    }
}

I'd really appreciate any help with this issue.

Thanks, Jack

HTML does not define a closing tag for <br> elements. XHTML, however, requires that the tag be marked as empty: <br />. JSoup parses both forms, but with its default output settings it prints only plain HTML (<br>).

If you use Jsoup's XML parser instead, an unclosed <br> is not treated as an empty element, so Jsoup guesses where to place a matching closing tag </br>, and the result is neither valid HTML nor valid XHTML.

If you want to keep the line-break information and strip out all other tags, I think you need to handle that part outside of Jsoup. For example, you could replace every <br> and <br /> with a unique placeholder string, say "_brSplitPos_", then parse the document with JSoup, extract only the text, and finally replace each "_brSplitPos_" with <br />:

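import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String html = "<div>This<br>is<br />a<br>test</div>";
// Swap <br>, <br/> and <br /> for a placeholder that survives text extraction
html = html.replaceAll("<br\\s*/?>", "_brSplitPos_");
Document docH = Jsoup.parse(html);
String onlyText = docH.text(); // text only, all tags removed
// Turn the placeholders back into line-break tags
onlyText = onlyText.replace("_brSplitPos_", "<br />");
System.out.println(onlyText);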
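
The placeholder approach above could also be wrapped in a small reusable helper. The following is only a sketch: it uses nothing but the Java standard library so that it runs without Jsoup on the classpath, and the class name BrPreserver, the method names, and the naive stripTags stand-in are all hypothetical and not part of Jsoup. It also assumes the marker string never occurs in the real page text.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup.
final class BrPreserver {
    // Matches <br>, <br/> and <br /> in any letter case.
    private static final Pattern BR = Pattern.compile("(?i)<br\\s*/?>");
    // Assumption: this marker does not appear anywhere in the page text.
    private static final String MARKER = "_brSplitPos_";

    // Replaces every line-break tag with the marker, runs the supplied
    // HTML-to-text step, then turns each marker back into <br />.
    static String keepLineBreaks(String html, UnaryOperator<String> toText) {
        String marked = BR.matcher(html).replaceAll(MARKER);
        String text = toText.apply(marked);
        return text.replace(MARKER, "<br />");
    }

    // Naive stand-in for a real HTML-to-text step; it only deletes anything
    // that looks like a tag and does not decode entities.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String html = "<div>This<br>is<br />a<BR/>test</div>";
        // prints: This<br />is<br />a<br />test
        System.out.println(keepLineBreaks(html, BrPreserver::stripTags));
    }
}
```

With Jsoup available, the text-extraction step passed in would be something like h -> Jsoup.parse(h).text() rather than the regex-based stripTags, which is too fragile for real pages.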
