
Parsing html with Jsoup and removing spans with certain style

I'm writing an app for a friend, but I ran into a problem: the website contains spans like this one:

<span style="display:none">&amp;0000000000000217000000</span>

We have no idea what they are, but I need them removed because my app is outputting their values.

Is there any way I can check whether one of these is in the Elements and remove it? I have a for-each loop doing the parsing, but I can't figure out how to remove this element effectively.

thanks

If you want to remove those spans completely based on the style attribute, try this code:

String html = "<span style=\"display:none\">&amp;0000000000000217000000</span>";
html += "<span style=\"display:none\">&amp;1111111111111111111111111</span>";
html += "<p>Test paragraph should not be removed</p>";

Document doc = Jsoup.parse(html);

doc.select("span[style*=display:none]").remove();

System.out.println(doc);

Here is the output:

<html>
 <head></head>
 <body>
  <p>Test paragraph should not be removed</p>
 </body>
</html>
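
The selector above does a literal substring match on the style attribute, so a style written as "display: none" (with a space) would not match it. As a rough sketch of one way to cope with that, the helper below parses the style value itself. The class name HiddenStyle and the method hidesElement are made up for this illustration and are not part of Jsoup; the comment at the top shows how the helper might be combined with Jsoup's select, attr and remove.

```java
import java.util.Locale;

// Hypothetical helper, not part of Jsoup. Possible use with Jsoup (assumption):
//   for (Element span : doc.select("span[style]")) {
//       if (HiddenStyle.hidesElement(span.attr("style"))) span.remove();
//   }
class HiddenStyle {

    // Returns true if the inline style contains a "display: none" declaration,
    // ignoring case and whitespace around the property name and value.
    static boolean hidesElement(String style) {
        if (style == null) {
            return false;
        }
        for (String declaration : style.split(";")) {
            String[] parts = declaration.split(":", 2);
            if (parts.length != 2) {
                continue; // not a "property: value" pair
            }
            String property = parts[0].trim().toLowerCase(Locale.ROOT);
            String value = parts[1].trim().toLowerCase(Locale.ROOT);
            if (property.equals("display") && value.equals("none")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hidesElement("display:none"));                 // true
        System.out.println(hidesElement("color: red; Display : None ;")); // true
        System.out.println(hidesElement("display:block"));                // false
    }
}
```

This sketch only looks at the inline style attribute and does not handle values such as "none !important" or rules that come from a stylesheet.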

Just try this:

//Assuming you have all the data in a Document called doc:
String cleanData = doc.select("query").text();

The text() method strips all HTML tags and decodes entities into human-readable content. There is also ownText(), which returns only the text that belongs directly to the element and skips the text of its child elements, so it might help as well. I can't say which will best fit your purposes.

You can use Jsoup to access the inner HTML of the elements, remove the escaped characters, and write the inner HTML back:

Elements elements = doc.select("span");
for(Element e : elements) {
    e.html( e.html().replaceAll("&amp;","") );
}

The example above selects all span elements and then replaces every &amp; in each element's inner HTML with the empty string, or with whatever character you wish.

Additionally, you should know that &amp; is the escape code for the & character. Without escaping & characters, you may have HTML validation issues. In your case, without additional information, I'm assuming you just really want to eliminate them. If not, this will help get you started. Good luck!

If you need to remove the trailing numbers:

// eliminate ampersand and all trailing numbers
e.html( e.html().replaceAll("&amp;[0-9]*","") );

For more information on regular expressions, see the Javadoc for java.util.regex.Pattern.
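
To see what the pattern above does outside of Jsoup, here is a small sketch that applies it to plain strings. The class name StripEntityDigits and both method names are invented for this example. Note that [0-9]* also matches zero digits, so an ordinary &amp; gets removed as well; the second method uses [0-9]+ as one possible variation that only removes an &amp; followed by at least one digit.

```java
class StripEntityDigits {

    // Same pattern as above: removes "&amp;" plus any digits that follow (possibly none).
    static String strip(String innerHtml) {
        return innerHtml.replaceAll("&amp;[0-9]*", "");
    }

    // Variation (assumption, not from the answer): requires at least one digit after "&amp;".
    static String stripNumbered(String innerHtml) {
        return innerHtml.replaceAll("&amp;[0-9]+", "");
    }

    public static void main(String[] args) {
        System.out.println("[" + strip("&amp;0000000000000217000000") + "]"); // []
        System.out.println("[" + strip("Tom &amp; Jerry") + "]");             // [Tom  Jerry]
        System.out.println("[" + stripNumbered("Tom &amp; Jerry") + "]");     // [Tom &amp; Jerry]
    }
}
```

Which of the two fits depends on whether the page also contains legitimate ampersands that should be kept.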
