
Alternative to Jsoup, or how to clean up whitespace

Does anybody know an alternative to Jsoup?

Or how to remove sequences like <p>&nbsp;</p>?

The HTML Clean plug-in for jQuery works well for me, but I want to clean the HTML on the server side, not on the client side.

Or, what replaceAll expression would do this?

String cleanS = dirtyS.replaceAll("<p>&nbsp;</p>", ""); // This doesn't work

I have discovered that the dirty HTML contains mixed sequences of non-breaking spaces (character #160) and ordinary spaces (character #32).

So what I need is an expression that removes any mixture of them.


You can change the OutputSettings for this:

Example:

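import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;
import org.jsoup.safety.Whitelist; // newer jsoup releases rename this class to Safelist

final String html = ...;

// Use the xhtml escape mode when serializing the cleaned output
Document.OutputSettings settings = new Document.OutputSettings();
settings.escapeMode(Entities.EscapeMode.xhtml);

String cleanHtml = Jsoup.clean(html, "", Whitelist.relaxed(), settings);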

This is possible with a Document parsed by Jsoup too:

Document doc = Jsoup.parse(...);
doc.outputSettings().escapeMode(Entities.EscapeMode.xhtml);

// ...

Edit:

Removing tags:

doc.select("p:matchesOwn((?is)\u00a0)").remove();

Please note: \u00a0 after (?is) is not an ordinary blank but the Java escape for char #160 (= nbsp). Because :matchesOwn searches the element's own text for the pattern, this removes every p tag whose own text contains a &nbsp;. If you want to do the same with all other tags, replace the p: with *: .

If you have the Document object, you can loop over the paragraph elements and remove all those that have no text (or only whitespace) in them. Before checking whether the text is empty, strip the occurrences of NBSP from it. Assuming you're working with UTF-8 documents, the following might work for you:

public static final String NBSP_IN_UTF8 = "\u00a0"; 

Assuming you know how to get the Document object, the loop to clean is simple: select the paragraph elements and remove empty ones:

org.jsoup.nodes.Document doc = ...; // obtain your document object
for (org.jsoup.nodes.Element element : doc.select("p")) {
    // A paragraph counts as empty if nothing is left once NBSPs and surrounding whitespace are stripped
    if (!element.hasText() || element.text().replace(NBSP_IN_UTF8, "").trim().isEmpty()) {
        element.remove();
    }
}
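
If you would rather stay with a plain replaceAll, as the question asks, here is a minimal sketch that uses only java.util.regex. It assumes the empty paragraphs are written as a bare <p> with no attributes, and that the non-breaking space may show up as the raw character #160, as &nbsp;, or as &#160;. The names NbspParagraphs and stripBlankParagraphs are made up for illustration. Regular expressions are fragile on arbitrary HTML, so treat this as a quick fix rather than a replacement for a parser.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup: strips <p> elements that hold only blanks.
final class NbspParagraphs {

    // A bare <p>, then any mix of ordinary whitespace, char #160, &nbsp; or &#160;, then </p>.
    // Java's \s does not match char #160 by default, so it is listed explicitly.
    private static final Pattern BLANK_P = Pattern.compile(
            "<p>(?:\\s|\\u00A0|&nbsp;|&#160;)*</p>",
            Pattern.CASE_INSENSITIVE);

    static String stripBlankParagraphs(String dirtyHtml) {
        return BLANK_P.matcher(dirtyHtml).replaceAll("");
    }

    public static void main(String[] args) {
        String dirty = "<p>keep</p><p>&nbsp; \u00A0&#160;</p><p> </p>";
        System.out.println(stripBlankParagraphs(dirty)); // prints <p>keep</p>
    }
}
```

Paragraphs that contain any real text are left untouched, because the pattern only allows blank characters and their entity forms between the opening and closing tag.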
