
Jsoup clean method leaves &nbsp; elements

I was trying to use this code to strip all HTML elements from my text:

Jsoup.clean(preparedText, Whitelist.none())

Unfortunately, it didn't remove the &nbsp; elements. I thought it would replace them with whitespace, the same way it replaces &middot; with a middle dot ("·").

Should I use another method in order to achieve this functionality?

From the Jsoup docs:

Whitelists define what HTML (elements and attributes) to allow through the cleaner. Everything else is removed.

So the whitelist is concerned only with tags and attributes. &nbsp; is neither a tag nor an attribute; it is simply the HTML entity encoding for a special character (the non-breaking space). If you want to translate from the encoding to normal text, you can use, for example, the excellent Apache Commons Lang library or Jsoup's unescapeEntities method:

System.out.println(Parser.unescapeEntities(doc.toString(), false));
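Note that unescaping turns &nbsp; into the non-breaking space character (U+00A0), not an ordinary space. If you want a regular space instead, one option is to replace that character afterwards. Below is a minimal sketch using only the Java standard library; the class name NbspNormalizer and the sample string are made up for illustration, and it assumes the input has already been cleaned and unescaped as shown above.

```java
final class NbspNormalizer {
    // U+00A0 is the character that the &nbsp; entity decodes to.
    private static final char NBSP = '\u00A0';

    /** Replaces every non-breaking space with an ordinary space (U+0020). */
    static String normalize(String text) {
        return text.replace(NBSP, ' ');
    }

    public static void main(String[] args) {
        // Stand-in for text that was already cleaned and unescaped (hypothetical sample).
        String unescaped = "Hello\u00A0world";
        System.out.println(normalize(unescaped));
    }
}
```

This only handles U+00A0; other whitespace-like characters would need their own replacement.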

Addendum:

The translation from &middot; to "·" already happens when you parse the HTML. It does not seem to have anything to do with the clean method.
