I was trying to use this code to strip all HTML elements from my text:
Jsoup.clean(preparedText, Whitelist.none())
Unfortunately, it didn't remove the &nbsp; entities. I thought it would replace them with whitespace, the same way it replaces &middot; with a middle dot ("·").
Should I use another method in order to achieve this functionality?
From the Jsoup docs:
Whitelists define what HTML (elements and attributes) to allow through the cleaner. Everything else is removed.
So whitelists are concerned only with tags and attributes.
&nbsp; is neither a tag nor an attribute. It is simply the HTML encoding for a special character. If you want to translate from the encoding to normal text, you can use, for example, the excellent Apache Commons Lang library or Jsoup's unescapeEntities method:
System.out.println(Parser.unescapeEntities(doc.toString(), false));
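Note that unescaping turns &nbsp; into the non-breaking space character (U+00A0), which is still not an ordinary space: String.trim() does not strip it and the default regex \s does not match it. If the goal is a plain space, one option is to replace it explicitly. Below is a minimal sketch in plain Java, with no Jsoup dependency. The class and method names (NbspNormalizer, normalizeNbsp) are hypothetical, and it assumes the input may contain either the still-escaped entity or the decoded character:

```java
class NbspNormalizer {

    // Hypothetical helper (not part of Jsoup): turns non-breaking spaces
    // into ordinary spaces, whether they are still escaped or already decoded.
    static String normalizeNbsp(String text) {
        return text
                .replace("&nbsp;", " ")   // still-escaped entity, as it appears in HTML output
                .replace('\u00A0', ' ');  // decoded non-breaking space character
    }

    public static void main(String[] args) {
        String escaped = "Hello&nbsp;world";   // e.g. text that has not been unescaped yet
        String decoded = "Hello\u00A0world";   // e.g. text after unescaping entities

        System.out.println(normalizeNbsp(escaped)); // prints "Hello world"
        System.out.println(normalizeNbsp(decoded)); // prints "Hello world"
    }
}
```

You would call something like this on the cleaned (and, if needed, unescaped) string as a last step.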
Addendum:
The translation from &middot; to "·" already happens when you parse the HTML. It does not seem to have anything to do with the clean method.