Jsoup clean method leaves &nbsp; elements

Question

I was trying to use this code to strip all HTML elements from my text:

Jsoup.clean(preparedText, Whitelist.none())

Unfortunately, it didn't remove the &nbsp; entities. I thought it would replace them with whitespace, the same way it replaces &middot; with a middle dot ("·").

Should I use another method in order to achieve this functionality?

Answer 1

From the Jsoup docs:

Whitelists define what HTML (elements and attributes) to allow through the cleaner. Everything else is removed.

So the whitelist is concerned only with tags and attributes. &nbsp; is neither a tag nor an attribute; it is simply the HTML entity encoding of a special character (the non-breaking space). If you want to translate the entity back to normal text, you can use, for example, the excellent Apache Commons Lang library, or Jsoup's Parser.unescapeEntities method:

System.out.println(Parser.unescapeEntities(doc.toString(), false));

Addendum:

The translation from &middot; to "·" already happens when you parse the HTML. It does not seem to be related to the clean method.
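Unescaping &nbsp; yields the non-breaking space character (U+00A0), not an ordinary space, so a further replacement is needed if plain whitespace is the goal. Below is a minimal sketch of that last step using only the Java standard library. The class name NbspNormalizer and the method toPlainSpaces are hypothetical names chosen for illustration; they are not part of Jsoup. The input is assumed to be a string that has already been passed through Jsoup.clean.

```java
// Hypothetical helper (not part of Jsoup): post-processes already-cleaned text.
final class NbspNormalizer {
    private NbspNormalizer() {}

    /**
     * Replaces the literal "&nbsp;" entity and the non-breaking space
     * character (U+00A0) with an ordinary space.
     */
    static String toPlainSpaces(String cleaned) {
        return cleaned
                .replace("&nbsp;", " ")   // entity left in the cleaned output
                .replace('\u00A0', ' ');  // character produced by unescaping
    }

    public static void main(String[] args) {
        System.out.println(toPlainSpaces("foo&nbsp;bar")); // prints "foo bar"
    }
}
```

Handling both forms means the helper works whether or not the text was run through Parser.unescapeEntities first.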

solution1
2 2016-01-19 10:52:06
