Jsoup Whitelist: Parsing non-English characters

Question

I am trying to clean HTML text and extract plain text from it using Jsoup. The HTML might contain non-English characters.

For example the HTML text is:

String html = "<p>Á <a href='http://example.com/'><b>example</b></a> link.</p>";

Now if I use Jsoup#parse(String html):

String text = Jsoup.parse(html).text();

It prints:

Á example link.

And if I clean the text using Jsoup#clean(String bodyHtml, Whitelist whitelist):

String text = Jsoup.clean(html, Whitelist.none());

It prints:

&Aacute; example link.

My question is, how can I get the text

Á example link.

using Whitelist and the clean() method? I want to use Whitelist since I might need to use Whitelist#addTags(String... tags).

Any information will be very helpful to me.

Thanks.

Answer 1

This is not possible in the current version (1.6.1). jsoup prints Á as &Aacute; because of its entity escaping feature, and there is currently no "don't escape" mode (check Entities.EscapeMode).

You can either 1. unescape these HTML entities after cleaning, or 2. extend jsoup's source code by adding a new escape mode with an empty map.
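
The first option, unescaping the entities, could look roughly like the sketch below. It is a minimal illustration only: EntityUnescaper, its small NAMED table, and the sample input are hypothetical and not part of jsoup. It uses only the JDK (Java 9 or later) and decodes numeric references plus a handful of named entities, leaving anything it does not recognise untouched.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of jsoup): decodes HTML entities in the
// string returned by Jsoup.clean(html, Whitelist.none()).
class EntityUnescaper {
    // Illustrative subset only; a real table would cover every named entity.
    private static final Map<String, String> NAMED = Map.of(
            "Aacute", "\u00C1",
            "eacute", "\u00E9",
            "nbsp", "\u00A0",
            "amp", "&",
            "lt", "<",
            "gt", ">",
            "quot", "\"");

    // Matches &name;, decimal &#193; and hexadecimal &#xC1; references.
    private static final Pattern ENTITY = Pattern.compile(
            "&(#[xX][0-9a-fA-F]{1,6}|#[0-9]{1,7}|[A-Za-z][A-Za-z0-9]*);");

    static String unescape(String escaped) {
        Matcher m = ENTITY.matcher(escaped);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String decoded = decode(m.group(1), m.group());
            // quoteReplacement keeps '$' and '\' in the result literal.
            m.appendReplacement(out, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the decoded character, or the original entity text if unknown.
    private static String decode(String body, String original) {
        if (body.charAt(0) != '#') {
            return NAMED.getOrDefault(body, original);
        }
        boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
        int codePoint = hex
                ? Integer.parseInt(body.substring(2), 16)
                : Integer.parseInt(body.substring(1));
        return Character.isValidCodePoint(codePoint)
                ? new String(Character.toChars(codePoint))
                : original;
    }

    public static void main(String[] args) {
        // Stand-in for the output of Jsoup.clean(html, Whitelist.none()).
        String cleaned = "&Aacute; example link.";
        System.out.println(unescape(cleaned));
    }
}
```

In practice a maintained routine with the full entity table, for example StringEscapeUtils.unescapeHtml4 from Apache Commons Lang 3, is probably a safer choice than a hand-written map. Note also that decoding &lt; and &gt; yields literal angle brackets, so the result should be treated as plain text and not written back into a page as HTML.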
