
Jsoup Whitelist: Parsing non-English characters

I am trying to clean HTML text and extract plain text from it using Jsoup. The HTML might contain non-English characters.

For example, the HTML text is:

String html = "<p>Á <a href='http://example.com/'><b>example</b></a> link.</p>";

Now if I use Jsoup#parse(String html):

String text = Jsoup.parse(html).text();

the resulting text is:

Á example link.

And if I clean the text using Jsoup#clean(String bodyHtml, Whitelist whitelist):

String text = Jsoup.clean(html, Whitelist.none());

the resulting text is:

&Aacute; example link.

My question is, how can I get the text

Á example link.

using Whitelist and the clean() method? I want to use Whitelist because I might need to call Whitelist#addTags(String... tags).

Any information would be very helpful.

Thanks.

This is not possible in the current version (1.6.1). jsoup prints Á as &Aacute; because of its entity-escaping feature, and there is currently no "don't escape" mode (see Entities.EscapeMode).

You can either (1) unescape these HTML entities yourself after calling clean(), or (2) extend jsoup's source code by adding a new escape mode with an empty map.
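The first option, unescaping the entities after clean(), could be sketched roughly as below. This is not part of jsoup: the class name EntityUnescaper, the method unescape, and the deliberately tiny NAMED table are all hypothetical and exist only for illustration. A real application would need a complete entity table or a library that provides one.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a jsoup class): decodes HTML entities in a string
// such as the one returned by Jsoup.clean(html, Whitelist.none()).
final class EntityUnescaper {

    // Matches hex references (&#xC1;), decimal references (&#193;)
    // and named references (&Aacute;).
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    // Assumption: a small illustrative subset of named entities only.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("Aacute", "\u00C1");
        NAMED.put("aacute", "\u00E1");
        NAMED.put("eacute", "\u00E9");
        NAMED.put("nbsp", "\u00A0");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("amp", "&");
    }

    // Replaces every recognised entity in a single pass; unknown entities
    // are left exactly as they were.
    static String unescape(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String decoded = decode(m.group(1));
            String replacement = (decoded != null) ? decoded : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the decoded text for one entity body, or null if unknown/invalid.
    private static String decode(String body) {
        if (body.charAt(0) != '#') {
            return NAMED.get(body);
        }
        boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
        try {
            int codePoint = hex
                    ? Integer.parseInt(body.substring(2), 16)
                    : Integer.parseInt(body.substring(1));
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : null;
        } catch (NumberFormatException e) {
            return null; // digits overflowed an int
        }
    }

    public static void main(String[] args) {
        // Stand-in for the string produced by Jsoup.clean(...) in the question.
        String cleaned = "&Aacute; example link.";
        System.out.println(unescape(cleaned));
    }
}
```

Because the input is decoded in a single pass, a string such as &amp;lt; becomes &lt; rather than <. Note that the result is plain text, not sanitized HTML: if tags are allowed through with Whitelist#addTags, decoding &lt; and &gt; afterwards can produce text that looks like markup, so the unescaped string should not be written back into a page without escaping it again.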
