I am trying to clean HTML text and extract plain text from it using Jsoup. The HTML might contain non-English characters.
For example the HTML text is:
String html = "<p>Á <a href='http://example.com/'><b>example</b></a> link.</p>";
Now if I use Jsoup#parse(String html):
String text = Jsoup.parse(html).text();
It is printing:
Á example link.
And if I clean the text using Jsoup#clean(String bodyHtml, Whitelist whitelist):
String text = Jsoup.clean(html, Whitelist.none());
It is printing:
&Aacute; example link.
My question is, how can I get the text
Á example link.
using Whitelist and the clean() method? I want to use Whitelist since I might need to use Whitelist#addTags(String... tags).
Any information will be very helpful to me.
Thanks.
This is not possible in the current version (1.6.1). jsoup prints Á as &Aacute; because of its entity escaping feature, and there is currently no "don't escape" mode (check Entities.EscapeMode).
You can either 1. unescape these HTML entities after calling clean(), or 2. extend jsoup's source code by adding a new escape mode with an empty map.
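As a rough illustration of option 1, the sketch below post-processes the escaped string that clean() returns. It uses only the JDK so that it runs on its own. The class EntityUnescaper, its unescape method, and the tiny entity table are hypothetical names made up for this example and are not part of jsoup. A real project would more likely use a complete unescaper such as StringEscapeUtils.unescapeHtml4 from Apache Commons Text, since HTML defines far more named entities than are listed here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup: decodes a few named entities
// plus decimal and hexadecimal numeric entities in a single pass.
public class EntityUnescaper {

    // Deliberately small, illustrative table (assumption: extend as needed).
    private static final Map<String, String> NAMED = new HashMap<String, String>();
    static {
        NAMED.put("Aacute", "\u00C1");
        NAMED.put("aacute", "\u00E1");
        NAMED.put("nbsp", "\u00A0");
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
    }

    // Matches &name;  &#123;  and  &#x1F;
    private static final Pattern ENTITY = Pattern.compile(
            "&(#[0-9]{1,7}|#[xX][0-9a-fA-F]{1,6}|[A-Za-z][A-Za-z0-9]*);");

    public static String unescape(String escaped) {
        Matcher m = ENTITY.matcher(escaped);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            String replacement = null;
            if (body.startsWith("#")) {
                boolean hex = body.startsWith("#x") || body.startsWith("#X");
                int codePoint = hex
                        ? Integer.parseInt(body.substring(2), 16)
                        : Integer.parseInt(body.substring(1));
                if (Character.isValidCodePoint(codePoint)) {
                    replacement = new String(Character.toChars(codePoint));
                }
            } else {
                replacement = NAMED.get(body);
            }
            if (replacement == null) {
                replacement = m.group(); // unknown entity: leave it untouched
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The literal below stands in for Jsoup.clean(html, Whitelist.none()).
        String cleaned = "&Aacute; example link.";
        System.out.println(unescape(cleaned));
    }
}
```

Note that the result is plain text, not HTML: an escaped &lt; comes back as a literal <, so the unescaped string should not be written into a page again without re-escaping.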