Extract text from html string using Jsoup with specific encoding

Question

Here is what I have:

String html = "<p><b>Annie's and Lärabar</b></p>";

After running the following:

org.jsoup.nodes.Document doc = Jsoup.parse(html);
Element p = doc.select("p").first();
String s = p.text();
System.out.println(s);

the output is "Annie's and L?rabar".

The character "ä" became a question mark.
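
One way to see where a "?" like this can come from, as a minimal sketch using only the JDK (the class and method names are made up for illustration): when Java encodes a string into a charset that has no mapping for a character, the encoder substitutes "?". Whether that is what happens in the setup above is an assumption; it depends on which charset the console stream actually uses.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class QuestionMarkDemo {

    // Encode the text to bytes and decode it again with the same charset,
    // roughly what happens to characters on their way to a console.
    static String roundTrip(String text, Charset charset) {
        byte[] encoded = text.getBytes(charset); // unmappable characters become '?'
        return new String(encoded, charset);
    }

    public static void main(String[] args) {
        // \u00e4 is 'ä', escaped so the source file encoding does not matter.
        String s = "L\u00e4rabar";

        // US-ASCII has no 'ä', so it is replaced during encoding.
        System.out.println(roundTrip(s, StandardCharsets.US_ASCII).equals("L?rabar")); // true

        // ISO-8859-1 does contain 'ä' (byte 0xE4), so the text survives.
        System.out.println(roundTrip(s, StandardCharsets.ISO_8859_1).equals(s)); // true
    }
}
```

Since ISO-8859-1 can represent "ä", a "?" in the output suggests that some step in the chain is using a narrower or mismatched charset.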

My JVM's default encoding is "iso-8859-1", and it seems to me that Jsoup's default encoding is UTF-8. I would like to force Jsoup.parse() to use "iso-8859-1" when parsing the html string.

I read the API docs and googled for examples, but I can't find a single example showing that Jsoup.parse() can take a specific encoding when parsing a string.
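
A note on why such an example may be hard to find: a charset only comes into play when bytes are turned into characters, and a Java String is already decoded. As far as I can tell, the Jsoup.parse() overloads that accept a charset name are the ones reading from a File or an InputStream, not the String ones. The sketch below uses only the JDK to show the byte-decoding step itself; the class name and helper are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {

    // Turn raw bytes into a String using an explicit charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        String original = "L\u00e4rabar"; // \u00e4 is 'ä'

        // Pretend these bytes came from a file or socket written as ISO-8859-1.
        byte[] raw = original.getBytes(StandardCharsets.ISO_8859_1);

        // Decoding with the matching charset restores the text.
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1).equals(original)); // true

        // Decoding with the wrong charset does not: the byte 0xE4 followed by 'r'
        // is not a valid UTF-8 sequence.
        System.out.println(decode(raw, StandardCharsets.UTF_8).equals(original)); // false
    }
}
```

If the HTML arrives as bytes, the place to name the encoding is where those bytes are decoded, before or while handing them to the parser.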

Can anyone help? Thank you in advance!

-Cyn

Answer 1

You can set the charset on the Document as below:

org.jsoup.nodes.Document doc = Jsoup.parse(html);
// Pass the Charset you need, e.g. ISO-8859-1
doc.charset(java.nio.charset.StandardCharsets.ISO_8859_1);
Element p = doc.select("p").first();
String s = p.text();

Hope this helps. Refer to: https://jsoup.org/apidocs/org/jsoup/nodes/Document.html#charset-java.nio.charset.Charset-
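
One caveat, offered tentatively: per the Jsoup documentation linked above, Document.charset(Charset) sets the charset used when the document is serialized, so it may not change what System.out does with the String returned by text(). If the "?" is introduced at print time (an assumption about the asker's setup), one option is to print through a PrintStream with an explicit charset. The sketch below uses only the JDK; the class and helper names are hypothetical, and "ISO-8859-1" stands in for whatever the terminal actually expects.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ExplicitConsoleCharset {

    // Wrap a stream in an auto-flushing PrintStream that encodes with the named charset.
    static PrintStream printerFor(OutputStream target, String charsetName) {
        try {
            return new PrintStream(target, true, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    // Helper for inspection: the bytes such a PrintStream emits for the text.
    static byte[] bytesPrinted(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = printerFor(sink, charsetName);
        out.print(text);
        out.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String s = "Annie's and L\u00e4rabar"; // \u00e4 is 'ä'

        // Assumption: the terminal expects ISO-8859-1; change the name to match yours.
        PrintStream out = printerFor(System.out, "ISO-8859-1");
        out.println(s);
    }
}
```

With ISO-8859-1 the "ä" is written as the single byte 0xE4; with UTF-8 it would be two bytes, so the charset given here has to match what the terminal is configured to display.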

1 answer

solution1
0 ACCEPTED 2017-03-10 06:34:25
