
Jsoup parse iso-8859-1 file

I've been looking online and trying to understand this. I am parsing some HTML files that are encoded in ISO-8859-1. Once they are parsed, I want all the output to be in the standard Java encoding (UTF-something).

Here is how I do this:

currentDocument = Jsoup.parse(new File("thing.htm"), "ISO-8859-1");
Element elt = currentDocument.getElementById("bim");
String title = elt.select("h1,h2,h3,h4,h5,h6").first().text();
System.out.println(title);

The string in the file is:

G18 Legemiddeløkonomi – pasientens venn eller fiende

The output is:

G18?Legemiddel?konomi ? pasientens venn eller fiende

I guess I'm doing something wrong somewhere. I know this is possible with Jsoup, I just don't know what the mistake is. By the way, I'm on Mac OS X. Can somebody help me?

Thx

OK, so after investigating further, and thanks to @Esailija, I found that my console wasn't outputting in UTF-8, which was solved by:

PrintStream stdout = new PrintStream(System.out, true, "UTF-8"); 
System.setOut(stdout);
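
To illustrate why the question marks appeared, here is a minimal sketch using only the JDK (no Jsoup). It assumes the console stream was using an encoding that cannot represent "ø"; US-ASCII stands in for that encoding here, since the post does not say which one it actually was. The class name ConsoleEncodingDemo and the helper printWith are hypothetical names made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class ConsoleEncodingDemo {

    // Hypothetical helper: returns the bytes a PrintStream with the given
    // encoding would write for the text (a stand-in for the console).
    static byte[] printWith(String text, String charsetName) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            PrintStream out = new PrintStream(sink, true, charsetName);
            out.print(text);
            out.flush();
            return sink.toByteArray();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // Simulate the file contents: "ø" (U+00F8) is a single byte in ISO-8859-1.
        byte[] fileBytes = "Legemiddel\u00f8konomi".getBytes(StandardCharsets.ISO_8859_1);

        // Decoding with the right charset gives a correct Java String.
        String decoded = new String(fileBytes, StandardCharsets.ISO_8859_1);
        System.out.println(decoded.equals("Legemiddel\u00f8konomi")); // true

        // A stream that cannot encode "ø" substitutes '?' on output.
        byte[] ascii = printWith(decoded, "US-ASCII");
        System.out.println(new String(ascii, StandardCharsets.US_ASCII)); // Legemiddel?konomi

        // A UTF-8 stream writes the character as two bytes and loses nothing.
        byte[] utf8 = printWith(decoded, "UTF-8");
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(decoded)); // true
    }
}
```

In this sketch the string is already correct after decoding, and the damage happens only when it is written out, which is consistent with the fix above being on the PrintStream rather than on the parsing call.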

I also called currentDocument.outputSettings().charset("UTF-8"), but I am not sure this is needed.

