
Bad/strange encoding when converting a file from CP1250 to UTF-8 in Java

I have a problem converting a file from CP1250 to UTF-8. Almost all characters are converted correctly, but the characters "ň" and "Ř" are not (they come out as "?").

In NetBeans I set the project encoding to UTF-8.

A test string in the file is "skříň SKŘÍŇ". The console output is "skĹ™ĂĹ? SKĹ?ÍŇ". The output differs from the same conversion done in PHP, for example. I'm at my wit's end.

My code:

String line;
BufferedReader br = new BufferedReader(
    new InputStreamReader(new FileInputStream("file-cp1250.txt"), "CP1250"));
while ((line = br.readLine()) != null) {
  // re-encode the decoded line and reinterpret the bytes
  line = new String(line.getBytes("UTF-8"), "CP1250");
  System.out.println(line);
}

Thanks for any advice.

The following would be correct in principle:

String line;
BufferedReader br = new BufferedReader(
    new InputStreamReader(new FileInputStream("file-cp1250.txt"), "CP1250"));
while ((line = br.readLine()) != null) {
    System.out.println(line);
}

That is, the binary data of the InputStream is declared to be Windows code page 1250 and is decoded as it is read. A Java String always holds Unicode (so it can mix all scripts). The extra line from the question, line = new String(line.getBytes("UTF-8"), "CP1250"), re-encodes the already-decoded text as UTF-8 bytes and then misinterprets those bytes as Cp1250, which is exactly what mangles the string; it must be dropped.
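
Since the actual goal is converting the file to UTF-8, the decoded text should be written back out through a UTF-8 encoder rather than printed to the console. A minimal sketch, assuming an output file name "file-utf8.txt" (any name will do):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

public class Cp1250ToUtf8 {
    public static void main(String[] args) throws IOException {
        // Decode Cp1250 bytes into Unicode strings, then encode as UTF-8.
        // File names are placeholders for this sketch.
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                     new FileInputStream("file-cp1250.txt"), "CP1250"));
             BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(
                     new FileOutputStream("file-utf8.txt"), "UTF-8"))) {
            String line;
            while ((line = br.readLine()) != null) {
                bw.write(line);
                bw.newLine();
            }
        }
    }
}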

However, System.out generally writes to the platform-dependent console, and its encoding might not be Cp1250 but something else. The Unicode text might be encoded to Cp1252, Microsoft's Latin-1; characters like "ň" and "Ř" that Cp1252 cannot represent then come out as "?", and one starts suspecting a bug in the conversion. In that case System.out simply cannot be used to check the result.
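
To verify this, one can print the JVM's default charset and wrap stdout in a PrintStream with an explicit encoding. A minimal sketch; "Cp852" is only an assumption for a Central European Windows console (check the actual code page with chcp on Windows):

import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

public class ConsoleEncodingCheck {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // The charset the JVM uses by default, and thus for System.out
        System.out.println("Default charset: " + Charset.defaultCharset());

        // Wrap stdout with an explicit encoding; "Cp852" is an assumption,
        // substitute whatever code page the console actually runs.
        PrintStream out = new PrintStream(
                new FileOutputStream(FileDescriptor.out), true, "Cp852");
        out.println("skříň SKŘÍŇ");
    }
}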
