
Bad/strange encoding when converting a file from CP1250 to UTF-8 in Java

I have a problem converting a file from CP1250 to UTF-8. Almost all characters are converted correctly, but the characters "ň" and "Ř" are not (they come out with a "?" character).

In NetBeans, I set the project encoding to UTF-8.

The test string in the file can be "skříň SKŘÍŇ". Console output: "skĹ™ĂĹ? SKĹ?ÍŇ". The output differs from what the same conversion produces in, for example, PHP. I'm at my wits' end.

My code:

BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("file-cp1250.txt"), "CP1250"));
while ((line = br.readLine()) != null) {
  line = new String(line.getBytes("UTF-8"), "CP1250");
  System.out.println(line);
}

Thanks for any advice.

The following would be correct in principle:

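// Decode the file's bytes as CP1250 while reading; no second conversion is needed.
BufferedReader br = new BufferedReader(
    new InputStreamReader(new FileInputStream("file-cp1250.txt"), "CP1250"));
String line;
while ((line = br.readLine()) != null) {
    System.out.println(line);
}
br.close();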

That is, the binary data of the InputStream is declared to be Windows code page 1250 and is decoded as it is read. A Java String always holds Unicode (so it can combine all scripts), which means that no further conversion of the String itself is needed after reading.
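As a rough sketch of the idea above (decode once while reading, encode once while writing), the conversion to a UTF-8 file might look like the following. The class name `Cp1250ToUtf8`, the helper names, and the temporary files are assumptions made for illustration and are not part of the original post. The sample text is written with Unicode escapes so the example does not depend on the encoding of the source file.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Cp1250ToUtf8 {
    // "windows-1250" is the canonical Java name for Cp1250.
    static final Charset CP1250 = Charset.forName("windows-1250");

    // Decode CP1250 bytes into a (Unicode) String, then encode that String as UTF-8.
    static byte[] transcode(byte[] cp1250Bytes) {
        String text = new String(cp1250Bytes, CP1250);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Same idea for files: the reader decodes CP1250, the writer encodes UTF-8.
    static void convertFile(Path source, Path target) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(source, CP1250);
             BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // The test string from the question, spelled with Unicode escapes.
        String sample = "sk\u0159\u00ed\u0148 SK\u0158\u00cd\u0147";

        // Hypothetical temporary files stand in for file-cp1250.txt and its UTF-8 copy.
        Path source = Files.createTempFile("sample-cp1250", ".txt");
        Path target = Files.createTempFile("sample-utf8", ".txt");
        Files.write(source, sample.getBytes(CP1250));

        convertFile(source, target);

        String roundTrip = new String(Files.readAllBytes(target), StandardCharsets.UTF_8).trim();
        System.out.println("Round trip intact: " + roundTrip.equals(sample));
    }
}
```

Checking the result by comparing strings (or by opening the output file in an editor set to UTF-8) avoids relying on what the console happens to display.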

However, System.out generally writes to the platform-dependent console, whose encoding may well not be Cp1250 but something else. The Unicode text might be converted to Cp1252, Microsoft's Latin-1 variant, which cannot represent every character in the String. One is then led to suspect a bug in the conversion, when in fact System.out simply cannot be used to check the result.
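To illustrate the point about the console, the sketch below encodes "ň" and "Ř" through a PrintStream bound to a given charset and shows the resulting bytes. The class name `ConsoleEncodingDemo` and its helpers are hypothetical, and whether a real console actually uses Cp1252 depends on the platform. The last lines show one possible workaround, wrapping System.out in a PrintStream with an explicit encoding, which only helps if the terminal really interprets its input as UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ConsoleEncodingDemo {
    // Returns the bytes a PrintStream using the named charset would emit for the text.
    static byte[] printedBytes(String text, String charsetName) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            PrintStream out = new PrintStream(buffer, true, charsetName);
            out.print(text);
            out.flush();
            return buffer.toByteArray();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    // Formats bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String sample = "\u0148\u0158"; // the two problem characters from the question

        // Cp1252 has no mapping for either character, so each is replaced by '?' (0x3F).
        System.out.println("windows-1252: " + hex(printedBytes(sample, "windows-1252"))); // 3F 3F
        // UTF-8 can encode both characters, two bytes each.
        System.out.println("UTF-8:        " + hex(printedBytes(sample, "UTF-8")));        // C5 88 C5 98

        // Possible workaround: print through a stream with an explicit encoding.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(sample);
    }
}
```

Inspecting bytes or code points this way shows whether the String itself is correct, independently of how the console renders it.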

