Understanding String encoding/decoding in Java
I have a program that I run with mvn exec:java (my main file is encoded in UTF-8 and the default charset of my system is windows-1252):
System.out.println(Charset.defaultCharset()); //print windows-1252
String s = "éàè";
System.out.println(new String(s.getBytes(Charset.forName("UTF-8")))); //OK Print éàè
System.out.println(new String(s.getBytes(Charset.forName("windows-1252")))); //Not OK Print ▒▒▒
I don't understand why the first print works. According to the documentation, getBytes encodes the String into a sequence of bytes using the given charset, and the String constructor constructs a new String by decoding the specified array of bytes using the platform's default charset.
So the first print encodes in UTF-8 and then decodes with the platform's default charset, which is windows-1252. How can this work? It should not be able to decode the UTF-8 encoded byte array using the platform charset windows-1252.
The second print is wrong, and I don't understand why. As my file is encoded in UTF-8 and the platform charset is windows-1252, my intention is to encode the String with the windows-1252 charset, so I call s.getBytes(Charset.forName("windows-1252")) and then create a String from the result, but it doesn't work.
The String value éàè is encoded in UTF-8 as byte octets 0xC3 0xA9 0xC3 0xA0 0xC3 0xA8. Those same byte octets interpreted as Windows-1252 are the String value Ã©Ã<nbsp>Ã¨ (where <nbsp> is a non-breaking space character, Unicode codepoint U+00A0).
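To check these octets yourself, something like the following should do. It is only a sketch: the class and helper names (EncodingBytesDemo, hex, codePoints) are made up for illustration, and it assumes the JVM ships the windows-1252 charset, which is not one of the charsets Java guarantees. The string is written with Unicode escapes so the result does not depend on how the source file is compiled.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingBytesDemo {
    // Formats bytes as space-separated hex octets, e.g. "0xC3 0xA9".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Formats each char of a String as "U+XXXX" so invisible characters show up.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Escapes for e-acute, a-grave, e-grave; independent of the source file encoding.
        String s = "\u00e9\u00e0\u00e8";

        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(utf8)); // 0xC3 0xA9 0xC3 0xA0 0xC3 0xA8

        // Decode the UTF-8 bytes with the wrong charset (assumed to be available).
        String misread = new String(utf8, Charset.forName("windows-1252"));
        System.out.println(codePoints(misread)); // U+00C3 U+00A9 U+00C3 U+00A0 U+00C3 U+00A8
    }
}
```

Printing code points instead of the String itself keeps the console's own encoding out of the picture, and U+00A0 is the non-breaking space mentioned above.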
In the first example, you are converting a String to the above UTF-8 bytes, and then you are converting the bytes back to a String using Windows-1252 instead of UTF-8. So you should be getting a new String value of Ã©Ã<nbsp>Ã¨, not éàè. You are then writing that String to the console, so it gets encoded using Windows-1252 back to byte octets 0xC3 0xA9 0xC3 0xA0 0xC3 0xA8, which should be displayed as Ã©Ã<nbsp>Ã¨ (or something similar to it) if the console is displaying the bytes as-is. On the other hand, if the console is configured for UTF-8 instead, those bytes would display as éàè when interpreted as UTF-8.
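The chain of conversions in the first print can be replayed with explicit charsets, so that it no longer depends on the platform default. This is a hedged sketch, not your actual program: FirstPrintDemo and decodeUtf8BytesAsCp1252 are hypothetical names, and the windows-1252 charset is assumed to be available in the JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class FirstPrintDemo {
    // Stands in for the platform default charset from the question (assumed available).
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Mimics new String(s.getBytes(UTF-8)) on a JVM whose default charset is windows-1252.
    static String decodeUtf8BytesAsCp1252(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), CP1252);
    }

    public static void main(String[] args) {
        String s = "\u00e9\u00e0\u00e8"; // e-acute, a-grave, e-grave

        // Step 1: the intermediate String is NOT the original one.
        String garbled = decodeUtf8BytesAsCp1252(s);
        System.out.println(garbled.equals(s)); // false

        // Step 2: writing it out with windows-1252 reproduces the UTF-8 octets.
        byte[] toConsole = garbled.getBytes(CP1252);
        System.out.println(Arrays.equals(toConsole, s.getBytes(StandardCharsets.UTF_8))); // true

        // Step 3: a console that reads those octets as UTF-8 sees the original text.
        System.out.println(new String(toConsole, StandardCharsets.UTF_8).equals(s)); // true
    }
}
```

The two wrong conversions cancel out for these particular characters, which is why the first print only appears to work.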
In the second example, since you are using Windows-1252 for both encoding and decoding, and the particular characters in question are supported by Windows-1252, you should end up with the original String value éàè before writing it to the console. If that String then gets encoded to bytes using Windows-1252 and the console is configured for UTF-8, it makes sense that you don't see éàè displayed. The String value éàè is encoded in Windows-1252 as byte octets 0xE9 0xE0 0xE8, which is not a valid UTF-8 byte octet sequence.
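The second print can be checked the same way. The sketch below uses a strict CharsetDecoder to test whether a byte array is well-formed UTF-8; SecondPrintDemo and isValidUtf8 are illustrative names, and the windows-1252 charset is again assumed to be present in the JVM.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class SecondPrintDemo {
    // Stands in for the platform default charset from the question (assumed available).
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String s = "\u00e9\u00e0\u00e8"; // e-acute, a-grave, e-grave

        // Encoding and decoding with the same charset gives the original String back.
        byte[] cp1252 = s.getBytes(CP1252); // 0xE9 0xE0 0xE8
        System.out.println(new String(cp1252, CP1252).equals(s)); // true

        // The same octets are what reaches the console, and they are not valid UTF-8.
        System.out.println(isValidUtf8(cp1252)); // false
        System.out.println(isValidUtf8(s.getBytes(StandardCharsets.UTF_8))); // true
    }
}
```

So the String in the second print is fine inside the JVM; only the bytes sent to a UTF-8 console are the problem.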
In short, the behavior you are seeing is what happens when your console is configured to interpret the bytes your program writes as UTF-8, but you are not giving it properly UTF-8 encoded bytes as output.
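If the console really does expect UTF-8 (which is only inferred from the output you describe), one possible fix is to stop relying on the default charset when printing and wrap System.out in a PrintStream with an explicit encoding. This is a sketch of that idea under that assumption; Utf8ConsoleDemo, utf8Stream and printedBytes are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class Utf8ConsoleDemo {
    // Wraps any stream in a PrintStream that always encodes text as UTF-8.
    static PrintStream utf8Stream(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform must support.
            throw new IllegalStateException(e);
        }
    }

    // Captures the bytes a UTF-8 PrintStream would send for the given text.
    static byte[] printedBytes(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = utf8Stream(sink);
        out.print(text);
        out.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String s = "\u00e9\u00e0\u00e8"; // e-acute, a-grave, e-grave

        // No getBytes/new String round trip: print the String itself, encoded as UTF-8.
        PrintStream utf8Out = utf8Stream(System.out);
        utf8Out.println(s);

        System.out.println(printedBytes(s).length); // 6
    }
}
```

With this, the String is never re-decoded at all; the only encoding step is the one at the output boundary, and it matches what the console is assumed to expect.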