String byte encoding issue
Given that I have the following function:
static void fun(String str) {
System.out.println(String.format("%s | length in String: %d | length in bytes: %d | bytes: %s", str, str.length(), str.getBytes().length, Arrays.toString(str.getBytes())));
}
On invoking
fun("ó");
its output is
ó | length in String: 1 | length in bytes: 2 | bytes: [-61, -77]
So it means the character ó needs 2 bytes to be represented, and per the Character class documentation the default in Java is UTF-16 too. Considering that, when I do the following:
System.out.println(new String("ó".getBytes(), StandardCharsets.UTF_16));// output=쎳
System.out.println(new String("ó".getBytes(), StandardCharsets.ISO_8859_1));// output=Ã³
System.out.println(new String("ó".getBytes(), StandardCharsets.US_ASCII));// output=��
System.out.println(new String("ó".getBytes(), StandardCharsets.UTF_8));// output=ó
System.out.println(new String("ó".getBytes(), StandardCharsets.UTF_16BE));// output=쎳
System.out.println(new String("ó".getBytes(), StandardCharsets.UTF_16LE));// output=돃
Why is none of the UTF_16, UTF_16BE and UTF_16LE charsets able to decode the bytes properly, given that the bytes represent a 16-bit character? And how is UTF-8 able to decode it properly, given that UTF-8 considers each character to be only 8 bits long, so it should have printed 2 chars (1 char for each byte) like ISO_8859_1 did?
getBytes() with no argument always returns the bytes encoded in the platform's default charset, which is probably UTF-8 for you. From the documentation:
Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.
So you are essentially trying to decode a bunch of UTF-8 bytes with non-UTF-8 charsets. No wonder you don't get the expected results.
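To make the mismatch concrete, here is a minimal sketch that encodes ó as UTF-8 explicitly (rather than relying on the platform default) and then decodes the same two bytes with each charset. The class name MismatchedDecodeDemo and the helper transcode are made up for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MismatchedDecodeDemo {
    // Hypothetical helper: encode with one charset, decode with another.
    static String transcode(String s, Charset encodeWith, Charset decodeWith) {
        return new String(s.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        // ó written as a Unicode escape so the source file's encoding does not matter
        String s = "\u00F3";
        System.out.println("default charset: " + Charset.defaultCharset());

        // The UTF-8 bytes are 0xC3 0xB3. Each decoder interprets them its own way.
        Charset[] decoders = {
            StandardCharsets.UTF_8,      // one 2-byte sequence -> U+00F3
            StandardCharsets.ISO_8859_1, // one char per byte   -> U+00C3 U+00B3
            StandardCharsets.US_ASCII,   // both bytes invalid  -> U+FFFD U+FFFD
            StandardCharsets.UTF_16BE,   // one 16-bit unit     -> U+C3B3
            StandardCharsets.UTF_16LE    // bytes swapped       -> U+B3C3
        };
        for (Charset cs : decoders) {
            String decoded = transcode(s, StandardCharsets.UTF_8, cs);
            StringBuilder hex = new StringBuilder();
            for (char c : decoded.toCharArray()) {
                hex.append(String.format("U+%04X ", (int) c));
            }
            System.out.println(cs + " -> " + hex.toString().trim());
        }
    }
}
```

Printing the code points instead of the characters avoids depending on what the console can display. The UTF-16 decoders do produce a valid character from the two bytes, just not the one that was encoded.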
Though kind of pointless, you can get what you want by passing the desired charset to getBytes, so that the string is encoded with the same charset you then decode it with.
System.out.println(new String("ó".getBytes(StandardCharsets.UTF_16), StandardCharsets.UTF_16));
System.out.println(new String("ó".getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.ISO_8859_1));
System.out.println(new String("ó".getBytes(StandardCharsets.US_ASCII), StandardCharsets.US_ASCII));
System.out.println(new String("ó".getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8));
System.out.println(new String("ó".getBytes(StandardCharsets.UTF_16BE), StandardCharsets.UTF_16BE));
System.out.println(new String("ó".getBytes(StandardCharsets.UTF_16LE), StandardCharsets.UTF_16LE));
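As a rough check of those round trips, the following sketch prints the encoded bytes next to the decoded result for each charset. The class name RoundTripDemo and its helpers bytesOf and roundTrip are hypothetical names chosen for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripDemo {
    // Hypothetical helper: the bytes of s in the given charset, as text.
    static String bytesOf(String s, Charset cs) {
        return Arrays.toString(s.getBytes(cs));
    }

    // Hypothetical helper: encode and decode with the same charset.
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        // ó written as a Unicode escape so the source file's encoding does not matter
        String s = "\u00F3";
        Charset[] charsets = {
            StandardCharsets.UTF_16, StandardCharsets.ISO_8859_1,
            StandardCharsets.US_ASCII, StandardCharsets.UTF_8,
            StandardCharsets.UTF_16BE, StandardCharsets.UTF_16LE
        };
        for (Charset cs : charsets) {
            System.out.println(cs + " -> " + bytesOf(s, cs) + " -> " + roundTrip(s, cs));
        }
    }
}
```

Two details are worth noticing. UTF_16 produces four bytes because the encoder prepends a byte order mark, and US_ASCII cannot represent ó at all, so getBytes substitutes a question mark and that round trip yields "?" rather than the original character.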
You also seem to have some misunderstanding about encodings. It's not just about the number of bytes that a character takes. Two encodings using the same number of bytes per character doesn't mean that they are compatible with each other. Also, it is not always one byte per character in UTF-8. UTF-8 is a variable-length encoding.
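The variable-length point can be illustrated with a small sketch that counts the UTF-8 bytes of a few sample characters. The class name Utf8LengthDemo, the helper utf8Length and the choice of sample characters are assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthDemo {
    // Hypothetical helper: number of bytes s occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Unicode escapes are used so the source file's encoding does not matter.
        String[] samples = {
            "A",            // U+0041, ASCII range
            "\u00F3",       // U+00F3, the ó from the question
            "\u20AC",       // U+20AC, euro sign
            "\uD83D\uDE00"  // U+1F600, outside the BMP (a surrogate pair in Java)
        };
        for (String s : samples) {
            System.out.println(String.format(
                "U+%04X | code points: %d | length(): %d | UTF-8 bytes: %d",
                s.codePointAt(0), s.codePointCount(0, s.length()),
                s.length(), utf8Length(s)));
        }
    }
}
```

The last sample also shows that str.length() counts UTF-16 code units, not characters: a single code point outside the Basic Multilingual Plane has length() 2 in Java and takes 4 bytes in UTF-8.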