String byte encoding issue

Given that I have the following function:

static void fun(String str) {
    System.out.println(String.format("%s | length in String: %d | length in bytes: %d | bytes: %s", str, str.length(), str.getBytes().length, Arrays.toString(str.getBytes())));
}

On invoking fun("ó"); its output is:

ó | length in String: 1 | length in bytes: 2 | bytes: [-61, -77]

So it seems the character ó needs 2 bytes to represent, and according to the Character class documentation, the default in Java is UTF-16. Given that, when I run the following:

System.out.println(new String("ó".getBytes(), StandardCharsets.UTF_16));// output=쎳
System.out.println(new String("ó".getBytes(), StandardCharsets.ISO_8859_1));// output=Ã³
System.out.println(new String("ó".getBytes(), StandardCharsets.US_ASCII));// output=��
System.out.println(new String("ó".getBytes(), StandardCharsets.UTF_8));// output=ó
System.out.println(new String("ó".getBytes(), StandardCharsets.UTF_16BE));// output=쎳
System.out.println(new String("ó".getBytes(), StandardCharsets.UTF_16LE));// output=돃

Why can't any of the UTF_16, UTF_16BE, or UTF_16LE charsets decode the bytes properly, given that the bytes represent a 16-bit character? And how is UTF-8 able to decode them properly, given that UTF-8 treats each character as only 8 bits long? It should have printed 2 chars (1 char for each byte), as ISO_8859_1 did.

getBytes() always returns the bytes encoded in the platform's default charset, which is probably UTF-8 for you.

"Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array."
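To check which charset the no-argument getBytes() is using on a given machine, something like the following sketch may help. The class name DefaultCharsetDemo and the helper usesDefaultCharset are made up for illustration. On JDK 18 and later the default is normally UTF-8 unless it has been overridden; on older JDKs it depends on the operating system's locale settings.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical demo class, not part of the original question or answer.
class DefaultCharsetDemo {
    // True if the no-argument getBytes() yields the same bytes as an
    // explicit encode with the platform's default charset.
    static boolean usesDefaultCharset(String s) {
        return Arrays.equals(s.getBytes(), s.getBytes(Charset.defaultCharset()));
    }

    public static void main(String[] args) {
        // "\u00F3" is the character ó, written as an escape so the result
        // does not depend on the encoding of the source file.
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("getBytes() uses it: " + usesDefaultCharset("\u00F3"));
    }
}
```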

So you are essentially trying to decode a bunch of UTF-8 bytes with non-UTF-8 charsets. No wonder you don't get the expected results.
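To see where the odd characters in the question come from, here is a small sketch that encodes ó as UTF-8 explicitly (so it does not rely on the platform default) and prints the UTF-16 code units each decoder produces. The class MisdecodeDemo and the helper codeUnits are names invented for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question or answer.
class MisdecodeDemo {
    // Formats each char (UTF-16 code unit) of s as four hex digits.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The UTF-8 encoding of U+00F3 (ó) is the two bytes 0xC3 0xB3,
        // which print as -61 and -77 when shown as signed Java bytes.
        byte[] utf8 = "\u00F3".getBytes(StandardCharsets.UTF_8);

        System.out.println(codeUnits(new String(utf8, StandardCharsets.UTF_8)));      // 00F3
        System.out.println(codeUnits(new String(utf8, StandardCharsets.UTF_16BE)));   // C3B3
        System.out.println(codeUnits(new String(utf8, StandardCharsets.UTF_16LE)));   // B3C3
        System.out.println(codeUnits(new String(utf8, StandardCharsets.ISO_8859_1))); // 00C3 00B3
        System.out.println(codeUnits(new String(utf8, StandardCharsets.US_ASCII)));   // FFFD FFFD
    }
}
```

The UTF-16 decoders glue the two bytes together into a single 16-bit code unit (U+C3B3 or U+B3C3, depending on byte order), ISO_8859_1 maps each byte to its own character, and US_ASCII rejects both bytes and substitutes the replacement character U+FFFD. Only the UTF-8 decoder reverses the encoding that actually produced the bytes.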

Though kind of pointless, you can get what you want by passing the desired charset to getBytes, so that the string is encoded with the same charset you later decode it with.

    System.out.println(new String("ó".getBytes(StandardCharsets.UTF_16), StandardCharsets.UTF_16));
    System.out.println(new String("ó".getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.ISO_8859_1));
    System.out.println(new String("ó".getBytes(StandardCharsets.US_ASCII), StandardCharsets.US_ASCII));
    System.out.println(new String("ó".getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8));
    System.out.println(new String("ó".getBytes(StandardCharsets.UTF_16BE), StandardCharsets.UTF_16BE));
    System.out.println(new String("ó".getBytes(StandardCharsets.UTF_16LE), StandardCharsets.UTF_16LE));
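As a rough sketch of what those matched encode/decode pairs do under the hood, the following prints the bytes each charset produces for ó and whether the round trip preserves the text. RoundTripDemo, hex, and roundTrips are illustrative names only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question or answer.
class RoundTripDemo {
    // Renders bytes as space-separated uppercase hex, e.g. "C3 B3".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes and decodes with the same charset; true if the text survives.
    static boolean roundTrips(String s, Charset cs) {
        return s.equals(new String(s.getBytes(cs), cs));
    }

    public static void main(String[] args) {
        String s = "\u00F3"; // ó
        Charset[] charsets = {
            StandardCharsets.UTF_16, StandardCharsets.ISO_8859_1, StandardCharsets.US_ASCII,
            StandardCharsets.UTF_8, StandardCharsets.UTF_16BE, StandardCharsets.UTF_16LE
        };
        for (Charset cs : charsets) {
            System.out.println(cs.name() + " -> [" + hex(s.getBytes(cs))
                    + "] round-trips: " + roundTrips(s, cs));
        }
    }
}
```

Every charset here round-trips ó except US_ASCII, which has no representation for it and encodes it as a single '?' byte (0x3F). Note also that encoding with UTF_16 prepends a byte-order mark, so it yields four bytes (FE FF 00 F3) rather than two.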

You also seem to have some misunderstandings about encodings. An encoding is not just about the number of bytes a character takes: two encodings using the same number of bytes for a character doesn't make them compatible with each other. Also, UTF-8 does not always use one byte per character. It is a variable-length encoding that uses one to four bytes per code point.
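A quick way to convince yourself of the variable-length point is to compare byte counts for a few characters. This is only a sketch; Utf8LengthDemo and its two helper methods are hypothetical names, and the sample characters are arbitrary picks.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question or answer.
class Utf8LengthDemo {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes when encoded as UTF-16BE (no byte-order mark).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",                                    // U+0041
            "\u00F3",                               // U+00F3, ó
            "\u20AC",                               // U+20AC, euro sign
            new String(Character.toChars(0x1F600))  // U+1F600, outside the BMP
        };
        for (String s : samples) {
            System.out.println(String.format("U+%04X | UTF-8 bytes: %d | UTF-16BE bytes: %d",
                    s.codePointAt(0), utf8Length(s), utf16Length(s)));
        }
    }
}
```

The UTF-8 lengths come out as 1, 2, 3, and 4 bytes, while UTF-16BE uses 2, 2, 2, and 4. The two encodings happen to agree on two bytes for ó, but the bytes themselves differ (C3 B3 versus 00 F3), which is why one cannot be decoded as the other.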
