
Java UTF-8 encoding - char, String types

public class UTF8 {
    public static void main(String[] args) {
        String s = "\uFF6E"; // U+FF6E, halfwidth katakana small yo
        // byte count returned by getBytes() (platform default charset)
        System.out.println(s.getBytes().length);
        // first char (UTF-16 code unit) in the string
        System.out.println(s.charAt(0));
    }
}

output:

3
ｮ

Please help me understand this. I am trying to understand how UTF-8 encoding works in Java. According to the Java documentation's definition of char: "The char data type is a single 16-bit Unicode character."

Does this mean that the char type in Java can only represent Unicode characters that fit in 2 bytes, and nothing larger?

In the above program, the number of bytes reported for the string is 3, yet the line that returns the first character (a char, which is 2 bytes in Java) seems to hold a character that is 3 bytes long. How is that possible? I'm really confused here.

Any good references on this concept, in Java or in general, would be much appreciated.

Nothing in your code example is directly using UTF-8. Java strings use UTF-16 instead: each char is one 16-bit UTF-16 code unit. Unicode code points that do not fit in a single 16-bit char are encoded as a pair of chars known as a surrogate pair.
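To make the surrogate-pair point concrete, here is a minimal sketch. The class and method names (SurrogateDemo, charCount, codePointCount) are made up for illustration, and U+1F600 is just an arbitrary code point outside the 16-bit range.

```java
class SurrogateDemo {
    // Number of UTF-16 code units (Java chars) in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "\uFF6E";                                   // fits in one char
        String astral = new String(Character.toChars(0x1F600));  // needs a surrogate pair

        System.out.println(charCount(bmp) + " " + codePointCount(bmp));       // 1 1
        System.out.println(charCount(astral) + " " + codePointCount(astral)); // 2 1
        System.out.println(Character.isHighSurrogate(astral.charAt(0)));      // true
    }
}
```

So a single char can hold any code point up to U+FFFF, and anything above that takes two chars.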

If you do not pass an argument to String.getBytes(), it returns a byte array containing the String's contents encoded using the platform's default charset. If you want to ensure a UTF-8 encoded array, you need to use getBytes("UTF-8") instead.
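As a rough illustration of how the byte count depends on the charset you ask for, something like the following should work. GetBytesDemo and encodedLength are hypothetical names. The sketch uses the getBytes(Charset) overload with StandardCharsets constants (Java 7+), which avoids the checked UnsupportedEncodingException that getBytes("UTF-8") declares.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class GetBytesDemo {
    // Length of the string once encoded with an explicitly chosen charset.
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String s = "\uFF6E";

        // Whatever the no-argument getBytes() would use on this machine.
        System.out.println("default charset: " + Charset.defaultCharset());

        System.out.println(encodedLength(s, StandardCharsets.UTF_8));    // 3
        System.out.println(encodedLength(s, StandardCharsets.UTF_16BE)); // 2
        System.out.println(encodedLength(s, StandardCharsets.UTF_16));   // 4 (2-byte BOM + 2 bytes)
    }
}
```

The same one-char string gives three different lengths, which is why the 3 in your output says nothing about how the String is stored in memory.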

Calling String.charAt() simply returns a UTF-16 encoded char from the String's in-memory storage.

So in your example, the Unicode character is stored in the String's in-memory storage as two UTF-16 encoded bytes (0x6E 0xFF or 0xFF 0x6E, depending on endianness), but is stored in the byte array from getBytes() as three bytes encoded using whatever the default charset is.

In UTF-8, that particular Unicode character happens to use 3 bytes as well (0xEF 0xBD 0xAE).
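If you want to check those byte values yourself, a small hex dump along these lines should do. HexDumpDemo and hex are names invented for this sketch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class HexDumpDemo {
    // Encodes the string with the given charset and formats each byte as 0xNN.
    static String hex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", b & 0xFF)); // mask to avoid sign extension
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\uFF6E";
        System.out.println("UTF-16BE: " + hex(s, StandardCharsets.UTF_16BE)); // 0xFF 0x6E
        System.out.println("UTF-16LE: " + hex(s, StandardCharsets.UTF_16LE)); // 0x6E 0xFF
        System.out.println("UTF-8:    " + hex(s, StandardCharsets.UTF_8));    // 0xEF 0xBD 0xAE
    }
}
```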

String.getBytes() returns the bytes using the platform's default character encoding, which does not necessarily match the internal representation.

You're best off avoiding this method in most cases, because it rarely makes sense to rely on the platform's default encoding. Use String.getBytes(String charsetName) instead and explicitly specify the character set that should be used to encode your String into bytes.

UTF-8 is a variable-length encoding that uses only one byte for ASCII chars (values between 0 and 127), and two, three, or four bytes for other Unicode symbols.

This is because the high-order bits of each byte are used to signal "this is part of a multi-byte sequence", so not all 8 bits represent "real" data (the char code); some are only there to mark the byte.
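To show where those marker bits go, here is a hand-rolled encoder for a single code point, compared against the JDK's own result. This is only a sketch: ManualUtf8 and encode are hypothetical names, and the method assumes a valid code point in U+0000..U+10FFFF without checking for surrogates or out-of-range input.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ManualUtf8 {
    // Encodes one code point to UTF-8 by hand (no validation, illustration only).
    static byte[] encode(int cp) {
        if (cp < 0x80) {
            return new byte[] { (byte) cp };                    // 0xxxxxxx
        } else if (cp < 0x800) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),                      // 110xxxxx
                (byte) (0x80 | (cp & 0x3F))                     // 10xxxxxx
            };
        } else if (cp < 0x10000) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),                     // 1110xxxx
                (byte) (0x80 | ((cp >> 6) & 0x3F)),             // 10xxxxxx
                (byte) (0x80 | (cp & 0x3F))                     // 10xxxxxx
            };
        } else {
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),                     // 11110xxx
                (byte) (0x80 | ((cp >> 12) & 0x3F)),            // 10xxxxxx
                (byte) (0x80 | ((cp >> 6) & 0x3F)),             // 10xxxxxx
                (byte) (0x80 | (cp & 0x3F))                     // 10xxxxxx
            };
        }
    }

    public static void main(String[] args) {
        byte[] manual = encode(0xFF6E);
        byte[] library = "\uFF6E".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(manual, library)); // true
    }
}
```

The 16 data bits of U+FF6E do not fit in two bytes once the marker bits are taken out, which is why the character needs three bytes in UTF-8.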

So, although Java uses 2 bytes in RAM for each char, when chars are "serialized" using UTF-8 each may produce one, two, or three bytes in the resulting byte array (a surrogate pair of two chars produces four bytes). That's how the UTF-8 encoding works.
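A quick way to see this for yourself is to compare the char count with the UTF-8 byte count for a few sample strings. Utf8Lengths and utf8Length are names made up for this sketch, and the sample characters are arbitrary picks from each size class.

```java
import java.nio.charset.StandardCharsets;

class Utf8Lengths {
    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",            // U+0041, ASCII
            "\u00E9",       // U+00E9, Latin small e with acute
            "\uFF6E",       // U+FF6E, the character from the question
            "\uD83D\uDE00"  // U+1F600, stored as a surrogate pair
        };
        for (String s : samples) {
            System.out.println(s.length() + " char(s) -> " + utf8Length(s) + " byte(s)");
        }
        // 1 char(s) -> 1 byte(s)
        // 1 char(s) -> 2 byte(s)
        // 1 char(s) -> 3 byte(s)
        // 2 char(s) -> 4 byte(s)
    }
}
```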


 