
Which encoding does Java use: UTF-8 or UTF-16?

I've already read the following posts:

  1. What is Java's internal representation for String? Modified UTF-8? UTF-16?
  2. https://docs.oracle.com/javase/8/docs/api/java/lang/String.html

Now consider the code given below:

public static void main(String[] args) {
    printCharacterDetails("最");
}

public static void printCharacterDetails(String character){
    System.out.println("Unicode Value for "+character+"="+Integer.toHexString(character.codePointAt(0)));
    byte[] bytes = character.getBytes();
    System.out.println("The UTF-8 Character="+character+"  | Default: Number of Bytes="+bytes.length);
    String stringUTF16 = new String(bytes, StandardCharsets.UTF_16);
    System.out.println("The corresponding UTF-16 Character="+stringUTF16+"  | UTF-16: Number of Bytes="+stringUTF16.getBytes().length);
    System.out.println("----------------------------------------------------------------------------------------");
}

When I tried to debug the line character.getBytes() in the code above, the debugger took me into the getBytes() method of the String class and then into the static byte[] encode(char[] ca, int off, int len) method of the StringCoding class. The first line of the encode method ( String csn = Charset.defaultCharset().name(); ) returned "UTF-8" as the default encoding during debugging. I expected it to be "UTF-16".

The output of the program is:

Unicode Value for 最=6700
The UTF-8 Character=最 | Default: Number of Bytes=3
The corresponding UTF-16 Character= | UTF-16: Number of Bytes=6

When I converted it to UTF-16 explicitly in the program, it took 6 bytes to represent the character. Shouldn't it use 2 or 4 bytes for UTF-16? Why were 6 bytes used?

Where am I going wrong in my understanding? I use Ubuntu 14.04 and the locale command shows the following:

LANG=en_US.UTF-8

Does this mean that the JVM decides which encoding to use on the basis of the underlying OS, or does it use UTF-16 only? Please help me understand the concept.

Characters are graphical entities that are part of human culture. When a computer needs to handle text, it uses a representation of those characters in bytes. The exact representation used is called an encoding.

There are many encodings that can represent the same character - either through the Unicode character set, or through other character sets like the various ISO-8859 encodings, or JIS X 0208.

Internally, Java uses UTF-16. This means that each character is represented by one or two 16-bit code units, that is, by one or two sequences of two bytes. The character you were using, 最, has the code point U+6700, which is represented in UTF-16 as the byte 0x67 followed by the byte 0x00.
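The "one or two sequences of two bytes" point can be checked from inside a program without dumping memory, because a Java char is one UTF-16 code unit. The following is only a small sketch: the class name CodeUnitDemo, its helper methods, and the choice of U+1F600 as a character outside the two-byte range are assumptions made for illustration.

```java
class CodeUnitDemo {
    // Number of 16-bit UTF-16 code units (Java chars) in the string.
    public static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points (characters) in the string.
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "\u6700"; // 最, U+6700
        // U+1F600 needs two code units (a surrogate pair) in UTF-16.
        String supplementary = new String(Character.toChars(0x1F600));

        System.out.println(Integer.toHexString(bmp.charAt(0)));           // 6700
        System.out.println(codeUnits(bmp) + " " + codePoints(bmp));       // 1 1
        System.out.println(codeUnits(supplementary) + " "
                + codePoints(supplementary));                             // 2 1
        System.out.println(Integer.toHexString(supplementary.charAt(0))); // d83d
        System.out.println(Integer.toHexString(supplementary.charAt(1))); // de00
    }
}
```

So 最 occupies one char (two bytes in UTF-16), while a character such as U+1F600 occupies two chars (four bytes).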

That's the internal encoding. You can't see it unless you dump your memory and look at the bytes in the dumped image.

But the method getBytes() does not return this internal representation. Its documentation says:

public byte[] getBytes()

Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.

The "platform's default charset" is what your locale variables say it is (LANG=en_US.UTF-8 in your case). That is, UTF-8. So getBytes() takes the UTF-16 internal representation and converts it into a different representation - UTF-8.
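To see this conversion for yourself, you can print the default charset and the actual bytes. This is a minimal sketch; the class name DefaultCharsetDemo and the hex helper are made up for the example, and the output of the first two print statements depends on the machine it runs on.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetDemo {
    // Formats a byte array as space-separated uppercase hex pairs.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u6700"; // 最

        // Platform dependent: UTF-8 on a system with LANG=en_US.UTF-8.
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("getBytes():      " + hex(s.getBytes()));

        // Independent of the platform: always the UTF-8 encoding of U+6700.
        System.out.println("getBytes(UTF_8): "
                + hex(s.getBytes(StandardCharsets.UTF_8))); // E6 9C 80
    }
}
```

Passing the charset explicitly, as in the last call, avoids depending on the locale at all.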

Note that

new String(bytes, StandardCharsets.UTF_16);

does not "convert it to UTF-16 explicitly" as you assumed it does. This string constructor takes a sequence of bytes, which is supposed to be in the encoding that you have given in the second argument, and converts it to the UTF-16 representation of whatever characters those bytes represent in that encoding.

But you have given it a sequence of bytes encoded in UTF-8, and told it to interpret that as UTF-16. This is wrong, and you do not get the character - or the bytes - that you expect.
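This also accounts for the 6 bytes in your output. The sketch below reproduces the mistake with explicit charsets (the class name MisdecodeDemo and the method misdecode are invented for the example, and UTF-8 is passed explicitly where your program relied on the default). The three UTF-8 bytes E6 9C 80 are read as UTF-16: the first two bytes become the unrelated character U+E69C, and the leftover single byte cannot form a code unit, so it is replaced with U+FFFD. Encoding those two characters back to UTF-8 takes 3 bytes each.

```java
import java.nio.charset.StandardCharsets;

class MisdecodeDemo {
    // Encodes the string as UTF-8, then (incorrectly) decodes the bytes as UTF-16.
    public static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        String wrong = misdecode("\u6700"); // 最

        // Show which characters the misinterpreted bytes turned into.
        for (char c : wrong.toCharArray()) {
            System.out.println("U+" + Integer.toHexString(c).toUpperCase());
        }

        // Re-encoding the wrong string as UTF-8 gives the 6 bytes from the question.
        System.out.println(wrong.getBytes(StandardCharsets.UTF_8).length); // 6
    }
}
```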

You can't tell Java how to internally store strings. It always stores them as UTF-16. The constructor String(byte[],Charset) tells Java to create a UTF-16 string from an array of bytes that is supposed to be in the given character set. The method getBytes(Charset) tells Java to give you a sequence of bytes that represent the string in the given encoding (charset). And the method getBytes() without an argument does the same - but uses your platform's default character set for the conversion.
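What you probably wanted was to ask for the bytes in a specific encoding, along the lines of the sketch below (the class name ExplicitCharsetDemo and the method encodedLength are illustrative only). One detail to be aware of: Java's "UTF-16" charset writes a two-byte byte-order mark (FE FF) before the data when encoding, so 最 comes out as 4 bytes; UTF_16BE and UTF_16LE write no mark and give the 2 bytes you were expecting.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {
    // Number of bytes needed to represent the string in the given charset.
    public static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String s = "\u6700"; // 最

        System.out.println(encodedLength(s, StandardCharsets.UTF_8));    // 3
        System.out.println(encodedLength(s, StandardCharsets.UTF_16));   // 4 (FE FF 67 00)
        System.out.println(encodedLength(s, StandardCharsets.UTF_16BE)); // 2 (67 00)
        System.out.println(encodedLength(s, StandardCharsets.UTF_16LE)); // 2 (00 67)

        // Decoding with the same charset that produced the bytes restores the string.
        byte[] utf16 = s.getBytes(StandardCharsets.UTF_16);
        System.out.println(new String(utf16, StandardCharsets.UTF_16).equals(s)); // true
    }
}
```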

So you misunderstood what getBytes() gives you. It's not the internal representation. You can't get that directly. Only getBytes(StandardCharsets.UTF_16) will give you that, and only because you know that UTF-16 is the internal representation in Java. If a future version of Java decided to represent the characters in a different encoding, then getBytes(StandardCharsets.UTF_16) would not show you the internal representation.

Edit: in fact, Java 9 introduced just such a change in the internal representation of strings. By default, strings whose characters all fall in the ISO-8859-1 range are internally represented in ISO-8859-1, whereas strings with at least one character outside that range are internally represented in UTF-16 as before. So indeed, getBytes(StandardCharsets.UTF_16) no longer returns the internal representation.

As stated above, Java uses UTF-16 as the encoding for character data.

To which it may be added that a single Java char is one 16-bit UTF-16 code unit, so on its own it can hold only a character from the Unicode Basic Multilingual Plane (BMP), all of which fit in two bytes of UTF-16. Characters outside the BMP are not excluded: a String stores each of them as a surrogate pair of two chars.

So the encoding applied is indeed UTF-16, and every char in a String's UTF-16 form takes two bytes, but a character (code point) takes either two or four bytes depending on whether it lies inside the BMP.


 