
What is the character encoding of String in Java?

Original post 2010-12-15 18:00:36 2 4 java/ string/ character-encoding

I am confused about the encoding of strings in Java. I have a couple of questions. Please help me if you know the answers:

1) What is the native encoding of Java strings in memory? When I write String a = "Hello", in which format will it be stored? Since Java is machine independent, I don't think the system will do the encoding.

2) I read on the net that "UTF-16" is the default encoding, but I got confused because when I write int a = 'c' I get the number of the character in the ASCII table. So are ASCII and UTF-16 the same?

3) Also, I wasn't sure what the storage of a string in memory depends on: the OS, the language?

4 answers

Java stores strings as UTF-16 internally.
"Default encoding" isn't quite right. Java stores strings as UTF-16 internally, but the encoding used externally, the "system default encoding", varies from platform to platform, and can even be altered by things like environment variables on some platforms.
ASCII is a subset of Latin 1, which is a subset of Unicode. UTF-16 is a way of encoding Unicode. So if you perform your int i = 'x' test for any character that falls in the ASCII range, you'll get the ASCII value. UTF-16 can represent a lot more characters than ASCII, however.
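To make the subset relationship concrete, here is a minimal sketch. The class name, helper methods, and sample characters are my own, chosen only for illustration. It compares the int value of a char with the byte you get when encoding the same character as Latin 1. Characters are written as Unicode escapes so the result does not depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names and sample characters are illustrative only.
final class AsciiSubsetDemo {

    // Widening a char to int yields the numeric value of its UTF-16 code unit.
    static int codeUnit(char c) {
        return c;
    }

    // Encodes one char as ISO-8859-1 (Latin 1) and returns the unsigned byte value.
    // String.getBytes substitutes '?' (63) for characters the charset cannot represent.
    static int latin1Byte(char c) {
        byte[] bytes = String.valueOf(c).getBytes(StandardCharsets.ISO_8859_1);
        return bytes[0] & 0xFF;
    }

    public static void main(String[] args) {
        int i = 'x';
        System.out.println(i);                     // 120, the ASCII value of 'x'
        System.out.println(latin1Byte('x'));       // 120, same value in Latin 1
        System.out.println(codeUnit('\u00E9'));    // 233, e with acute: in Latin 1, not in ASCII
        System.out.println(latin1Byte('\u00E9'));  // 233
        System.out.println(codeUnit('\u20AC'));    // 8364, euro sign: in Unicode, not in Latin 1
        System.out.println(latin1Byte('\u20AC'));  // 63, replaced by '?'
    }
}
```

So the numbers agree as long as the character exists in the smaller set, and diverge once you leave it.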
From the java.lang.Character docs:

The Java 2 platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes.

So it's defined as part of the Java 2 platform that UTF-16 is used for these classes.
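The difference between the UTF-16 char view and an external encoding shows up when you turn a String into bytes. The following is only a sketch (the class and method names are made up for this example): the same five chars come out as different byte counts depending on the charset you ask for, and the platform default is whatever Charset.defaultCharset() reports on your machine.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative only.
final class InternalVsExternalDemo {

    // Number of bytes the string occupies once encoded with the given charset.
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String a = "Hello";
        System.out.println(a.toCharArray().length);                         // 5 chars
        System.out.println(encodedLength(a, StandardCharsets.UTF_16BE));   // 10 bytes, two per char
        System.out.println(encodedLength(a, StandardCharsets.UTF_8));      // 5 bytes
        System.out.println(encodedLength(a, StandardCharsets.ISO_8859_1)); // 5 bytes
        // Platform dependent, so no expected value is given here.
        System.out.println(Charset.defaultCharset());
    }
}
```

I used UTF_16BE rather than UTF_16 because the latter prepends a byte order mark when encoding, which would add two bytes to the count.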

1) Strings are objects, which typically contain a char array and the string's length. The character array is usually implemented as a contiguous array of 16-bit words, each one containing a Unicode character in native byte order.

2) Assigning a character value to an integer converts the 16-bit Unicode character code into its integer equivalent. Thus 'c', which is U+0063, becomes 0x0063, or 99.

3) Since each String is an object, it contains other information besides its class members (e.g., class descriptor word, lock/semaphore word, etc.).

ADDENDUM
The object contents depend on the JVM implementation (which determines the inherent overhead associated with each object), and on how the class is actually coded (i.e., some libraries may be more efficient than others).

EXAMPLE
A typical implementation will allocate an overhead of two words per object instance (for the class descriptor/pointer, and a semaphore/lock control word); a String object also contains an int length and a char[] array reference. The actual character contents of the string are stored in a second object, the char[] array, which in turn is allocated two words, plus an array length word, plus as many 16-bit char elements as needed for the string (plus any extra chars that were left hanging around when the string was created).
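The layout described above can be turned into a rough back-of-the-envelope calculation. This is only a sketch of that description, not a measurement of any real JVM: the word size is a parameter, the reference is assumed to take one word, and alignment padding and leftover chars are ignored. The class and method names are hypothetical.

```java
// Hypothetical sketch of the layout described above; real JVMs differ.
final class StringFootprintSketch {

    static final int INT_BYTES = 4;   // a Java int is always 32 bits
    static final int CHAR_BYTES = 2;  // a Java char is always 16 bits

    // wordBytes is an assumption about the JVM (for example 4 on a 32-bit VM).
    static long estimateBytes(int charCount, int wordBytes) {
        long stringObject = 2L * wordBytes   // class descriptor word + lock word
                + INT_BYTES                  // int length field
                + wordBytes;                 // char[] reference, assumed to be one word
        long charArray = 2L * wordBytes      // class descriptor word + lock word
                + wordBytes                  // array length word
                + (long) CHAR_BYTES * charCount;
        return stringObject + charArray;
    }

    public static void main(String[] args) {
        System.out.println(estimateBytes(5, 4)); // 38 for "Hello" with 4-byte words
        System.out.println(estimateBytes(5, 8)); // 62 for "Hello" with 8-byte words
    }
}
```

The point is mainly that the fixed overhead dominates for short strings: of the 38 bytes in the first case, only 10 are actual character data.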

ADDENDUM 2
The statement that one char represents one Unicode character holds in most cases, but not all. It would imply UCS-2 encoding, which was true before 2005. Since then Unicode has grown, and Strings have to be encoded using UTF-16, where, alas, a single Unicode character may take two chars in a Java String.
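A small sketch of the two-chars-per-character case, using a supplementary code point (U+1F600 is just a sample I picked; the class and method names are made up). length() counts UTF-16 code units, while codePointCount counts Unicode characters.

```java
// Hypothetical demo class; the sample code point is illustrative only.
final class SurrogatePairDemo {

    // Number of 16-bit chars (UTF-16 code units) in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "c";
        // U+1F600 lies outside the Basic Multilingual Plane, so it needs a surrogate pair.
        String supplementary = new String(Character.toChars(0x1F600));

        System.out.println(charCount(bmp));                 // 1
        System.out.println(codePointCount(bmp));            // 1
        System.out.println(charCount(supplementary));       // 2
        System.out.println(codePointCount(supplementary));  // 1
        System.out.println(Character.isHighSurrogate(supplementary.charAt(0))); // true
        System.out.println(supplementary.codePointAt(0) == 0x1F600);            // true
    }
}
```

This is why code that walks a String char by char can split such a character in half; iterating by code point avoids that.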

Take a look at the actual source code for Apache's implementation, e.g. at:
http://www.docjar.com/html/api/java/lang/String.java.html

While this doesn't answer your question, it is worth noting that in the Java byte code (class file), strings are stored in UTF-8 (strictly speaking, the class file format's modified UTF-8). http://java.sun.com/docs/books/jvms/second_edition/html/ClassFile.doc.html
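You can get a feel for this without parsing a class file. DataOutputStream.writeUTF writes a two-byte length followed by the string in Java's "modified UTF-8", which to my knowledge is essentially the encoding the class file format describes for its string constants. The sketch below (hypothetical class and method names) only demonstrates that encoding; it does not read or write an actual class file.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it shows writeUTF output, not real class file parsing.
final class ModifiedUtf8Demo {

    // Returns the bytes produced by writeUTF: a 2-byte length, then modified UTF-8 data.
    static byte[] writeUtf(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(writeUtf("Hello").length);                          // 7: 2 length bytes + 5
        System.out.println("Hello".getBytes(StandardCharsets.UTF_8).length);   // 5

        // The NUL character is where modified UTF-8 visibly differs from standard UTF-8.
        String nul = "\u0000";
        System.out.println(writeUtf(nul).length);                              // 4: 2 length bytes + 0xC0 0x80
        System.out.println(nul.getBytes(StandardCharsets.UTF_8).length);       // 1
    }
}
```

For plain ASCII text the two encodings produce identical data bytes, which is why the distinction rarely matters in practice.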

Edit: thanks to LoadMaster for helping me correct my answer :)

1) All internal String processing is done in UTF-16.

2) ASCII is a subset of Unicode: for an ASCII character, the UTF-16 code unit has the same numeric value as its ASCII code (although UTF-16 uses two bytes per code unit, so the byte encodings are not identical).

3) Internally, Java uses UTF-16. For the rest, yes, it depends on your platform.