
Why is Java String.length inconsistent across platforms with unicode characters?

According to the Java documentation for String.length:

public int length()

Returns the length of this string.

The length is equal to the number of Unicode code units in the string.

Specified by:

length in interface CharSequence

Returns:

the length of the sequence of characters represented by this object.

But then I don't understand why the following program, HelloUnicode.java, produces different results on different platforms. According to my understanding, the number of Unicode code units should be the same, since Java supposedly always represents strings in UTF-16:

public class HelloUnicode {

    public static void main(String[] args) {
        String myString = "I have a 🙂 in my string";
        System.out.println("String: " + myString);
        System.out.println("Bytes: " + bytesToHex(myString.getBytes()));
        System.out.println("String Length: " + myString.length());
        System.out.println("Byte Length: " + myString.getBytes().length);
        System.out.println("Substring 9 - 13: " + myString.substring(9, 13));
        System.out.println("Substring Bytes: " + bytesToHex(myString.substring(9, 13).getBytes()));
    }

    // Code from https://stackoverflow.com/a/9855338/4019986
    private final static char[] hexArray = "0123456789ABCDEF".toCharArray();
    public static String bytesToHex(byte[] bytes) {
        char[] hexChars = new char[bytes.length * 2];
        for ( int j = 0; j < bytes.length; j++ ) {
            int v = bytes[j] & 0xFF;
            hexChars[j * 2] = hexArray[v >>> 4];
            hexChars[j * 2 + 1] = hexArray[v & 0x0F];
        }
        return new String(hexChars);
    }

}

The output of this program on my Windows box is:

String: I have a 🙂 in my string
Bytes: 492068617665206120F09F998220696E206D7920737472696E67
String Length: 26
Byte Length: 26
Substring 9 - 13: 🙂
Substring Bytes: F09F9982

The output on my CentOS 7 machine is:

String: I have a 🙂 in my string
Bytes: 492068617665206120F09F998220696E206D7920737472696E67
String Length: 24
Byte Length: 26
Substring 9 - 13: 🙂 i
Substring Bytes: F09F99822069

I ran both with Java 1.8. Same byte length, different String length. Why?

UPDATE

By replacing the "🙂" in the string with "\uD83D\uDE42", I get the following results:

Windows:

String: I have a ? in my string
Bytes: 4920686176652061203F20696E206D7920737472696E67
String Length: 24
Byte Length: 23
Substring 9 - 13: ? i
Substring Bytes: 3F2069

CentOS:

String: I have a 🙂 in my string
Bytes: 492068617665206120F09F998220696E206D7920737472696E67
String Length: 24
Byte Length: 26
Substring 9 - 13: 🙂 i
Substring Bytes: F09F99822069

Why "\uD83D\uDE42" ends up being encoded as 0x3F on the Windows machine is beyond me...

Java Versions:

Windows:

java version "1.8.0_211"
Java(TM) SE Runtime Environment (build 1.8.0_211-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.211-b12, mixed mode)

CentOS:

openjdk version "1.8.0_201"
OpenJDK Runtime Environment (build 1.8.0_201-b09)
OpenJDK 64-Bit Server VM (build 25.201-b09, mixed mode)

Update 2

Using .getBytes("utf-8"), with the "🙂" embedded in the string literal, here are the outputs.

Windows:

String: I have a 🙂 in my string
Bytes: 492068617665206120C3B0C5B8E284A2E2809A20696E206D7920737472696E67
String Length: 26
Byte Length: 32
Substring 9 - 13: 🙂
Substring Bytes: C3B0C5B8E284A2E2809A

CentOS:

String: I have a 🙂 in my string
Bytes: 492068617665206120F09F998220696E206D7920737472696E67
String Length: 24
Byte Length: 26
Substring 9 - 13: 🙂 i
Substring Bytes: F09F99822069

So yes, it appears to be a difference in system encoding. But then that means string literals are encoded differently on different platforms? That sounds like it could be problematic in certain situations.

Also... where is the byte sequence C3B0C5B8E284A2E2809A coming from to represent the smiley in Windows? That doesn't make sense to me.

For completeness, using .getBytes("utf-16"), with the "🙂" embedded in the string literal, here are the outputs.

Windows:

String: I have a 🙂 in my string
Bytes: FEFF00490020006800610076006500200061002000F001782122201A00200069006E0020006D007900200073007400720069006E0067
String Length: 26
Byte Length: 54
Substring 9 - 13: 🙂
Substring Bytes: FEFF00F001782122201A

CentOS:

String: I have a 🙂 in my string
Bytes: FEFF004900200068006100760065002000610020D83DDE4200200069006E0020006D007900200073007400720069006E0067
String Length: 24
Byte Length: 50
Substring 9 - 13: 🙂 i
Substring Bytes: FEFFD83DDE4200200069

You have to be careful about specifying the encodings:

  • when you compile the Java file, the compiler uses some encoding to read the source file. My guess is that this already broke your original String literal at compilation. This can be fixed by using the escape sequence.
  • after you use the escape sequence, the String.length values are the same. The bytes inside the String are also the same, but what you are printing out does not show that.
  • the bytes printed are different because you called getBytes(), which again uses the environment- or platform-specific encoding. So it was also broken (replacing unencodable smilies with a question mark). You need to call getBytes("UTF-8") to be platform-independent.
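The two fixes above can be sketched together in one small program (a minimal sketch; the class name is made up). With the `\uD83D\uDE42` escape and an explicit charset, neither the compiler's source-file encoding nor the runtime's default encoding matters:

```java
import java.nio.charset.StandardCharsets;

public class PortableLength {
    public static void main(String[] args) {
        // Escape sequence instead of a raw 🙂 in the source file:
        // the compiler's source-file encoding no longer affects the literal.
        String myString = "I have a \uD83D\uDE42 in my string";

        // 24 UTF-16 code units on every platform
        // (the emoji is one surrogate pair = 2 code units).
        System.out.println("String Length: " + myString.length());

        // Explicit charset instead of the platform default:
        // 26 bytes on every platform (the emoji is 4 bytes in UTF-8).
        System.out.println("Byte Length: "
                + myString.getBytes(StandardCharsets.UTF_8).length);
    }
}
```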

So to answer the specific questions posed:

Same byte length, different String length. Why?

Because the string literal is encoded by the Java compiler, and by default the Java compiler uses a different encoding on different systems. This can result in a different number of UTF-16 code units per Unicode character, which results in a different string length. Passing the -encoding command line option with the same value on every platform will make the compilers encode consistently.

Why "\uD83D\uDE42" ends up being encoded as 0x3F on the Windows machine is beyond me...

It's not encoded as 0x3F in the string. 0x3F is the question mark. Java inserts it when asked to output invalid characters via System.out.println or getBytes, which was the case here: you had encoded literal UTF-16 representations into a string with a different encoding, then tried to print it to the console and call getBytes on it.
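The replacement behaviour is easy to reproduce in isolation. A minimal sketch (class name made up): String.getBytes(Charset) always substitutes the target charset's default replacement for unmappable characters, which for US-ASCII is the single byte 0x3F, i.e. '?':

```java
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {
    public static void main(String[] args) {
        // U+00E9 ('é', written as an escape so the source encoding
        // cannot interfere) has no US-ASCII representation, so the
        // encoder emits its default replacement byte, 0x3F ('?').
        byte[] ascii = "\u00E9".getBytes(StandardCharsets.US_ASCII);
        System.out.printf("%02X%n", ascii[0]); // prints 3F
    }
}
```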

But then that means string literals are encoded differently on different platforms?

By default, yes.

Also... where is the byte sequence C3B0C5B8E284A2E2809A coming from to represent the smiley in Windows?

This is quite convoluted. The "🙂" character (Unicode code point U+1F642) is stored in the Java source file with UTF-8 encoding as the byte sequence F0 9F 99 82. The Java compiler then reads the source file using the platform default encoding, Cp1252 (Windows-1252), so it treats these UTF-8 bytes as though they were Cp1252 characters, producing a 4-character string by translating each byte from Cp1252 to Unicode: U+00F0 U+0178 U+2122 U+201A. The getBytes("utf-8") call then converts this 4-character string back into bytes by encoding it as UTF-8. Since every character of the string is above hex 7F, each character becomes 2 or more UTF-8 bytes; hence the resulting sequence being this long. The value of this sequence is not significant; it is just the result of using an incorrect encoding.
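The round trip described above can be reproduced directly. A sketch (class name made up), assuming the windows-1252 charset is available in the JDK, as it is in standard Oracle/OpenJDK builds:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        // The UTF-8 bytes of U+1F642 as stored in the source file.
        byte[] utf8Smiley = {(byte) 0xF0, (byte) 0x9F, (byte) 0x99, (byte) 0x82};

        // The compiler misreads them as Cp1252: F0 -> U+00F0, 9F -> U+0178,
        // 99 -> U+2122, 82 -> U+201A, yielding a 4-character string.
        String misread = new String(utf8Smiley, Charset.forName("windows-1252"));

        // Re-encoding those 4 characters as UTF-8 yields 10 bytes.
        StringBuilder hex = new StringBuilder();
        for (byte b : misread.getBytes(StandardCharsets.UTF_8)) {
            hex.append(String.format("%02X", b));
        }
        System.out.println(hex); // prints C3B0C5B8E284A2E2809A
    }
}
```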

You didn't take into account that getBytes() returns the bytes in the platform's default encoding. This differs between Windows and CentOS.
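A quick way to see which default is in play on a given machine (a small sketch; the class name is made up, and the output varies by platform, so none is shown):

```java
import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    public static void main(String[] args) {
        // The charset that no-argument getBytes() and new String(byte[]) use.
        System.out.println(Charset.defaultCharset());
        // The same information as exposed via the system property.
        System.out.println(System.getProperty("file.encoding"));
    }
}
```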

See also How to Find the Default Charset/Encoding in Java? and the API documentation on String.getBytes().
