Java String constructed from a byte array has an incorrect length

Question

I'm having a difficult time understanding the rationale behind the semantics of the Java String(byte[]) constructors (Java 6). The length of the resulting String object is usually wrong. Perhaps someone here can explain why this makes any sense.

Consider the following small Java program:

import java.nio.charset.Charset;

public class Test {
    public static void main(String[] args) {
        String abc1 = new String("abc");
        byte[] bytes = new byte[32];

        bytes[0] = 0x61; // 'a'
        bytes[1] = 0x62; // 'b'
        bytes[2] = 0x63; // 'c'
        bytes[3] = 0x00; // NUL

        String abc2 = new String(bytes, Charset.forName("US-ASCII"));

        System.out.println("abc1: \"" + abc1 + "\" length: " + abc1.length());
        System.out.println("abc2: \"" + abc2 + "\" length: " + abc2.length());

        System.out.println("\"" + abc1 + "\" " +
                (abc1.equals(abc2) ? "==" : "!=") + " \"" + abc2 + "\"");
    }
}

The output of this program is:

abc1: "abc" length: 3
abc2: "abc" length: 32
"abc" != "abc"

The documentation for the String(byte[]) constructor states, "The length of the new String is a function of the charset, and hence may not be equal to the length of the byte array." True indeed, and in the US-ASCII character set, the length of the string "abc" is 3, not 32.

Strangely, even though abc2 contains no whitespace characters, abc2.trim() returns the same string, but with the length adjusted to the correct value of 3, and abc1.equals(abc2.trim()) returns true... Am I missing something obvious?

Yes, I realize I can pass an explicit length into the constructor; I'm just trying to understand the default semantics.

Answer 1

In Java, strings are not null-delimited. The string constructed from the byte array uses the entire length of the array. Since 0x00 converts one-to-one to the character '\0', the resulting string has the same length as the entire array: 32. When it is printed to System.out, the null characters have zero width, so it looks like "abc", but it is really "abc\0\0\0..." (32 characters in total).

The reason trim() fixes this is that it considers '\0' to be white space.

Note that if you want to convert a null-delimited byte representation of a string to a String, you will need to find the index at which to stop. Then (as @Brian notes in his comment), you can use a different String constructor:
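A minimal sketch of that behaviour, reusing the question's setup. The class name NulPaddingDemo and the helper decodeAll are made up for illustration; the only assumption is that the array is decoded as US-ASCII, as in the question.

```java
import java.nio.charset.Charset;

class NulPaddingDemo {
    // Decodes the whole array as US-ASCII, just like the question's code.
    static String decodeAll(byte[] bytes) {
        return new String(bytes, Charset.forName("US-ASCII"));
    }

    public static void main(String[] args) {
        byte[] bytes = new byte[32]; // Java zero-fills new arrays
        bytes[0] = 0x61; // 'a'
        bytes[1] = 0x62; // 'b'
        bytes[2] = 0x63; // 'c'

        String s = decodeAll(bytes);
        System.out.println(s.length());         // 32: one char per byte
        System.out.println((int) s.charAt(3));  // 0: the padding is a real char
        System.out.println(s.startsWith("abc")); // true
    }
}
```

The 29 trailing characters are all '\0', which is why the printed text looks like plain "abc".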

String abc2 = new String(bytes, 0, indexOfFirstNull, Charset.forName("US-ASCII"));

However, this must be done with caution. You are using the US-ASCII character set, where the index of the first zero byte is probably a natural stopping place. However, in many character sets (such as UTF-16), zero bytes can occur as a normal part of the actual text.
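One way to compute the stopping index is a plain linear scan, sketched below. The class NulTerminatedDecoder and its methods are hypothetical helpers, not part of the JDK, and the approach assumes a single-byte encoding such as US-ASCII. The last lines of main illustrate the caveat about UTF-16.

```java
import java.nio.charset.Charset;

class NulTerminatedDecoder {
    // Hypothetical helper: index of the first zero byte,
    // or bytes.length if the array contains none.
    static int indexOfFirstNull(byte[] bytes) {
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] == 0) {
                return i;
            }
        }
        return bytes.length;
    }

    // Decodes only the bytes before the first zero byte.
    // Assumption: a single-byte encoding where 0x00 never occurs in text.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, 0, indexOfFirstNull(bytes), charset);
    }

    public static void main(String[] args) {
        byte[] bytes = new byte[32];
        bytes[0] = 0x61; // 'a'
        bytes[1] = 0x62; // 'b'
        bytes[2] = 0x63; // 'c'

        String abc2 = decode(bytes, Charset.forName("US-ASCII"));
        System.out.println(abc2.length());      // 3
        System.out.println("abc".equals(abc2)); // true

        // UTF-16BE encodes 'a' as 0x00 0x61, so the scan stops immediately.
        byte[] utf16 = "abc".getBytes(Charset.forName("UTF-16BE"));
        System.out.println(indexOfFirstNull(utf16)); // 0
    }
}
```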

Answer 2

The length of the resulting String object is usually wrong.

No, it's right; you've just misunderstood what it's meant to do. It's creating a string based on one character per byte, basically, at least when you use an encoding of US-ASCII.

Strangely, even though abc2 contains no whitespace characters, abc2.trim() returns the same string, but with the length adjusted to the correct value of 3 and abc1.equals(abc2) returns true... Am I missing something obvious?

The docs for trim() state (after two conditions which don't apply):

Otherwise, let k be the index of the first character in the string whose code is greater than '\u0020', and let m be the index of the last character in the string whose code is greater than '\u0020'. A new String object is created, representing the substring of this string that begins with the character at index k and ends with the character at index m; that is, the result of this.substring(k, m+1).

So trim() basically treats "whitespace" as equivalent to "U+0000 to U+0020 inclusive". That's a bizarrely inaccurate (read: predating Unicode, basically) representation of "whitespace", but it does explain the behaviour.

Basically what you're seeing is:

String trailingNulls = "abc\0\0\0\0\0\0";
String trimmed = trailingNulls.trim();
System.out.println(trimmed.length()); // 3

That has nothing to do with constructing a string from a byte array.
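To make the mismatch concrete, the sketch below compares the rule quoted from the trim() docs with Character.isWhitespace. The class TrimVsWhitespace and the method removedByTrim are illustrative names only; removedByTrim restates the documented rule and is not the JDK implementation.

```java
class TrimVsWhitespace {
    // Restates the documented trim() rule: leading and trailing chars
    // with a code of U+0020 or lower are dropped.
    static boolean removedByTrim(char c) {
        return c <= ' ';
    }

    public static void main(String[] args) {
        char nul = '\0';

        System.out.println(removedByTrim(nul));             // true
        System.out.println(Character.isWhitespace(nul));    // false
        System.out.println("\0abc\0\0".trim().length());    // 3
    }
}
```

So the question's claim that abc2 "contains no whitespace characters" is correct in the Unicode sense, while trim() still strips the NUL characters.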

Answer 3

- First of all, String is an object type in Java, so use the equals() method to compare strings.

E.g.:

"abc".equals("abc")

- You can remove the \0 characters from the resulting string with the trim() method; then you will get the result you want.

Answer 4

First of all, the indexes assigned are wrong. They should be:

        bytes[0] = 0x61; // 'a'
        bytes[1] = 0x62; // 'b'
        bytes[2] = 0x63; // 'c'
        bytes[3] = 0x00; // NUL

If you check the equals method of the String class, you will see the reason. It iterates over the char[] and compares the value at each index, so if the two char[] lengths differ it returns false.

while (n-- != 0) {
    if (v1[i++] != v2[j++])
        return false;
}

The fix is to use trim():

abc1.equals(abc2.trim())
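The comparison can be sketched in a simplified form as follows. This is an illustration, not the JDK source: the class EqualsSketch and the method sameChars are invented here, and the sketch works on String values through charAt rather than on the internal arrays.

```java
class EqualsSketch {
    // Simplified illustration of the comparison: a length check
    // followed by a char-by-char loop.
    static boolean sameChars(String a, String b) {
        if (a.length() != b.length()) {
            return false; // different lengths can never be equal
        }
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String abc1 = "abc";
        String abc2 = "abc\0\0\0"; // stands in for the 32-char string

        System.out.println(sameChars(abc1, abc2));        // false
        System.out.println(sameChars(abc1, abc2.trim())); // true
        System.out.println(abc1.equals(abc2.trim()));     // true
    }
}
```

Note that trim() has to be applied to the string that carries the trailing '\0' characters, which is abc2 in the question.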

Javadoc of String#trim():

Otherwise, let k be the index of the first character in the string whose code is greater than '\u0020', and let m be the index of the last character in the string whose code is greater than '\u0020'.

4 answers

Answer 1: score 14, accepted, 2012-10-04 14:53:34

Answer 2: score 5, 2012-10-04 14:55:22

Answer 3: score 0, 2012-10-04 14:56:38

Answer 4: score 0, 2012-10-04 14:58:47