Can an empty Java string be created from a non-empty UTF-8 byte array?

I'm trying to debug something and I'm wondering if the following code could ever return true:

public boolean impossible(byte[] myBytes) throws UnsupportedEncodingException {
  if (myBytes.length == 0)
    return false;
  String string = new String(myBytes, "UTF-8");
  return string.length() == 0;
}

Is there some value I can pass in that will return true? I've fiddled with passing in just the first byte of a 2-byte sequence, but it still produces a single-character string.

To clarify, this happened on a PowerPC chip, in Java 1.4 code compiled through GCJ to a native binary executable. This basically means that most bets are off. I'm mostly wondering whether Java's 'normal' behaviour or Java's spec makes any promises.

According to the javadoc for java.lang.String, the behavior of new String(byte[], "UTF-8") is not specified when the byte array contains invalid or unexpected data. If you want more predictability in your resultant string, use http://java.sun.com/j2se/1.5.0/docs/api/java/nio/charset/CharsetDecoder.html .
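
A minimal sketch of what the CharsetDecoder approach could look like. The class and method names (StrictUtf8, decode) are made up for illustration. The idea is to ask the decoder to report malformed input instead of leaving the outcome to the String constructor:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper, not part of the original question.
final class StrictUtf8 {

    /** Decodes bytes as UTF-8 and throws on invalid input instead of guessing. */
    static String decode(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        // Lead byte of a 2-byte sequence with nothing after it.
        byte[] truncated = { (byte) 0xC3 };
        try {
            System.out.println("decoded length: " + decode(truncated).length());
        } catch (CharacterCodingException e) {
            System.out.println("invalid UTF-8: " + e);
        }
    }
}
```

With REPORT set, a truncated or otherwise invalid sequence surfaces as a CharacterCodingException, so the caller decides what to do rather than relying on unspecified behavior.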

Possibly.

From the Java 5 API docs: "The behavior of this constructor when the given bytes are not valid in the given charset is unspecified."

I guess that it depends on which version of Java you're using and which vendor wrote your JVM (Sun, HP, IBM, the open source one, etc.).

Once the docs say "unspecified", all bets are off.

Edit: Beaten to it by Trey. Take his advice about using a CharsetDecoder.

If Java handles the BOM correctly (I'm not sure whether that has been fixed yet), then it should be possible to pass in a byte array containing just the BOM (U+FEFF, which in UTF-8 is the byte sequence EF BB BF) and get an empty string.
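
A quick way to check the BOM case directly. This is only a sketch, and the class name BomCheck is made up. It decodes the three BOM bytes and reports what comes back:

```java
import java.nio.charset.Charset;

// Hypothetical check, not part of the original question.
final class BomCheck {

    /** Decodes a byte array that contains nothing but the UTF-8 BOM. */
    static String decodeBomOnly() {
        byte[] bom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };
        return new String(bom, Charset.forName("UTF-8"));
    }

    public static void main(String[] args) {
        String s = decodeBomOnly();
        System.out.println("length: " + s.length());
        if (s.length() > 0) {
            // Show which code unit the decoder produced.
            System.out.println("first char: U+" + Integer.toHexString(s.charAt(0)).toUpperCase());
        }
    }
}
```

As far as I know, the Sun and OpenJDK decoders keep the BOM as a single U+FEFF character rather than stripping it, so the string is not empty there. A different runtime such as GCJ may behave differently, which is why it is worth running on the actual target.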


Update:

I tested that method with all values of 1-3 bytes. None of them returned an empty string on Java 1.6. Here is the test code that I used with different byte array lengths:

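import java.io.UnsupportedEncodingException;
import java.util.Arrays;

// The class name is arbitrary; the original snippet showed only the methods.
public class ImpossibleTest {

    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] test = new byte[3];           // current candidate, starts at all zeros
        byte[] end = new byte[test.length];  // all zeros again means the counter has wrapped

        if (impossible(test)) {
            System.out.println(Arrays.toString(test));
        }
        do {
            increment(test, 0);
            if (impossible(test)) {
                System.out.println(Arrays.toString(test));
            }
        } while (!Arrays.equals(test, end));
    }

    // Treats the array as a little-endian counter and adds one, carrying on overflow.
    private static void increment(byte[] arr, int i) {
        arr[i]++;
        if (arr[i] == 0 && i + 1 < arr.length) {
            increment(arr, i + 1);
        }
    }

    public static boolean impossible(byte[] myBytes) throws UnsupportedEncodingException {
        if (myBytes.length == 0) {
            return false;
        }
        String string = new String(myBytes, "UTF-8");
        return string.length() == 0;
    }
}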

UTF-8 is a variable-length encoding scheme, with most "normal" characters being a single byte. So any given non-empty byte[] will always translate into a non-empty String, I'd have thought.

If you want to play it safe, write a unit test that iterates over every possible byte value, passes in a single-element array containing that value, and asserts that the resulting string is non-empty.
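
The single-byte check described above could be sketched roughly like this. It uses a plain main method rather than a test framework to stay self-contained, and the names SingleByteCheck and countEmptyResults are made up:

```java
import java.nio.charset.Charset;

// Hypothetical check, not part of the original question.
final class SingleByteCheck {

    /** Counts how many of the 256 one-byte arrays decode to an empty string. */
    static int countEmptyResults() {
        Charset utf8 = Charset.forName("UTF-8");
        int empty = 0;
        for (int value = 0; value < 256; value++) {
            byte[] single = { (byte) value };
            if (new String(single, utf8).length() == 0) {
                System.out.println("empty result for byte " + value);
                empty++;
            }
        }
        return empty;
    }

    public static void main(String[] args) {
        System.out.println("single-byte inputs decoding to an empty string: " + countEmptyResults());
    }
}
```

This only says something about the runtime it is run on, so for the GCJ case it would need to be compiled and run on the actual target.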
