Java String to byteArray conversion issue

I am trying to convert a byte array to a String and back, but the input and output do not match. Am I doing something wrong?

System.out.println(org.apache.commons.codec.binary.Hex.encodeHexString(by));
String s = new String(by, Charsets.UTF_8);
System.out.println(org.apache.commons.codec.binary.Hex.encodeHexString(s.getBytes(Charsets.UTF_8)));

The output is:

130021000061f8f0001a
130021000061efbfbd

Complete code:

// Hex digits of the bytes under test
String[] arr = {"13", "00", "21", "00", "00", "61", "F8", "F0", "00", "1A"};
byte[] by = new byte[arr.length];

// Parse each two-digit hex string into one byte
for (int i = 0; i < arr.length; i++) {
    by[i] = (byte) Integer.parseInt(arr[i], 16);
}

// Hex dump of the original bytes
System.out.println(org.apache.commons.codec.binary.Hex.encodeHexString(by));

// Decode as UTF-8, re-encode, and dump again
String s = new String(by, Charsets.UTF_8);
System.out.println(org.apache.commons.codec.binary.Hex.encodeHexString(s.getBytes(Charsets.UTF_8)));

The problem here is that f8f0001a isn't a valid UTF-8 byte sequence.

First of all, the f8 lead byte denotes a 5-byte sequence (a form that current UTF-8, as defined by RFC 3629, no longer permits at all), and you've only got four bytes. Secondly, f8 could only be followed by a continuation byte of the form 8x, 9x, ax or bx, and f0 is not one.
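The bit patterns behind those two observations can be checked directly. The following is only a sketch of the byte-layout rules, not a full UTF-8 validator; the class and method names are made up for illustration, and the 5- and 6-byte lengths are reported purely for historical reference.

```java
class LeadByteDemo {
    // Sequence length announced by a lead byte under the original UTF-8 design.
    // Lengths 5 and 6 are historical only; RFC 3629 caps sequences at 4 bytes.
    static int historicalSequenceLength(int b) {
        b &= 0xFF;
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx (ASCII)
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        if ((b & 0xFC) == 0xF8) return 5; // 111110xx (historical)
        if ((b & 0xFE) == 0xFC) return 6; // 1111110x (historical)
        return -1;                        // continuation byte, FE or FF
    }

    // Continuation bytes have the form 10xxxxxx, i.e. 0x80 through 0xBF.
    static boolean isContinuation(int b) {
        return (b & 0xC0) == 0x80;
    }

    public static void main(String[] args) {
        System.out.println(historicalSequenceLength(0xF8)); // 5
        System.out.println(isContinuation(0xF0));           // false
        System.out.println(isContinuation(0x00));           // false
    }
}
```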

Therefore it gets replaced with the Unicode replacement character (U+FFFD), whose byte sequence in UTF-8 is efbfbd.
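To see the replacement happen without any third-party libraries, something like the following should do. It is a minimal sketch: it uses java.nio.charset.StandardCharsets in place of the Charsets helper from the question, the class and helper names are hypothetical, and how many replacement characters the decoder emits for the invalid bytes is left to the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ReplacementCharDemo {
    // Decode the bytes as UTF-8, then encode the resulting String back to UTF-8.
    static byte[] roundTrip(byte[] input) {
        String decoded = new String(input, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    // Lower-case hex dump, two digits per byte.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] original = {0x13, 0x00, 0x21, 0x00, 0x00, 0x61,
                (byte) 0xF8, (byte) 0xF0, 0x00, 0x1A};
        byte[] after = roundTrip(original);

        System.out.println(toHex(original));
        System.out.println(toHex(after));
        System.out.println(Arrays.equals(original, after)); // false

        // U+FFFD itself encodes to three bytes in UTF-8.
        System.out.println(toHex("\uFFFD".getBytes(StandardCharsets.UTF_8))); // efbfbd
    }
}
```

Bytes that do form valid UTF-8 (plain ASCII, for instance) survive the same round trip unchanged.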

And there is (rightly) no guarantee that converting an invalid byte sequence to a string and back will result in the same byte sequence. (Note that even two seemingly identical strings can be represented by different bytes in Unicode; see Unicode equivalence.)
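The Unicode equivalence point is easy to reproduce with java.text.Normalizer. This is just an illustrative sketch with hypothetical class and helper names: the letter "é" can be written either as the single code point U+00E9 or as "e" followed by the combining acute accent U+0301, and the two forms encode to different bytes.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.Arrays;

class EquivalenceDemo {
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Normalize to NFC, the composed canonical form.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String composed = "\u00E9";    // single code point
        String decomposed = "e\u0301"; // 'e' + combining acute accent

        // Both render the same, but the byte sequences differ.
        System.out.println(Arrays.equals(utf8(composed), utf8(decomposed))); // false
        System.out.println(utf8(composed).length);   // 2
        System.out.println(utf8(decomposed).length); // 3

        // After normalization the two strings compare equal.
        System.out.println(nfc(decomposed).equals(composed)); // true
    }
}
```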

The moral of the story is: if you want to represent bytes, don't convert them to characters, and if you want to represent text, don't use byte arrays.
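If arbitrary bytes really do have to travel as text (in a log line or a JSON field, say), a binary-to-text encoding such as hex or Base64 keeps the round trip lossless. Below is a small sketch using java.util.Base64 from the standard library (Java 8 and later); the wrapper class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Base64;

class BytesAsTextDemo {
    // Encode arbitrary bytes as a printable ASCII string.
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // Recover exactly the original bytes from the encoded text.
    static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] original = {0x13, 0x00, 0x21, 0x00, 0x00, 0x61,
                (byte) 0xF8, (byte) 0xF0, 0x00, 0x1A};

        String asText = encode(original);
        byte[] restored = decode(asText);

        System.out.println(asText);
        System.out.println(Arrays.equals(original, restored)); // true
    }
}
```

The Hex class already used in the question serves the same purpose; the point is that the text form is an encoding of the bytes, not a decoding of them as characters.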

My UTF-8 is a bit rusty :-), but the sequence F8 F0 is, IMHO, not a valid UTF-8 encoding.

Look at http://en.wikipedia.org/wiki/Utf-8#Description.

When you build the String from the array of bytes, the bytes are decoded.

Since the bytes in your code do not represent valid characters, the bytes that finally compose the String are not the same as the ones you passed as a parameter.

public String(byte[] bytes)

Constructs a new String by decoding the specified array of bytes using the platform's default charset. The length of the new String is a function of the charset, and hence may not be equal to the length of the byte array.

The behavior of this constructor when the given bytes are not valid in the default charset is unspecified. The CharsetDecoder class should be used when more control over the decoding process is required.
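As a rough illustration of what "more control" can look like, a CharsetDecoder can be configured to report malformed input instead of silently replacing it. The sketch below assumes UTF-8 is the intended charset; the class name and the isValidUtf8 helper are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecodeDemo {
    // Decode as UTF-8, throwing instead of substituting U+FFFD on bad input.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Convenience check built on top of decodeStrict.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            decodeStrict(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] bad = {0x13, 0x00, 0x21, 0x00, 0x00, 0x61,
                (byte) 0xF8, (byte) 0xF0, 0x00, 0x1A};
        byte[] good = "abc".getBytes(StandardCharsets.UTF_8);

        System.out.println(isValidUtf8(bad));  // false
        System.out.println(isValidUtf8(good)); // true
    }
}
```

With a check like this, invalid input can be rejected up front rather than discovered later as a mismatch between input and output.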
