
Java String to byteArray conversion issue

I am trying to convert a byte array to a String and back, but the input and output do not match. Am I doing something wrong?

System.out.println(org.apache.commons.codec.binary.Hex.encodeHexString(by));
String s = new String(by, Charsets.UTF_8);
System.out.println(org.apache.commons.codec.binary.Hex.encodeHexString(s.getBytes(Charsets.UTF_8)));

The output is:

130021000061f8f0001a
130021000061efbfbd

Complete code:

String[] arr = {"13", "00", "21", "00", "00", "61", "F8", "F0", "00", "1A"};        
byte[] by = new byte[arr.length];

for (int i = 0; i < arr.length; i++) {
    by[i] = (byte)(Integer.parseInt(arr[i],16) & 0xff); 
}

System.out.println(org.apache.commons.codec.binary.Hex.encodeHexString(by));

String s = new String(by, Charsets.UTF_8);
System.out.println(org.apache.commons.codec.binary.Hex.encodeHexString(s.getBytes(Charsets.UTF_8)));

The problem here is that f8f0001a isn't a valid UTF-8 byte sequence.

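First of all, the f8 lead byte denoted a 5-byte sequence in the original UTF-8 design, and you've only got four bytes; current UTF-8 (RFC 3629) stops at 4-byte sequences, so f8 is never valid at all. Secondly, a lead byte can only be followed by continuation bytes of the form 8x, 9x, ax or bx, and f0 is not one of them.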

Therefore it gets replaced with the Unicode replacement character (U+FFFD), whose byte sequence in UTF-8 is efbfbd.
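To make the replacement visible, here is a minimal sketch using only the JDK (so it assumes Java 7+ for StandardCharsets instead of the Charsets and Hex helpers from the question). The class name ReplacementCharDemo and the toHex helper are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementCharDemo {
    // Hypothetical helper: formats bytes as lowercase hex without external libraries.
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+FFFD encoded as UTF-8 is the three bytes ef bf bd.
        System.out.println(toHex("\uFFFD".getBytes(StandardCharsets.UTF_8))); // prints efbfbd

        // 0xF8 cannot start a well-formed UTF-8 sequence, so decoding substitutes U+FFFD.
        String decoded = new String(new byte[] {(byte) 0xF8}, StandardCharsets.UTF_8);
        System.out.println(decoded.contains("\uFFFD")); // prints true
    }
}
```

How many replacement characters a longer invalid run produces can differ between decoder implementations, so the sketch only checks a single bad byte.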

And there (rightly) is no guarantee that converting an invalid byte sequence to a string and back will give you the same byte sequence. (Note that even two seemingly identical strings can be represented by different bytes in Unicode; see Unicode equivalence.)

The moral of the story is: if you want to represent bytes, don't convert them to characters, and if you want to represent text, don't use byte arrays.
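If the bytes do have to travel as text, a binary-to-text encoding such as hex or Base64 round-trips losslessly. A rough sketch, assuming Java 8+ for java.util.Base64; the class name BytesAsTextDemo and its method names are invented for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

public class BytesAsTextDemo {
    // Arbitrary bytes -> printable ASCII text.
    public static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // Printable text -> the exact original bytes.
    public static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }

    // True only if decoding as UTF-8 and re-encoding reproduces the input bytes.
    public static boolean utf8RoundTrips(byte[] raw) {
        String s = new String(raw, StandardCharsets.UTF_8);
        return Arrays.equals(raw, s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // The byte values from the question.
        byte[] by = {0x13, 0x00, 0x21, 0x00, 0x00, 0x61,
                     (byte) 0xF8, (byte) 0xF0, 0x00, 0x1A};

        System.out.println(utf8RoundTrips(by));                    // prints false
        System.out.println(Arrays.equals(by, decode(encode(by)))); // prints true
    }
}
```

The Hex class already used in the question would serve the same purpose; Base64 is chosen here only because it ships with the JDK.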

My UTF-8 is a bit rusty :-), but the sequence F8 F0 is imho not a valid UTF-8 encoding.

Look at http://en.wikipedia.org/wiki/Utf-8#Description .

When you build the String from the array of bytes, the bytes are decoded.

Since the bytes in your code do not represent valid characters, the bytes that finally make up the String are not the same as the ones you passed in.

public String(byte[] bytes)

Constructs a new String by decoding the specified array of bytes using the platform's default charset. The length of the new String is a function of the charset, and hence may not be equal to the length of the byte array.

The behavior of this constructor when the given bytes are not valid in the default charset is unspecified. The CharsetDecoder class should be used when more control over the decoding process is required.
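As a sketch of what "more control" can look like: a CharsetDecoder configured with CodingErrorAction.REPORT throws on malformed input instead of silently substituting a replacement character. The class name StrictDecodeDemo and the method decodeStrict are hypothetical; the java.nio.charset calls are standard JDK API (Java 7+ for StandardCharsets).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecodeDemo {
    // Decodes bytes as UTF-8, failing loudly on anything that is not well-formed.
    public static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        // The byte values from the question.
        byte[] by = {0x13, 0x00, 0x21, 0x00, 0x00, 0x61,
                     (byte) 0xF8, (byte) 0xF0, 0x00, 0x1A};
        try {
            System.out.println(decodeStrict(by));
        } catch (CharacterCodingException e) {
            // Reached for the question's input, because 0xF8 is malformed UTF-8.
            System.out.println("Not valid UTF-8: " + e);
        }
    }
}
```

Checking up front like this lets the caller decide what to do with bad input rather than discovering the corruption after the bytes have already been altered.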
