
Behavior (safety) of Java String when converting from invalid (for the charset) byte[]?

Is it 100% safe (exception / error free) to convert a byte[] that includes random binary data to a String via the constructor:

new String(bytes);
// -- or --
new String(bytes,"UTF-8");  // Or other charset

My concern is whether invalid UTF-8 bytes will cause an exception or other failure instead of just a possibly partially garbled message.

I have tried some known bad byte values, and they appear to work as expected. E.g.:

byte[] bytes = new byte[] {'a','b','c',(byte)0xfe,(byte)0xfe,(byte)0xff,(byte)0xff,'d','e','f'};

String test = new String(bytes,"UTF-8");

System.out.println(test);

Prints "abc????def".

My concern is whether other combinations could fail in unexpected ways, since I cannot test every invalid combination.

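This is covered in the docs for the String(byte[], Charset) constructor (the Javadoc for the charset-name overload used in the question leaves the behavior on invalid bytes unspecified, though in practice it also substitutes the replacement character):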

This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement string
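As a rough illustration of that guarantee, the sketch below decodes the question's byte array with the Charset overload and counts the replacement characters. The class and method names are made up for this example. For UTF-8 the replacement is U+FFFD, which many consoles render as "?".

```java
import java.nio.charset.StandardCharsets;

public class LenientDecodeDemo {
    // Hypothetical helper: decodes with the Charset overload, which
    // substitutes the replacement character instead of throwing.
    static String decodeLeniently(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Hypothetical helper: counts U+FFFD replacement characters in the text.
    static long countReplacements(String s) {
        return s.chars().filter(c -> c == '\uFFFD').count();
    }

    public static void main(String[] args) {
        byte[] bytes = {'a', 'b', 'c', (byte) 0xfe, (byte) 0xfe,
                        (byte) 0xff, (byte) 0xff, 'd', 'e', 'f'};
        String text = decodeLeniently(bytes);
        System.out.println(text.length());            // prints 10
        System.out.println(countReplacements(text));  // prints 4
    }
}
```

Note that a count like this cannot tell a substituted U+FFFD from one that was legitimately encoded in the input, so it is only a hint that something was malformed.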

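One thing that can fail is the charset lookup itself: the String(byte[], String) constructor throws the checked UnsupportedEncodingException if the named charset is not supported. Every Java platform is required to support UTF-8, so this only matters for other charset names, and passing a Charset such as StandardCharsets.UTF_8 avoids the checked exception entirely.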
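A minimal sketch of that failure mode, and one way to sidestep it by resolving the charset once up front. The class name, method names, and the charset name "no-such-charset" are all invented for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetLookupDemo {
    // Hypothetical helper: reports whether the charset-name overload
    // rejects the given name.
    static boolean nameOverloadThrows(byte[] bytes, String charsetName) {
        try {
            new String(bytes, charsetName);
            return false;
        } catch (UnsupportedEncodingException e) {
            return true;
        }
    }

    // Hypothetical helper: resolves a charset name once, falling back if this
    // JVM does not support it. Charset.isSupported still throws
    // IllegalCharsetNameException for syntactically illegal names.
    static Charset resolveOrDefault(String name, Charset fallback) {
        return Charset.isSupported(name) ? Charset.forName(name) : fallback;
    }

    public static void main(String[] args) {
        byte[] bytes = {'a', 'b', 'c'};
        System.out.println(nameOverloadThrows(bytes, "UTF-8"));            // prints false
        System.out.println(nameOverloadThrows(bytes, "no-such-charset"));  // prints true
        Charset cs = resolveOrDefault("no-such-charset", StandardCharsets.UTF_8);
        System.out.println(new String(bytes, cs));                         // prints abc
    }
}
```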

If you want to twiddle with decoding behavior on bad inputs, use something like

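StandardCharsets.UTF_8
  .newDecoder()
  .onMalformedInput(CodingErrorAction.REPORT)        // throw on malformed bytes
  .onUnmappableCharacter(CodingErrorAction.REPLACE)  // substitute unmappable characters
  .replaceWith(replacementString)                    // the string that REPLACE substitutes
  .decode(ByteBuffer.wrap(byteArray))                // throws CharacterCodingException
  .toString();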

which lets you twiddle all the various knobs involved.
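For completeness, here is a self-contained sketch of a strict decoder, assuming you would rather reject bad input than repair it. The class and method names are hypothetical, and both error actions are set to REPORT here so that any invalid input surfaces as a CharacterCodingException.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecodeDemo {
    // Hypothetical helper: decodes UTF-8 and throws instead of substituting
    // a replacement character when the input is invalid.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8
            .newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .decode(ByteBuffer.wrap(bytes))
            .toString();
    }

    // Hypothetical helper: true if the bytes decode cleanly as UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            decodeStrict(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = {'a', 'b', 'c'};
        byte[] bad = {'a', 'b', 'c', (byte) 0xfe, (byte) 0xff, 'd'};
        System.out.println(isValidUtf8(good));  // prints true
        System.out.println(isValidUtf8(bad));   // prints false
    }
}
```

A decoder instance is stateful and not safe for concurrent use, so this sketch creates a fresh one per call rather than caching it.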
