Java byte[] to String and UTF-8

I am implementing Cipher Block Chaining for schoolwork, and the assignment asks for a method that takes a String and returns another String. At first I thought this was odd and that byte[] variables would be more appropriate, but I implemented the method anyway. Basically, here's the code:

static public String encode(String message) {
   byte[] dataMessage = message.getBytes();
   ByteArrayOutputStream out = new ByteArrayOutputStream();

   byte last = (byte) (Math.random() * 256);
   byte cur;
   out.write(last);

   for (byte b : dataMessage) {
      cur = (byte) (b^last);
      System.out.println("Encode '" + (char) b + "' = " + b + "^" + last + " > " + cur );
      out.write( cur );
      last = cur;
   }

   System.out.println("**ENCODED BYTES = " + Arrays.toString(out.toByteArray()));
   System.out.println("**ENCODED STR   = " + Arrays.toString(out.toString().getBytes()));

   return out.toString();
}

The decode method works similarly. Sometimes the method prints results like

Encode 'H' = 72^109 > 37
Encode 'e' = 101^37 > 64
Encode 'l' = 108^64 > 44
Encode 'l' = 108^44 > 64
Encode 'o' = 111^64 > 47
**ENCODED BYTES = [109, 37, 64, 44, 64, 47]
**ENCODED STR   = [109, 37, 64, 44, 64, 47]

But sometimes it prints things like

Encode 'H' = 72^-63 > -119
Encode 'e' = 101^-119 > -20
Encode 'l' = 108^-20 > -128
Encode 'l' = 108^-128 > -20
Encode 'o' = 111^-20 > -125
**ENCODED BYTES = [-63, -119, -20, -128, -20, -125]
**ENCODED STR   = [-17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67]

I presume this has something to do with UTF-8 (the system's default encoding), but I'm not familiar enough with it to figure out exactly why the given bytes would produce such a string.

You can't take an arbitrary sequence of bytes and assume it's a valid UTF-8 encoded string. So I suspect that the toString method, as documented, replaces malformed-input and unmappable-character sequences with the default replacement string for the platform's default character set. For UTF-8 that replacement is U+FFFD, which encodes to the bytes -17, -65, -67 that repeat in your second output.

You should therefore not turn purely binary data into a String like this. Use an encoding such as Hex or Base64 to convert your bytes to a printable string, and back again.

Apache commons-codec has a Base64 utility class.
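As a minimal sketch of that advice, the question's XOR chaining can stay on bytes and only become text at the boundary. This version uses the JDK's own `java.util.Base64` (Java 8 and later) instead of commons-codec, which is a substitution for illustration. The class name `ChainCodec`, the explicit UTF-8 charset and the `decode` method are assumptions as well, since the original decode method is not shown.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class ChainCodec {
    /** XOR-chains the message bytes and returns them as Base64 text. */
    static String encode(String message) {
        byte[] data = message.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[data.length + 1];

        // The first output byte is a random initial value, as in the question.
        byte last = (byte) (Math.random() * 256);
        out[0] = last;

        for (int i = 0; i < data.length; i++) {
            byte cur = (byte) (data[i] ^ last);
            out[i + 1] = cur;
            last = cur;
        }
        // Base64 maps every byte value to printable ASCII, so nothing is lost.
        return Base64.getEncoder().encodeToString(out);
    }

    /** Hypothetical inverse: Base64 text back to bytes, then undo the chain. */
    static String decode(String encoded) {
        byte[] in = Base64.getDecoder().decode(encoded);
        if (in.length == 0) {
            throw new IllegalArgumentException("missing initial byte");
        }
        byte[] data = new byte[in.length - 1];
        for (int i = 1; i < in.length; i++) {
            // Each plaintext byte is the XOR of two neighbouring encoded bytes.
            data[i - 1] = (byte) (in[i] ^ in[i - 1]);
        }
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String encoded = encode("Hello");
        System.out.println(encoded);          // differs per run (random first byte)
        System.out.println(decode(encoded));
    }
}
```

The method still takes a String and returns a String, as the assignment requires, but the returned text only ever contains Base64 characters, so it survives any charset conversion.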

This:

out.toString().getBytes()

is not doing what you expect. It takes the encrypted bytes and interprets them as if they were a UTF-8 encoded string. Then it converts the characters in that string back to bytes.

You can't just take arbitrary bytes (in this case, the encrypted data) and handle them as if they were UTF-8 encoded text.
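To see the loss directly, both byte sequences from the question can be pushed through the same decode-then-encode round trip. This is only a small sketch: the class and method names are made up for illustration, and UTF-8 is named explicitly rather than relying on the platform default.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class LossyRoundTrip {
    /** Decodes bytes as UTF-8 text, then encodes that text as UTF-8 again. */
    static byte[] roundTrip(byte[] raw) {
        String text = new String(raw, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The "good" run from the question: every byte is plain ASCII.
        byte[] ascii = {109, 37, 64, 44, 64, 47};
        // The "bad" run: these bytes do not form valid UTF-8.
        byte[] binary = {-63, -119, -20, -128, -20, -125};

        System.out.println(Arrays.toString(roundTrip(ascii)));
        System.out.println(Arrays.toString(roundTrip(binary)));

        // U+FFFD is the replacement character the decoder substitutes.
        System.out.println(Arrays.toString("\uFFFD".getBytes(StandardCharsets.UTF_8)));
    }
}
```

The ASCII sequence comes back unchanged, while the binary one does not. The last line prints [-17, -65, -67], the UTF-8 encoding of U+FFFD, which is the triple that repeats in the question's second output.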
