
What happens under the hood when bytes are converted to a String in Java?

I have a problem when trying to convert bytes to a String in Java, with code like:

byte[] bytes = {1, 2, -3};

byte[] transferred = new String(bytes, Charsets.UTF_8).getBytes(Charsets.UTF_8);

The original bytes are not the same as the transferred bytes. They are, respectively:

[1, 2, -3]
[1, 2, -17, -65, -67]

At first I thought this was due to how UTF-8 maps the negative value -3, so I changed it to -32. But the transferred array remains the same!

[1, 2, -32]
[1, 2, -17, -65, -67] 

So I would really like to know exactly what happens when I call new String(bytes) :)

Not all sequences of bytes are valid in UTF-8.

UTF-8 is a smart scheme with a variable number of bytes per code point, where the form of the first byte indicates how many other bytes follow for the same code point.

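Refer to this table of UTF-8 byte patterns:

Bytes  First byte  Following bytes
1      0xxxxxxx    (none)
2      110xxxxx    10xxxxxx
3      1110xxxx    10xxxxxx 10xxxxxx
4      11110xxx    10xxxxxx 10xxxxxx 10xxxxxx
5      111110xx    four times 10xxxxxx (original design only, now illegal)
6      1111110x    five times 10xxxxxx (original design only, now illegal)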
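The rule that the form of the first byte announces the length of the sequence can be sketched in a few lines. This is only an illustration under the original 1-to-6-byte design that the answer refers to; the class and method names are made up for this example, and it is not how the JDK decoder is written.

```java
class Utf8LeadByte {
    /**
     * Returns the sequence length announced by a lead byte under the original
     * 1-to-6-byte UTF-8 design, or 0 if the byte cannot start a sequence
     * (a continuation byte 10xxxxxx, or 0xFE / 0xFF).
     */
    static int announcedLength(byte b) {
        int u = b & 0xFF;          // view the signed Java byte as 0..255
        if (u < 0x80) return 1;    // 0xxxxxxx: stands alone
        if (u < 0xC0) return 0;    // 10xxxxxx: continuation byte
        if (u < 0xE0) return 2;    // 110xxxxx
        if (u < 0xF0) return 3;    // 1110xxxx
        if (u < 0xF8) return 4;    // 11110xxx
        if (u < 0xFC) return 5;    // 111110xx (illegal in current UTF-8)
        if (u < 0xFE) return 6;    // 1111110x (illegal in current UTF-8)
        return 0;                  // 0xFE and 0xFF never appear in UTF-8
    }

    public static void main(String[] args) {
        for (byte b : new byte[] {1, 2, -3, -32}) {
            System.out.printf("%4d = 0x%02X announces %d byte(s)%n",
                    b, b & 0xFF, announcedLength(b));
        }
    }
}
```

For the bytes in the question, 1 and 2 announce a single byte, -3 (0xFD) announces six, and -32 (0xE0) announces three, so both negative values promise continuation bytes that the array does not contain.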

Now let's see how it applies to your {1, 2, -3}:

Bytes 1 (hex 0x01, binary 00000001) and 2 (hex 0x02, binary 00000010) stand alone, no problem.

Byte -3 (hex 0xFD, binary 11111101) is the start byte of a 6-byte sequence (which is actually illegal in the current UTF-8 standard), but your byte array does not contain such a sequence.

Your UTF-8 is invalid. The Java UTF-8 decoder replaces the invalid byte -3 with the Unicode code point U+FFFD REPLACEMENT CHARACTER. In UTF-8, code point U+FFFD is hex 0xEF 0xBF 0xBD (binary 11101111 10111111 10111101), which Java shows as -17, -65, -67.
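The replacement can be observed directly by inspecting the decoded String before encoding it again. The sketch below assumes java.nio.charset.StandardCharsets (Java 7 and later) in place of the Charsets class used in the question; the class and method names are only for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ReplacementRoundTrip {
    /** Decodes the bytes as UTF-8 and encodes the resulting String again. */
    static byte[] roundTrip(byte[] bytes) {
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = {1, 2, -3};
        String decoded = new String(bytes, StandardCharsets.UTF_8);

        // The invalid byte has already become U+FFFD inside the String,
        // so the original value -3 is lost before getBytes is ever called.
        System.out.printf("third char: U+%04X%n", (int) decoded.charAt(2));
        System.out.println(Arrays.toString(roundTrip(bytes)));
    }
}
```

The information is lost at decoding time: once the String holds U+FFFD, nothing records whether the input byte was -3 or -32, which is why both arrays in the question come back identical.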

In Java, byte is signed, so byte values above 127 (0x7F) appear as negative numbers. The ones you used (-3 = 0xFD, -32 = 0xE0) are not valid UTF-8 on their own: 0xFD is never legal, and 0xE0 starts a 3-byte sequence whose continuation bytes are missing. Both are therefore decoded to the Unicode code point U+FFFD REPLACEMENT CHARACTER, which is encoded back to UTF-8 as 0xEF = -17, 0xBF = -65, 0xBD = -67.
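The correspondence between the negative values and the hex values can be checked by masking each byte with 0xFF. This is a small sketch with a made-up helper name, not part of any library.

```java
class SignedBytes {
    /** Returns the value of the byte read as an unsigned number (0..255). */
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        // The two bytes from the question, then the three bytes of U+FFFD.
        for (byte b : new byte[] {-3, -32, -17, -65, -67}) {
            System.out.printf("%4d = 0x%02X%n", b, toUnsigned(b));
        }
        // Casting goes the other way: 0xFD does not fit in a signed byte,
        // so it wraps around to a negative value.
        System.out.println((byte) 0xFD);
    }
}
```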

You cannot expect arbitrary byte values to be correctly interpreted as UTF-8 text.
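If the goal is only to carry arbitrary bytes through a String and get the same bytes back, a byte-preserving mapping is needed instead of UTF-8. The sketch below shows two common options, ISO-8859-1 (every byte value maps to exactly one char) and Base64; neither is suggested in the original answers, and the class and method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

class LosslessBytes {
    /** ISO-8859-1 maps each of the 256 byte values to one char and back. */
    static byte[] viaLatin1(byte[] bytes) {
        String text = new String(bytes, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Base64 produces printable ASCII text that decodes to the same bytes. */
    static byte[] viaBase64(byte[] bytes) {
        String text = Base64.getEncoder().encodeToString(bytes);
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] bytes = {1, 2, -3};
        System.out.println(Arrays.toString(viaLatin1(bytes)));
        System.out.println(Arrays.toString(viaBase64(bytes)));
    }
}
```

Base64 is usually the safer choice when the text has to pass through systems that may alter control characters; the ISO-8859-1 variant keeps one char per byte but the resulting String is not meaningful text.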

There is a line in the documentation of the constructor:

This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement string.
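When silent replacement is not wanted, a CharsetDecoder can be asked to report malformed input instead. The following is a minimal sketch using the standard java.nio.charset API; the class and method names are made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8 {
    /** Decodes UTF-8 and throws instead of inserting U+FFFD on bad input. */
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    /** Convenience wrapper: true if the bytes decode without any error. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            decodeStrict(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8(new byte[] {1, 2}));
        System.out.println(isValidUtf8(new byte[] {1, 2, -3}));
        System.out.println(isValidUtf8(new byte[] {1, 2, -32}));
    }
}
```

With this approach the two arrays from the question are rejected with an exception rather than being turned into the same replacement character.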

This is definitely the culprit here, as -3 is invalid in UTF-8. By the way, if you are really interested, you can always download the source of rt.jar and step into it with a debugger.

The encoded values you are getting, [-17, -65, -67], correspond to Unicode code point 0xFFFD. If you look up that code point, the Unicode specification tells you that 0xFFFD is "used to replace an incoming character whose value is unknown or unrepresentable in Unicode." And as others have pointed out, -3 without any follow-up code units is broken UTF-8, so this character is appropriate.
