Wrong bytes from UTF-16 encoding
I have the character '😭'. Its Unicode code point is U+1F62D, which in binary is 11111011000101101. Now I want to convert this character to a byte array.
My steps:
1) The binary representation does not fit in 2 bytes, so I use 4 bytes:
XXXXXXXX XXXXXXX1 11110110 00101101
2) I replace every 'X' with '0':
00000000 00000001 11110110 00101101
3) Decimal equivalents (as signed bytes):
00000000 (0) 00000001 (1) 11110110 (-10) 00101101 (45)
This is my code:
@Test
public void testUtf16With4Bytes() throws Exception {
    assertThat(
        new String(
            new byte[]{0, 1, -10, 45},
            StandardCharsets.UTF_16BE
        ),
        is("😭")
    );
}
This is the output:
java.lang.AssertionError:
Expected: is "😭"
but: was ""
What did I miss?
You missed that UTF-16 stores some characters as surrogate pairs:
In UTF-16, characters in the ranges U+0000 to U+D7FF and U+E000 to U+FFFF are stored as a single 16-bit unit. Non-BMP characters (range U+10000 to U+10FFFF) are stored as "surrogate pairs", two 16-bit units: a high surrogate (in range U+D800 to U+DBFF) followed by a low surrogate (in range U+DC00 to U+DFFF). A lone surrogate character is invalid in UTF-16; surrogate characters are always written as pairs (high followed by low).
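The arithmetic behind that rule can be sketched as follows. This is a minimal illustration, not library code: the class and method names are made up for this example. The idea is to subtract 0x10000 from the code point, then put the top 10 bits of the resulting 20-bit value into the high surrogate and the bottom 10 bits into the low surrogate.

```java
import java.util.Arrays;

// Hypothetical helper class, named only for this illustration.
final class SurrogatePairSketch {

    // Splits a supplementary code point (U+10000..U+10FFFF) into its
    // UTF-16 high and low surrogate code units.
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                // 20-bit value
        char high = (char) (0xD800 + (offset >>> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F62D);
        // Prints: U+D83D U+DE2D
        System.out.printf("U+%04X U+%04X%n", (int) pair[0], (int) pair[1]);
        // Cross-check against the JDK's own conversion; prints: true
        System.out.println(Arrays.equals(pair, Character.toChars(0x1F62D)));
    }
}
```

In real code there is no need to do this by hand: `Character.toChars(int)` returns the same pair.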
The 😭 character is U+1F62D, so it falls into the U+10000 to U+10FFFF range. It is represented by the surrogate pair U+D83D U+DE2D. In big-endian order those are the bytes D8 3D DE 2D, so as a byte[] it would be [-40, 61, -34, 45].
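A quick way to check these bytes is to let the JDK do the encoding. The sketch below uses an illustrative class name and builds the string from its code point, so it does not depend on the source file's encoding. It also shows that the four bytes from the question are what the character looks like in UTF-32BE, which is why decoding them as UTF-16BE fails. One assumption to flag: `UTF-32BE` is not among the charsets every Java platform is required to support, so `Charset.forName("UTF-32BE")` may throw on an unusual runtime.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, named only for this illustration.
final class Utf16BytesSketch {

    // Encodes a string as big-endian UTF-16 without a byte order mark.
    static byte[] utf16be(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        String crying = new String(Character.toChars(0x1F62D)); // the 😭 character

        // Prints: [-40, 61, -34, 45]
        System.out.println(Arrays.toString(utf16be(crying)));

        // Decoding the surrogate-pair bytes gives back the same string; prints: true
        String decoded = new String(new byte[] {-40, 61, -34, 45}, StandardCharsets.UTF_16BE);
        System.out.println(decoded.equals(crying));

        // The bytes from the question are the UTF-32BE form of the code point.
        // Assumption: this runtime provides a UTF-32BE charset.
        String fromUtf32 = new String(new byte[] {0, 1, -10, 45}, Charset.forName("UTF-32BE"));
        System.out.println(fromUtf32.equals(crying)); // prints: true
    }
}
```

So the test in the question should pass if the byte array is changed to `{-40, 61, -34, 45}` and the charset stays `StandardCharsets.UTF_16BE`.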