String differs after encoding and decoding

I stumbled across some weird behaviour when encoding and decoding a string. Have a look at this example:

@Test
public void testEncoding() {
    String str = "\uDD71"; // {56689}
    byte[] utf16 = str.getBytes(StandardCharsets.UTF_16); // {-2, -1, -1, -3}
    String utf16String = new String(utf16, StandardCharsets.UTF_16); // {65533}
    assertEquals(str, utf16String);
}

I would have assumed this test would pass, but it does not. Could someone explain why the encoded and then decoded string is not equal to the original one?

U+DD71 is not a valid code point on its own: Unicode reserves the range U+D800..U+DFFF for UTF-16 surrogates, so these code points should never appear as character data by themselves. From the Unicode standard:

Isolated surrogate code points have no interpretation; consequently, no character code charts or names lists are provided for this range.
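
To see how that reserved range is meant to be used: a code point outside the BMP is stored in a Java String as a high/low surrogate pair, and only the pair has an interpretation. A minimal sketch in plain Java (the hex values in the comments follow from the UTF-16 encoding algorithm):

// U+1F600 lies outside the BMP, so Java represents it with two chars
// drawn from the reserved range: a high surrogate and a low surrogate.
char[] pair = Character.toChars(0x1F600);
System.out.println(Integer.toHexString(pair[0])); // d83d (high surrogate)
System.out.println(Integer.toHexString(pair[1])); // de00 (low surrogate)
// Neither char means anything alone; together they map back to the code point:
System.out.println(Integer.toHexString(Character.toCodePoint(pair[0], pair[1]))); // 1f600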

The same round trip does work for a code point that is valid on its own, though:

@Test
public void testEncoding() {
    String str = "\u0040"; // '@', an ordinary BMP character
    byte[] utf16 = str.getBytes(StandardCharsets.UTF_16);
    String utf16String = new String(utf16, StandardCharsets.UTF_16);
    assertEquals(str, utf16String);
}
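
A supplementary character written as a proper surrogate pair also survives the round trip. A quick sketch, assuming the same imports and JUnit setup as the snippets above (\uD83D\uDE00 is the pair for U+1F600):

@Test
public void testSurrogatePairRoundTrip() {
    String str = "\uD83D\uDE00"; // U+1F600 stored as a high/low surrogate pair
    byte[] utf16 = str.getBytes(StandardCharsets.UTF_16);
    String utf16String = new String(utf16, StandardCharsets.UTF_16);
    assertEquals(str, utf16String); // passes: the pair forms a valid code point
}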

So it's not your code that's at fault; the problem is that you're trying to use a code point that isn't valid on its own.
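
If you would rather get an error than a silent substitution, one option is to encode through a CharsetEncoder that reports bad input instead of replacing it. A sketch, assuming the same JUnit setup plus imports of CharBuffer, CharsetEncoder, CodingErrorAction and CharacterCodingException from java.nio / java.nio.charset:

@Test
public void testStrictEncoderRejectsLoneSurrogate() {
    CharsetEncoder encoder = StandardCharsets.UTF_16.newEncoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
        encoder.encode(CharBuffer.wrap("\uDD71"));
        fail("expected the unpaired surrogate to be rejected");
    } catch (CharacterCodingException expected) {
        // String.getBytes() quietly substitutes U+FFFD; a REPORT-configured
        // encoder throws instead (typically MalformedInputException), so the
        // bad input cannot slip through unnoticed.
    }
}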
