String differs after encoding and decoding

I stumbled across some weird behaviour when encoding and decoding a string. Have a look at this example:

@Test
public void testEncoding() {
    String str = "\uDD71"; // {56689}
    byte[] utf16 = str.getBytes(StandardCharsets.UTF_16); // {-2, -1, -1, -3}
    String utf16String = new String(utf16, StandardCharsets.UTF_16); // {65533}
    assertEquals(str, utf16String);
}

I would have assumed this test would pass, but it does not. Could someone explain why the encoded and then decoded string is not equal to the original one?

U+DD71 is not a valid code point on its own: Unicode reserves the range U+D800..U+DFFF for UTF-16 surrogates, so these code points should never appear as character data by themselves. From the Unicode standard:

Isolated surrogate code points have no interpretation; consequently, no character code charts or names lists are provided for this range.
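
To see how that reserved range is meant to be used: a code point outside the BMP is stored in a Java String as a high/low surrogate pair, and only the pair has an interpretation. A minimal sketch in plain Java (the hex values in the comments follow from the UTF-16 encoding algorithm):

// U+1F600 lies outside the BMP, so Java represents it with two chars
// drawn from the reserved range: a high surrogate and a low surrogate.
char[] pair = Character.toChars(0x1F600);
System.out.println(Integer.toHexString(pair[0])); // d83d (high surrogate)
System.out.println(Integer.toHexString(pair[1])); // de00 (low surrogate)
// Neither char means anything alone; together they map back to the code point:
System.out.println(Integer.toHexString(Character.toCodePoint(pair[0], pair[1]))); // 1f600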

The same round trip does work for a code point that is valid on its own, though:

@Test
public void testEncoding() {
    String str = "\u0040"; // '@', an ordinary BMP character
    byte[] utf16 = str.getBytes(StandardCharsets.UTF_16);
    String utf16String = new String(utf16, StandardCharsets.UTF_16);
    assertEquals(str, utf16String);
}
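
A supplementary character written as a proper surrogate pair also survives the round trip. A quick sketch, assuming the same imports and JUnit setup as the snippets above (\uD83D\uDE00 is the pair for U+1F600):

@Test
public void testSurrogatePairRoundTrip() {
    String str = "\uD83D\uDE00"; // U+1F600 stored as a high/low surrogate pair
    byte[] utf16 = str.getBytes(StandardCharsets.UTF_16);
    String utf16String = new String(utf16, StandardCharsets.UTF_16);
    assertEquals(str, utf16String); // passes: the pair forms a valid code point
}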

So it's not your code that's at fault; the problem is that you're trying to use a code point that isn't valid on its own.
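
If you would rather get an error than a silent substitution, one option is to encode through a CharsetEncoder that reports bad input instead of replacing it. A sketch, assuming the same JUnit setup plus imports of CharBuffer, CharsetEncoder, CodingErrorAction and CharacterCodingException from java.nio / java.nio.charset:

@Test
public void testStrictEncoderRejectsLoneSurrogate() {
    CharsetEncoder encoder = StandardCharsets.UTF_16.newEncoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
        encoder.encode(CharBuffer.wrap("\uDD71"));
        fail("expected the unpaired surrogate to be rejected");
    } catch (CharacterCodingException expected) {
        // String.getBytes() quietly substitutes U+FFFD; a REPORT-configured
        // encoder throws instead (typically MalformedInputException), so the
        // bad input cannot slip through unnoticed.
    }
}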
