How to convert surrogate pair characters?

Question

I have a web service that takes an XML string as input, and it is primarily UTF-8. However, surrogate pairs can get mixed into the string, and those particular characters become unreadable when my application processes them.

I am reading in an XML file like so (I have a feeling this part messes things up):

String xmlFile = new String(Files.readAllBytes(Paths.get("test.xml")),"UTF-8");

I know that I can detect this with the following check when I loop through every character in the string:

Character.isSurrogatePair(high, low)
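
A minimal sketch of that loop, assuming the input has already been decoded into a Java String. The class name SurrogateScan and the method findSurrogatePairs are made up for illustration, and U+20000 is just an arbitrary supplementary code point:

```java
import java.util.ArrayList;
import java.util.List;

class SurrogateScan {
    // Returns the char index of every high surrogate that starts a valid pair.
    static List<Integer> findSurrogatePairs(String s) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i + 1 < s.length(); i++) {
            if (Character.isSurrogatePair(s.charAt(i), s.charAt(i + 1))) {
                positions.add(i);
                i++; // skip the low surrogate of the pair just found
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        // Hypothetical sample: 'a', one supplementary code point, 'b'.
        String sample = "a" + new String(Character.toChars(0x20000)) + "b";
        System.out.println(findSurrogatePairs(sample));                 // [1]
        System.out.println(sample.length());                            // 4 chars
        System.out.println(sample.codePointCount(0, sample.length()));  // 3 code points
    }
}
```

A well-formed pair found this way is one valid code point, not a corrupted character.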

What I want to know is whether there is a way to convert a surrogate pair to something that can be recognized properly in UTF-8. For example, "" is recognizable in UTF-8 since it has 3 bytes, but "𠃮" has 4 bytes (a surrogate pair), even though the graphical display is identical.

Answer 1

Your code is 100% fine (if the encoding is indeed UTF-8). A surrogate pair is how UTF-16 encodes a single Unicode code point above U+FFFF as two chars (2x2 bytes). UTF-8 covers the same code points with a longer multibyte sequence (up to 4 bytes since RFC 3629; the original design allowed up to 6).
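
To make the relationship concrete, here is a small sketch that takes one supplementary code point and prints both its UTF-16 and its UTF-8 form. The class name SurrogateToUtf8 and its helpers are invented for this example, and U+20000 is an arbitrary choice:

```java
import java.nio.charset.StandardCharsets;

class SurrogateToUtf8 {
    // Encodes a single code point as UTF-8 bytes.
    static byte[] toUtf8(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
    }

    // Formats bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x20000));
        // In memory (UTF-16): two chars forming a surrogate pair.
        System.out.printf("%04X %04X%n", (int) s.charAt(0), (int) s.charAt(1)); // D840 DC00
        // On the wire (UTF-8): one 4-byte sequence, no surrogates involved.
        byte[] utf8 = toUtf8(0x20000);
        System.out.println(hex(utf8)); // F0 A0 80 80
        // Decoding the 4 bytes gives back the same two chars.
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(s)); // true
    }
}
```

So getBytes(StandardCharsets.UTF_8) and new String(bytes, StandardCharsets.UTF_8) already do the conversion, and nothing needs to be done by hand.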

Unicode itself just numbers code points (symbols). Those numbers are then encoded with UTF-nn in such a way that errors like searching the bytes for / and getting a false match inside another character cannot happen. UTF-8 uses the high bits of each byte for this, and UTF-16 does a similar trick with "surrogate pairs." Unicode and UTF are a solid design.
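
A quick way to see the high-bit property for yourself: every byte of a multibyte UTF-8 sequence has its top bit set, so none of them can be mistaken for an ASCII byte such as /. The class Utf8HighBits and the sample strings below are hypothetical, purely for illustration:

```java
import java.nio.charset.StandardCharsets;

class Utf8HighBits {
    // True if any byte of the UTF-8 encoding of s lies in the ASCII range 0x00-0x7F.
    static boolean containsAsciiByte(String s) {
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if ((b & 0x80) == 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A 2-byte, a 3-byte and a 4-byte character, with no ASCII in between.
        String multibyte = "\u00E9\u4E2D" + new String(Character.toChars(0x20000));
        System.out.println(containsAsciiByte(multibyte)); // false
        System.out.println(containsAsciiByte("a/b"));     // true
    }
}
```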

Now Unicode did grow over time, and the standards expanded likewise.

So running with Java 6, you might not have the same Unicode coverage (range) as with later versions. Likewise, old non-Java programs and fonts might have their blind spots.

Most likely something in the data is fishy. Reading byte blocks and converting every block to a String separately would cause invalid characters at block boundaries, because a multibyte sequence can be split across two blocks.
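
The block-boundary problem can be reproduced with a sketch like the following. Everything here (the class ChunkBoundary, both method names, the block size of 4) is an assumption for demonstration and not taken from your code:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ChunkBoundary {
    // Broken: decodes each fixed-size block on its own, so a multibyte
    // sequence that straddles two blocks is decoded as garbage (U+FFFD).
    static String decodePerBlock(byte[] data, int blockSize) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += blockSize) {
            int len = Math.min(blockSize, data.length - off);
            sb.append(new String(data, off, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    // Safer: a Reader keeps incomplete sequences until the remaining bytes arrive.
    static String decodeWithReader(byte[] data, int bufferSize) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            char[] buf = new char[bufferSize];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 2 ASCII bytes + one 4-byte character + 2 ASCII bytes = 8 bytes,
        // so a block size of 4 cuts the 4-byte character in half.
        String original = "ab" + new String(Character.toChars(0x20000)) + "cd";
        byte[] data = original.getBytes(StandardCharsets.UTF_8);
        System.out.println(decodePerBlock(data, 4).equals(original));   // false
        System.out.println(decodeWithReader(data, 4).equals(original)); // true
    }
}
```

If your application decodes the input in pieces anywhere, that is the first place I would look.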

Answer 1
1 2017-10-13 09:57:19
