
How to convert surrogate pair characters?

I have a web service that takes an XML string as input, primarily encoded in UTF-8. However, surrogate pairs can get mixed into the string, and those particular characters become unreadable when my application processes them.

I am reading in an XML file like so (I have a feeling this part messes things up):

String xmlFile = new String(Files.readAllBytes(Paths.get("test.xml")),"UTF-8");

I know that I can detect this when I loop through every character in the string:

Character.isSurrogatePair(high, low)
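A minimal sketch of such a detection loop, assuming the XML is already held in a Java String. The class name `SurrogateScan` and the method `countSurrogates` are made up for illustration. The sketch also counts unpaired surrogates, since a well-formed pair is a valid character and only a lone surrogate is actually malformed.

```java
public class SurrogateScan {
    /** Returns {well-formed surrogate pairs, unpaired surrogates} found in s. */
    static int[] countSurrogates(String s) {
        int pairs = 0;
        int unpaired = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (i + 1 < s.length() && Character.isSurrogatePair(c, s.charAt(i + 1))) {
                pairs++;
                i++; // skip the low surrogate that was just consumed
            } else if (Character.isSurrogate(c)) {
                unpaired++; // a surrogate without a valid partner
            }
        }
        return new int[] {pairs, unpaired};
    }

    public static void main(String[] args) {
        // U+20000 written as a surrogate pair, followed later by a lone high surrogate
        String text = "a\uD840\uDC00b\uD840";
        int[] result = countSurrogates(text);
        System.out.println("pairs=" + result[0] + ", unpaired=" + result[1]);
    }
}
```

If the second count is zero, the string contains nothing that UTF-8 cannot represent.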

What I want to know is whether there is a way to convert a surrogate pair to something that is recognized properly in UTF-8. For example, "" is recognizable in UTF-8 since it has 3 bytes, but "𠃮" has 4 bytes (a surrogate pair), even though the graphical display is identical.

Your code is 100% fine (if the encoding is indeed UTF-8). A surrogate pair is the way UTF-16 encodes a Unicode code point above U+FFFF as two chars (2 x 2 bytes). UTF-8 covers the same code point with a longer multibyte sequence of 4 bytes (the original UTF-8 design allowed up to 6 bytes, but since RFC 3629 the limit is 4).
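To make the relation between the two encodings concrete, here is a small sketch. It uses U+20000 as an arbitrary supplementary code point chosen for illustration (it is not one of the characters from the question), and the class name `SurrogateToUtf8` and helper `utf8Hex` are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class SurrogateToUtf8 {
    /** Encodes a String as UTF-8 and formats the resulting bytes as hex. */
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\uD840\uDC00"; // U+20000 written as its UTF-16 surrogate pair
        System.out.println(s.length());                      // 2 (UTF-16 chars)
        System.out.println(s.codePointCount(0, s.length())); // 1 (code point)
        System.out.printf("U+%X%n", s.codePointAt(0));       // U+20000
        System.out.println(utf8Hex(s));                      // F0 A0 80 80
    }
}
```

No explicit conversion step is needed: `getBytes(StandardCharsets.UTF_8)` turns a well-formed surrogate pair into a single 4-byte UTF-8 sequence on its own.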

Unicode itself just numbers code points (symbols). Those numbers are then encoded with UTF-nn in such a way that errors cannot happen, such as searching the bytes for / and falsely finding it inside another character. UTF-8 uses the high bits for this, and UTF-16 does a similar trick with "surrogate pairs." Unicode and UTF are a solid design.
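The "high bits" property can be checked directly. The following is only a sketch with made-up names (`HighBitCheck`, `allHighBitsSet`, `utf8ContainsByte`), and the sample characters are arbitrary: every byte of a multibyte UTF-8 sequence is 0x80 or above, so an ASCII byte such as `/` (0x2F) can never appear inside one.

```java
import java.nio.charset.StandardCharsets;

public class HighBitCheck {
    /** True if every byte of the UTF-8 encoding of s has its high bit set. */
    static boolean allHighBitsSet(String s) {
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if ((b & 0x80) == 0) {
                return false;
            }
        }
        return true;
    }

    /** True if the UTF-8 encoding of s contains the given ASCII byte. */
    static boolean utf8ContainsByte(String s, char ascii) {
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if ((b & 0xFF) == ascii) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // One 3-byte character (U+4E38) and one 4-byte character (U+20000)
        String nonAscii = "\u4E38\uD840\uDC00";
        System.out.println(allHighBitsSet(nonAscii));        // true
        System.out.println(utf8ContainsByte(nonAscii, '/')); // false
        System.out.println(utf8ContainsByte("a/b", '/'));    // true
    }
}
```

UTF-16 gets the same guarantee at the `char` level: both halves of a surrogate pair lie in the reserved range U+D800 to U+DFFF, so neither half can be mistaken for an ordinary character.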

Now Unicode did grow over time, and the standards expanded likewise.

So when running with Java 6 you might not have the same Unicode range as later versions. Likewise, old non-Java programs and fonts might have their gaps.

Most likely something in the data is fishy. Reading byte blocks and converting every block to a String would cause invalid characters at block boundaries, whenever a multibyte sequence is split across two blocks.
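The block-boundary problem can be reproduced with a short sketch. The names `BlockBoundary`, `decodePerBlock` and `decodeWhole` are hypothetical, and the split position is chosen deliberately to fall inside a 4-byte sequence.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BlockBoundary {
    /** Decodes two byte blocks separately and concatenates them (the buggy pattern). */
    static String decodePerBlock(byte[] data, int split) {
        String first = new String(Arrays.copyOfRange(data, 0, split), StandardCharsets.UTF_8);
        String second = new String(Arrays.copyOfRange(data, split, data.length), StandardCharsets.UTF_8);
        return first + second;
    }

    /** Decodes all bytes in one go (the safe pattern). */
    static String decodeWhole(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "x\uD840\uDC00y"; // 'x', U+20000, 'y'
        byte[] data = original.getBytes(StandardCharsets.UTF_8); // 78 F0 A0 80 80 79

        // Splitting at index 3 cuts the 4-byte sequence F0 A0 80 80 in half.
        String broken = decodePerBlock(data, 3);
        System.out.println(broken.equals(original));            // false
        System.out.println(broken.contains("\uFFFD"));          // true: replacement characters
        System.out.println(decodeWhole(data).equals(original)); // true
    }
}
```

When the data really has to be read in chunks, wrapping the stream in an `InputStreamReader` with an explicit UTF-8 charset avoids the problem, because the reader carries incomplete sequences over to the next read.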
