
Handling Unicode surrogate values in Java strings

Consider the following code:

byte aBytes[] = { (byte)0xff,0x01,0,0,
                  (byte)0xd9,(byte)0x65,
                  (byte)0x03,(byte)0x04, (byte)0x05, (byte)0x06, (byte)0x07,
                  (byte)0x17,(byte)0x33, (byte)0x74, (byte)0x6f,
                   0, 1, 2, 3, 4, 5,
                   0 };
String sCompressedBytes = new String(aBytes, "UTF-16");
for (int i=0; i<sCompressedBytes.length(); i++) {
    System.out.println(Integer.toHexString(sCompressedBytes.codePointAt(i)));
}

It produces the following incorrect output:

ff01, 0, fffd, 506, 717, 3374, 6f00, 102, 304, 500.

However, if the 0xd9 in the input data is changed to 0x9d, then the following correct output is obtained:

ff01, 0, 9d65, 304, 506, 717, 3374, 6f00, 102, 304, 500.

I realize that this behavior occurs because the byte 0xd9 is a high-surrogate Unicode marker.

Question: Is there a way to feed, identify and extract surrogate bytes (0xd800 to 0xdfff) in a Java Unicode string?
Thanks

EDIT: This addresses the question from the comment

If you want to encode arbitrary binary data in a string, you should not use a normal text encoding. You don't have valid text in that encoding - you just have arbitrary binary data.

Base64 is the way to go here. Before Java 8 there was no base64 support directly in Java (in a public class, anyway), but there are various third-party libraries you can use, such as the one in the Apache Commons Codec library. Since Java 8, java.util.Base64 is part of the standard library.

Yes, base64 will increase the size of the data - but it'll allow you to decode it later without losing information.
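As a rough illustration of that round trip, here is a minimal sketch that assumes Java 8 or later, where java.util.Base64 is available. The class name Base64RoundTrip and its helper methods are made up for this example; the input is the first six bytes from the question.

```java
import java.util.Arrays;
import java.util.Base64;

class Base64RoundTrip {
    // Encode arbitrary bytes as an ASCII-only string.
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // Recover the exact original bytes from the encoded string.
    static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // First six bytes from the question, including the 0xd9 that UTF-16 decoding mangles.
        byte[] raw = { (byte) 0xff, 0x01, 0, 0, (byte) 0xd9, 0x65 };
        String text = encode(raw);
        System.out.println(text);
        // The decoded bytes should match the input exactly.
        System.out.println(Arrays.equals(raw, decode(text)));
    }
}
```

The encoded string contains only ASCII characters, so it can be stored or transmitted as ordinary text and decoded back to the original bytes later.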

EDIT: This addresses the original question

I believe that the problem is that you haven't specified a proper surrogate pair. You should specify bytes representing a high surrogate and then a low surrogate. After that, you should be able to extract the appropriate code point. In your case, you've given a high surrogate (0xd965) on its own.

Here's code to demonstrate this:

public class Test
{
    public static void main(String[] args)
        throws Exception // Just for simplicity
    {
        byte[] data = 
        {
            0, 0x41, // A
            (byte) 0xD8, 1, // High surrogate
            (byte) 0xDC, 2, // Low surrogate
            0, 0x42, // B
        };

        String text = new String(data, "UTF-16");

        System.out.printf("%x\r\n", text.codePointAt(0));
        System.out.printf("%x\r\n", text.codePointAt(1));
        // Code point at 2 is part of the surrogate pair
        System.out.printf("%x\r\n", text.codePointAt(3));       
    }
}

Output:

41
10402
42
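The loop in the question advances one char at a time, so for a valid pair it would also visit the low surrogate as a separate index. A minimal sketch of walking a string by code point instead is shown here; the class name CodePointWalk and the method codePointsOf are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointWalk {
    // Collect the code points of a string, stepping over both halves of each surrogate pair.
    static List<Integer> codePointsOf(String text) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            result.add(cp);
            i += Character.charCount(cp); // 2 for supplementary code points, otherwise 1
        }
        return result;
    }

    public static void main(String[] args) {
        // "A", then U+10402 written as its surrogate pair, then "B"
        String text = "A\uD801\uDC02B";
        for (int cp : codePointsOf(text)) {
            System.out.println(Integer.toHexString(cp));
        }
    }
}
```

Character.charCount decides how far to advance, so the index never lands in the middle of a pair.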

Is there a way to feed, identify and extract surrogate bytes (0xd800 to 0xdfff) in a Java Unicode string?

Just because no one has mentioned it, I'll point out that the Character class includes the methods for working with surrogate pairs, e.g. isHighSurrogate(char), codePointAt(CharSequence, int) and toChars(int). I realise that this is beside the point of the stated problem.
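A small sketch of how those Character methods might be used to identify and extract surrogates follows. The class name SurrogateInspect and the classify helper are invented for this example; the Character methods themselves are standard.

```java
class SurrogateInspect {
    // Describe one UTF-16 code unit: high surrogate, low surrogate, or neither.
    static String classify(char c) {
        if (Character.isHighSurrogate(c)) return "high";
        if (Character.isLowSurrogate(c)) return "low";
        return "none";
    }

    public static void main(String[] args) {
        // Split a supplementary code point into its two UTF-16 code units.
        char[] pair = Character.toChars(0x10402);
        System.out.printf("%x %x%n", (int) pair[0], (int) pair[1]);
        System.out.println(classify(pair[0]) + " " + classify(pair[1]));

        // Combine the two units back into the code point.
        System.out.printf("%x%n", Character.toCodePoint(pair[0], pair[1]));

        // A Java String can hold an unpaired surrogate; codePointAt returns it unchanged.
        String lone = String.valueOf((char) 0xd965);
        System.out.println(classify(lone.charAt(0)));
        System.out.printf("%x%n", Character.codePointAt(lone, 0));
    }
}
```

Note that the unpaired surrogate survives only because the string is built from a char value directly; it would not survive a trip through a UTF-16 decoder.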

new String(aBytes, "UTF-16");

This is a decoding operation that will transform the input data. Java's UTF-16 decoder looks for a byte order mark (0xfe 0xff or 0xff 0xfe) at the start of the input and falls back to big-endian when there is none, which is what happens with this data. In addition, not every possible byte sequence can be decoded correctly, because UTF-16 is a variable-width encoding: an unpaired surrogate is malformed input.
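To check whether a byte array decodes cleanly, rather than having bad sequences silently replaced with U+FFFD, one option is a CharsetDecoder configured to report errors. This is a minimal sketch: the class name StrictUtf16 and the method isWellFormed are hypothetical, and the sample bytes are taken from the question and from the surrogate-pair demonstration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf16 {
    // Returns true if the bytes decode as UTF-16 without any malformed sequence.
    static boolean isWellFormed(byte[] data) {
        try {
            StandardCharsets.UTF_16.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown instead of substituting the replacement character.
            return false;
        }
    }

    public static void main(String[] args) {
        // High surrogate 0xd965 followed by a non-surrogate, as in the question.
        byte[] lone = { (byte) 0xd9, 0x65, 0x03, 0x04 };
        // High surrogate followed by a low surrogate: a valid pair.
        byte[] pair = { (byte) 0xd8, 0x01, (byte) 0xdc, 0x02 };
        System.out.println(isWellFormed(lone));
        System.out.println(isWellFormed(pair));
    }
}
```

By contrast, new String(bytes, charset) substitutes a replacement character for malformed input, which is why the question's output contains fffd.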

If you wanted a symmetric transformation of arbitrary bytes to String and back, you are better off with an 8-bit, single-byte encoding because every byte value is a valid character:

import java.nio.charset.Charset;
import java.util.Arrays;

Charset iso8859_15 = Charset.forName("ISO-8859-15");
byte[] data = new byte[256];
for (int i = Byte.MIN_VALUE; i <= Byte.MAX_VALUE; i++) {
  data[i - Byte.MIN_VALUE] = (byte) i;
}
String asString = new String(data, iso8859_15);
byte[] encoded = asString.getBytes(iso8859_15);
System.out.println(Arrays.equals(data, encoded));

Note: the number of characters is going to equal the number of bytes (doubling the size of the data, since each Java char is 16 bits); the resultant string isn't necessarily going to be printable (it may contain a bunch of control characters).

I'm with Jon, though - putting arbitrary byte sequences into Java strings is almost always a bad idea.
