
Java: convert Unicode code point to string

How can a UTF-8 value like =D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0 be converted in Java?

I have tried something like:

Character.toCodePoint((char)(Integer.parseInt("D0", 16)), (char)(Integer.parseInt("93", 16)));

but it does not convert to a valid code point.

That string is an encoding of bytes in hex, so the best way is to decode the string into a byte[], then call new String(bytes, StandardCharsets.UTF_8).

Update

Here is a slightly more direct way to decode the string than the one provided by "sstan" in another answer. Of course both versions are good, so use whichever you are more comfortable with, or write your own.

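// Requires: import java.nio.charset.StandardCharsets;
String src = "=D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0";

// Each byte is written as '=' followed by two hex digits, i.e. three characters.
assert src.length() % 3 == 0;
byte[] bytes = new byte[src.length() / 3];
for (int i = 0, j = 0; i < bytes.length; i++, j += 3) {
    assert src.charAt(j) == '=';
    // Combine the high and low hex digits into one byte.
    bytes[i] = (byte)(Character.digit(src.charAt(j + 1), 16) << 4 |
                      Character.digit(src.charAt(j + 2), 16));
}
// Let the JDK decode the UTF-8 bytes into a String.
String str = new String(bytes, StandardCharsets.UTF_8);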

System.out.println(str);

Output

Газета
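The loop above assumes the input consists only of =XX triples. As a minimal sketch, assuming the input may also contain plain ASCII characters between the escapes (as in quoted-printable-style text), a more defensive decoder could look like the following. The class name HexEscapeDecoder and the method decode are hypothetical names chosen for illustration, and this does not handle the rest of quoted-printable, such as soft line breaks.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class HexEscapeDecoder {

    // Hypothetical helper: turns "=XX" escapes into bytes, copies plain ASCII
    // characters through unchanged, then decodes all bytes as UTF-8.
    static String decode(String src) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '=') {
                if (i + 2 >= src.length()) {
                    throw new IllegalArgumentException("Truncated escape at index " + i);
                }
                int hi = Character.digit(src.charAt(i + 1), 16);
                int lo = Character.digit(src.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("Invalid hex digits at index " + i);
                }
                out.write(hi << 4 | lo);
                i += 3;
            } else {
                if (c > 0x7F) {
                    throw new IllegalArgumentException("Unescaped non-ASCII character at index " + i);
                }
                out.write(c);
                i++;
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(decode("=D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0"));
        System.out.println(decode("No. 1 =D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0"));
    }
}
```

Unlike the assert-based checks, the explicit exceptions here fire even when assertions are disabled, which is the JVM default.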

In UTF-8, a single character is not always encoded with the same number of bytes. Depending on the character, it may take 1, 2, 3, or 4 bytes. Mapping UTF-8 bytes yourself to Java char values is therefore not trivial: Java strings use UTF-16, where each char is a 2-byte code unit. On top of that, characters with code points above 0xFFFF need a surrogate pair (two chars), which is one more complication that is easy to get wrong.

All this to say that Andreas is absolutely right. You should focus on parsing your string into a byte array, and then let the built-in libraries convert the UTF-8 bytes to a Java string for you. From a Java String, it's trivial to extract the Unicode code points if that's what you want.

Here is some sample code that shows one way this can be achieved:
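To make the byte layout concrete, here is a small sketch that decodes just the first two bytes of the question's input (D0 93) by hand. It only covers the two-byte form (110xxxxx 10xxxxxx) and skips checks a real decoder needs, such as rejecting overlong encodings. The class and method names are hypothetical. It also shows why the attempt in the question fails: Character.toCodePoint expects a UTF-16 surrogate pair, not UTF-8 bytes.

```java
public class TwoByteUtf8Demo {

    // Decodes one two-byte UTF-8 sequence (110xxxxx 10xxxxxx) into a code point.
    // Illustration only: 1-, 3- and 4-byte sequences are not handled.
    static int decodeTwoByteSequence(int lead, int continuation) {
        if ((lead & 0xE0) != 0xC0 || (continuation & 0xC0) != 0x80) {
            throw new IllegalArgumentException("Not a two-byte UTF-8 sequence");
        }
        // 5 payload bits from the lead byte, 6 from the continuation byte.
        return (lead & 0x1F) << 6 | (continuation & 0x3F);
    }

    public static void main(String[] args) {
        int codePoint = decodeTwoByteSequence(0xD0, 0x93);
        System.out.println(codePoint); // 1043, i.e. U+0413
        System.out.println(new String(Character.toChars(codePoint)));

        // 0xD0 and 0x93 are not UTF-16 surrogates, so toCodePoint cannot combine them.
        System.out.println(Character.isSurrogatePair((char) 0xD0, (char) 0x93)); // false
    }
}
```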

// Requires: import java.nio.charset.StandardCharsets; import java.util.Arrays;
public static void main(String[] args) throws Exception {
    String src = "=D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0";

    // Parse string into hex string tokens.
    String[] tokens = Arrays.stream(src.split("="))
            .filter(s -> s.length() != 0)
            .toArray(String[]::new);

    // Convert the hex string representations to a byte array.
    byte[] utf8bytes = new byte[tokens.length];
    for (int i = 0; i < utf8bytes.length; i++) {
        utf8bytes[i] = (byte) Integer.parseInt(tokens[i], 16);
    }

    // Convert UTF-8 bytes to Java String.
    String str = new String(utf8bytes, StandardCharsets.UTF_8);

    // Display string + individual unicode code points.
    System.out.println(str);
    str.codePoints().forEach(System.out::println);
}

Output:

Газета
1043
1072
1079
1077
1090
1072
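For the opposite direction, which is what the question title literally asks about, a String can be built from code points directly. The following is a minimal sketch using the String(int[], int, int) constructor; the class name CodePointsToString and the helper fromCodePoints are hypothetical names for illustration.

```java
public class CodePointsToString {

    // Hypothetical helper: builds a String from Unicode code points.
    static String fromCodePoints(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        // The code points printed above, turned back into text.
        String word = fromCodePoints(1043, 1072, 1079, 1077, 1090, 1072);
        System.out.println(word);

        // A code point above 0xFFFF is stored as a surrogate pair (two chars).
        String supplementary = fromCodePoints(0x1F600);
        System.out.println(supplementary.length());                                  // 2
        System.out.println(supplementary.codePointCount(0, supplementary.length())); // 1
    }
}
```

For a single code point, new String(Character.toChars(codePoint)) or StringBuilder.appendCodePoint are alternatives that work the same way.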

