
4-byte Unicode character in Java

I am writing unit tests for my custom StringDatatype, and I need to write down a 4-byte Unicode character. "\U" does not work (illegal escape character error). For example: U+1F701 (0xf0 0x9f 0x9c 0x81). How can it be written in a string?

A Unicode code point is not 4 bytes; it is an integer (ranging, at the moment, from U+0000 to U+10FFFF).

Your 4 bytes are (wild guess) its UTF-8 encoding (edit: I was right).
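One way to check that guess yourself is to decode the four bytes as UTF-8 and look at the resulting code point. This is only a minimal sketch; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class Utf8BytesToCodePoint {
    // Decodes UTF-8 bytes and returns the first code point of the resulting string.
    static int firstCodePoint(byte[] utf8) {
        String s = new String(utf8, StandardCharsets.UTF_8);
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        // The four bytes quoted in the question.
        byte[] bytes = {(byte) 0xF0, (byte) 0x9F, (byte) 0x9C, (byte) 0x81};
        System.out.printf("U+%X%n", firstCodePoint(bytes)); // prints U+1F701
    }
}
```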

You need to do this:

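import java.nio.charset.StandardCharsets;

// Code point -> UTF-16 chars -> String -> UTF-8 bytes
final char[] chars = Character.toChars(0x1F701);
final String s = new String(chars);
final byte[] asBytes = s.getBytes(StandardCharsets.UTF_8); // 0xF0 0x9F 0x9C 0x81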

When Java was created, Unicode did not define code points outside the BMP (i.e., U+0000 to U+FFFF), which is why a char is only 16 bits long (OK, this is only a guess, but I think I'm not far off the mark). Since then it has had to adapt: code points outside the BMP need two chars (a leading surrogate and a trailing surrogate; Java calls these a high and a low surrogate respectively). There is no character literal in Java that can hold a code point outside the BMP directly.
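To make the surrogate mechanism concrete, here is a rough sketch that splits a code point outside the BMP into its two chars by hand and compares the result with the JDK helpers Character.highSurrogate and Character.lowSurrogate (Java 7 and later). The class and method names are hypothetical.

```java
class SurrogatePairDemo {
    // Splits a supplementary code point into its UTF-16 surrogate pair by hand.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("code point is not outside the BMP");
        }
        int offset = codePoint - 0x10000;               // 20 significant bits remain
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F701);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // prints D83D DF01
        // The JDK helpers agree with the manual computation.
        System.out.println(pair[0] == Character.highSurrogate(0x1F701)); // true
        System.out.println(pair[1] == Character.lowSurrogate(0x1F701));  // true
    }
}
```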

Given that a char is, in fact, a UTF-16 code unit and that string literals accept escapes for these, you can enter this "character" in a String as "\uD83D\uDF01", or directly as the symbol if your computing environment supports it.
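As a quick sanity check, the sketch below writes U+1F701 as a string literal made of its two UTF-16 code units (which work out to D83D and DF01) and compares it with the string built from Character.toChars. The class name and the constant are illustrative only.

```java
class SurrogateEscapeLiteral {
    // U+1F701 written as its two UTF-16 code units.
    static final String ESCAPED = "\uD83D\uDF01";

    public static void main(String[] args) {
        String built = new String(Character.toChars(0x1F701));
        System.out.println(ESCAPED.equals(built));            // true
        System.out.println(ESCAPED.length());                 // 2 (two chars)
        System.out.printf("U+%X%n", ESCAPED.codePointAt(0));  // prints U+1F701
    }
}
```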

See also the CharsetDecoder and CharsetEncoder classes.
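For a unit test, these classes can be handy because they can be told to report bad input instead of silently replacing it. The following is a sketch under that assumption; the class and method names are invented for the example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8RoundTrip {
    // Encodes to UTF-8, throwing on malformed input such as an unpaired surrogate.
    static byte[] encode(String s) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer buf = encoder.encode(CharBuffer.wrap(s));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    // Decodes UTF-8, throwing on invalid byte sequences.
    static String decode(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] bytes = encode(new String(Character.toChars(0x1F701)));
        for (byte b : bytes) {
            System.out.printf("%02x ", b & 0xFF); // prints f0 9f 9c 81
        }
        System.out.println();
        System.out.printf("U+%X%n", decode(bytes).codePointAt(0)); // prints U+1F701
    }
}
```

By contrast, String.getBytes and new String(bytes, charset) substitute a replacement for malformed input, so a broken surrogate pair would pass through unnoticed.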

See also String.codePointCount() and, since Java 8, String.codePoints() (inherited from CharSequence).
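A small illustration of iterating by code point rather than by char, assuming Java 8 or later. The class and helper names are hypothetical.

```java
class CodePointIteration {
    // Returns the code points of a string, one int per code point.
    static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        // "a", then U+1F701 (outside the BMP), then "b".
        String s = "a" + new String(Character.toChars(0x1F701)) + "b";
        System.out.println(s.length());                      // 4 chars
        System.out.println(s.codePointCount(0, s.length())); // 3 code points
        for (int cp : codePointsOf(s)) {
            System.out.printf("U+%04X%n", cp); // U+0061, U+1F701, U+0062
        }
    }
}
```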

String s = "𩸽";

Technically this is one character. But be careful: s.length() will return 2. Java also won't compile char c = '𩸽', because the character does not fit in a single char. Java does not promise that String.length() returns the exact number of characters; it returns only the number of Java chars required to store the string.

The real number of characters (code points) can be obtained from s.codePointCount(0, s.length()).
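The difference can be seen side by side in a short sketch. It builds the character from its code point, U+29E3D, rather than pasting the symbol, so that the source file encoding does not matter. The class and method names are illustrative.

```java
class LengthVersusCodePoints {
    // Number of UTF-16 code units (what String.length() reports).
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in the whole string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x29E3D));
        System.out.println(charCount(s));                           // 2
        System.out.println(codePointCount(s));                      // 1
        // charAt(0) yields only half of the character: a high surrogate.
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.printf("U+%X%n", s.codePointAt(0));              // prints U+29E3D
    }
}
```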

jshell> String s = "🏳️";
s ==> "🏳️"

jshell> s.codePointCount(0, s.length());
$5 ==> 2
