4-byte Unicode characters in Java

Question

I am writing unit tests for my custom StringDatatype, and I need to write a 4-byte Unicode character in a string. "\U" does not work (illegal escape character error). For example: U+1F701 (0xf0 0x9f 0x9c 0x81). How can it be written in a string?

Answer 1

A Unicode code point is not 4 bytes; it is an integer (ranging, at the moment, from U+0000 to U+10FFFF).

Your 4 bytes are (wild guess) its UTF-8 encoded form (edit: I was right).

You need to do this:

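// requires: import java.nio.charset.StandardCharsets;
final char[] chars = Character.toChars(0x1F701);            // UTF-16 code units for U+1F701
final String s = new String(chars);
final byte[] asBytes = s.getBytes(StandardCharsets.UTF_8);  // the 4 bytes 0xf0 0x9f 0x9c 0x81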

When Java was created, Unicode did not define code points outside the BMP (i.e., U+0000 to U+FFFF), which is the reason why a char is only 16 bits long (well, OK, this is only a guess, but I think I'm not far off the mark here). Since then, well, it had to adapt... Code points outside the BMP need two chars: a leading surrogate and a trailing surrogate, which Java calls a high and a low surrogate respectively. There is no character literal in Java that lets you enter a code point outside the BMP directly.
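To make the surrogate mechanism concrete, here is a minimal sketch that splits a code point outside the BMP into its two chars by hand and compares the result with the JDK helpers. The class name SurrogateDemo and the method toSurrogates are made up for illustration; Character.highSurrogate and Character.lowSurrogate are standard since Java 7.

```java
// Sketch: splitting a supplementary code point into its UTF-16 surrogate pair.
class SurrogateDemo {

    // Manual computation following the UTF-16 encoding rules (hypothetical helper).
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException(
                    "not outside the BMP: " + Integer.toHexString(codePoint));
        }
        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int cp = 0x1F701;
        char[] manual = toSurrogates(cp);
        System.out.printf("manual:  %04X %04X%n", (int) manual[0], (int) manual[1]);
        System.out.printf("library: %04X %04X%n",
                (int) Character.highSurrogate(cp), (int) Character.lowSurrogate(cp));
    }
}
```

For U+1F701 both lines should show D83D DF01, the same pair that Character.toChars(0x1F701) returns.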

Given that a char is, in fact, a UTF-16 code unit and that string literals accept escapes for these, you can enter this "character" in a String as "\uD83D\uDF01", or directly as the symbol if your computing environment supports it.
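As a quick check of the escape form, the sketch below writes U+1F701 as its two UTF-16 escapes and confirms that it equals the string built with Character.toChars and that it encodes to the four UTF-8 bytes from the question. EscapeDemo and its hex helper are illustrative names, not part of any library.

```java
import java.nio.charset.StandardCharsets;

// Sketch: a supplementary character written as two UTF-16 escapes.
class EscapeDemo {

    // U+1F701 as its surrogate pair.
    static final String ESCAPED = "\uD83D\uDF01";

    // Hypothetical helper: formats bytes as lowercase hex separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String fromCodePoint = new String(Character.toChars(0x1F701));
        System.out.println(ESCAPED.equals(fromCodePoint));                  // true
        System.out.println(hex(ESCAPED.getBytes(StandardCharsets.UTF_8)));  // f0 9f 9c 81
    }
}
```

The escape form has the advantage that the source file stays pure ASCII, so the result does not depend on the encoding the compiler assumes for the file.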

See also the CharsetDecoder and CharsetEncoder classes.

See also String.codePointCount(), and, since Java 8, String.codePoints() (inherited from CharSequence).
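A small sketch of how codePoints() might be used to walk a string by code point rather than by char. It assumes Java 8 or later; CodePointsDemo and describe are illustrative names.

```java
import java.util.stream.Collectors;

// Sketch: iterating over code points instead of chars (Java 8+).
class CodePointsDemo {

    // Hypothetical helper: renders every code point of s in U+XXXX notation.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String s = "a" + new String(Character.toChars(0x1F701)) + "b";
        System.out.println(describe(s));                       // U+0061 U+1F701 U+0062
        System.out.println(s.length());                        // 4 chars
        System.out.println(s.codePointCount(0, s.length()));   // 3 code points
    }
}
```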

Answer 2

String s = "𩸽";

Technically this is one character. But be careful: s.length() will return 2. Also, Java won't compile the character literal '𩸽', because the character does not fit in a single char. Java does not promise that String.length() returns the exact number of characters; it returns only the number of Java chars required to store the string.

The real number of characters can be obtained from s.codePointCount(0, s.length()).
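The difference between the two counts can be checked with a short sketch. It builds the string from a code point instead of pasting the symbol, so it does not depend on the source file encoding; U+29E3D is my assumption for the code point of the character in the example, and LengthDemo and sample are illustrative names.

```java
// Sketch: String.length() counts chars, codePointCount() counts code points.
class LengthDemo {

    // Assumption: U+29E3D is the code point of the character used in the example.
    static String sample() {
        return new String(Character.toChars(0x29E3D));
    }

    public static void main(String[] args) {
        String s = sample();
        System.out.println(s.length());                              // 2 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length()));         // 1 (code points)
        System.out.println(Character.isHighSurrogate(s.charAt(0)));  // true: charAt returns half a pair
        System.out.printf("U+%X%n", s.codePointAt(0));               // U+29E3D
    }
}
```

Note that charAt(0) yields only the high surrogate, so code that indexes by char can split such a character in two; codePointAt reads the whole pair.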

Answer 3

jshell> String s = "🏳️";
s ==> "🏳️"

jshell> s.codePointCount(0, s.length());
$5 ==> 2
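A count of 2 for what looks like a single symbol is consistent with the flag being followed by an invisible variation selector. The sketch below assumes the string is U+1F3F3 followed by U+FE0F (the post does not say so explicitly) and lists its code points; FlagDemo and codePointsHex are illustrative names.

```java
// Sketch: one visible symbol can consist of several code points.
class FlagDemo {

    // Assumption: the jshell string is U+1F3F3 followed by the variation selector U+FE0F.
    static final String FLAG = "\uD83C\uDFF3\uFE0F";

    // Hypothetical helper: lists the code points of s in U+XXXX notation.
    static String codePointsHex(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(FLAG.length());                          // 3 chars
        System.out.println(FLAG.codePointCount(0, FLAG.length()));  // 2 code points
        System.out.println(codePointsHex(FLAG));                    // U+1F3F3 U+FE0F
    }
}
```

So codePointCount gives the number of code points, which can still be larger than the number of symbols a reader perceives.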

Answer 1: 20, accepted, 2014-12-04 06:06:59

Answer 2: 9, 2018-07-25 07:15:07

Answer 3: 0, 2020-12-03 03:58:34
