
4 byte unicode character in Java

I am writing unit tests for my custom StringDatatype, and I need to write a 4-byte Unicode character in a string literal. "\\U" does not work (illegal escape character error). For example: U+1F701 (0xf0 0x9f 0x9c 0x81). How can it be written in a string?

A Unicode code point is not 4 bytes; it is an integer (ranging, at the moment, from U+0000 to U+10FFFF).

Your 4 bytes are (wild guess) its UTF-8 encoded form (edit: I was right).

You need to do this:

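import java.nio.charset.StandardCharsets;
// Build the string from the code point, then encode it as UTF-8: 0xF0 0x9F 0x9C 0x81
final char[] chars = Character.toChars(0x1F701);
final String s = new String(chars);
final byte[] asBytes = s.getBytes(StandardCharsets.UTF_8);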
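For a unit test it can help to confirm that the code point really encodes to the four bytes from the question. The following is a minimal sketch; the class name CodePointToUtf8 and the helper utf8Of are made up for illustration, and StringBuilder.appendCodePoint is used here as an alternative to Character.toChars.

```java
import java.nio.charset.StandardCharsets;

public class CodePointToUtf8 {
    // Hypothetical helper: builds a String from one code point and
    // returns its UTF-8 encoding.
    static byte[] utf8Of(int codePoint) {
        String s = new StringBuilder().appendCodePoint(codePoint).toString();
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        StringBuilder hex = new StringBuilder();
        for (byte b : utf8Of(0x1F701)) {
            // Mask to avoid sign extension when formatting the byte.
            hex.append(String.format("%02x ", b & 0xFF));
        }
        System.out.println(hex.toString().trim()); // prints "f0 9f 9c 81"
    }
}
```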

When Java was created, Unicode did not define code points outside the BMP (i.e., U+0000 to U+FFFF), which is the reason why a char is only 16 bits long (this is only a guess, but I think I'm not far off the mark here); since then, Java has had to adapt. Code points outside the BMP need two chars (a leading surrogate and a trailing surrogate; Java calls these a high and a low surrogate, respectively). There is no character literal in Java that can hold a code point outside the BMP directly.

Given that a char is, in fact, a UTF-16 code unit and that there are string literals for these, you can input this "character" in a String as "\uD83D\uDF01" (the surrogate pair for U+1F701), or directly as the symbol if your computing environment has support for it.
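The surrogate pair for a given code point does not have to be worked out by hand: Character.highSurrogate and Character.lowSurrogate (available since Java 7) return the two code units. A small sketch follows; the class SurrogatePairDemo and the helper escapeFor are illustrative names, not part of any library.

```java
public class SurrogatePairDemo {
    // Hypothetical helper: returns the Java escape text (two backslash-u
    // sequences) for a code point outside the BMP.
    static String escapeFor(int codePoint) {
        char high = Character.highSurrogate(codePoint);
        char low = Character.lowSurrogate(codePoint);
        // Cast to int: the %X conversion does not accept a char argument.
        return String.format("\\u%04X\\u%04X", (int) high, (int) low);
    }

    public static void main(String[] args) {
        String fromEscape = "\uD83D\uDF01";
        String fromCodePoint = new String(Character.toChars(0x1F701));
        System.out.println(fromEscape.equals(fromCodePoint)); // prints "true"
        System.out.println(escapeFor(0x1F701));
    }
}
```

The escaped form has the advantage of not depending on the encoding javac uses to read the source file.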

See also the CharsetDecoder and CharsetEncoder classes.
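As one possible use of CharsetDecoder in a test, the sketch below decodes the four bytes from the question strictly, so that malformed input raises an error instead of being replaced with U+FFFD. The class StrictUtf8Decode and its decode helper are assumptions made for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8Decode {
    // Hypothetical helper: decodes UTF-8 bytes and fails on invalid input
    // rather than silently substituting a replacement character.
    static String decode(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("not valid UTF-8", e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xF0, (byte) 0x9F, (byte) 0x9C, (byte) 0x81};
        String s = decode(bytes);
        System.out.println(Integer.toHexString(s.codePointAt(0))); // prints "1f701"
    }
}
```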

See also String.codePointCount(), and, since Java 8, String.codePoints() (inherited from CharSequence).
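To illustrate codePoints(), here is a short sketch that walks a string by code point rather than by char. The class CodePointsDemo and the helper describe are hypothetical names chosen for the example.

```java
import java.util.stream.Collectors;

public class CodePointsDemo {
    // Hypothetical helper: formats every code point of s as U+XXXX,
    // separated by single spaces.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String s = "a" + new String(Character.toChars(0x1F701)) + "b";
        System.out.println(describe(s)); // prints "U+0061 U+1F701 U+0062"
    }
}
```

Iterating with charAt() over the same string would instead yield four values, two of them surrogates.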

String s = "𩸽";

Technically this is one character. But be careful: s.length() will return 2. Java also won't compile a char literal such as '𩸽', because a char holds only a single UTF-16 code unit. Java does not promise that String.length() returns the exact number of characters; it returns only the number of Java chars required to store the string.

The real number of characters (code points) can be obtained from s.codePointCount(0, s.length()).
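The difference between the two counts can be checked with a few lines. This is only a sketch: the class LengthVsCodePoints and the helper codePointLength are illustrative, and U+1F701 from the question is used in place of a literal so the result does not depend on the source file encoding.

```java
public class LengthVsCodePoints {
    // Hypothetical helper: number of Unicode code points in s,
    // as opposed to the number of UTF-16 code units.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "x" + new String(Character.toChars(0x1F701));
        System.out.println(s.length());         // prints 3
        System.out.println(codePointLength(s)); // prints 2
        // The char at index 1 is only half of the second code point.
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // prints true
    }
}
```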

jshell> String s = "🏳️";
s ==> "🏳️"

jshell> s.codePointCount(0, s.length());
$5 ==> 2
