
Java Unicode encoding

A Java char is 2 bytes (max size of 65,536) but there are 95,221 Unicode characters. Does this mean that you can't handle certain Unicode characters in a Java application?

Does this boil down to what character encoding you are using?

You can handle them all if you're careful enough.

Java's char is a UTF-16 code unit. A character with a code point above 0xFFFF is encoded with two chars (a surrogate pair).
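As a rough illustration of the surrogate-pair encoding described above, the sketch below prints the UTF-16 code units that Character.toChars produces for one BMP code point and one supplementary code point. The class and method names (SurrogatePairDemo, codeUnitsHex) are made up for this example.

```java
class SurrogatePairDemo {

    // Returns the UTF-16 code units of one code point as space-separated hex.
    // Hypothetical helper, not part of the JDK.
    static String codeUnitsHex(int codePoint) {
        StringBuilder hex = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%04X", (int) unit));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        System.out.println(codeUnitsHex(0x0061));   // one code unit: 0061
        System.out.println(codeUnitsHex(0x1F408));  // two code units: D83D DC08
    }
}
```

A code point in the BMP yields a single char, while anything above 0xFFFF yields a high surrogate followed by a low surrogate.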

See http://www.oracle.com/us/technologies/java/supplementary-142654.html for how to handle those characters in Java.

(BTW, in Unicode 5.2 there are 107,154 assigned characters out of 1,114,112 slots.)

Java uses UTF-16. A single Java char can only represent characters from the basic multilingual plane. Other characters have to be represented by a surrogate pair of two chars. This is reflected by API methods such as String.codePointAt().

And yes, this means that a lot of Java code will break in one way or another when used with characters outside the basic multilingual plane.
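One way such breakage might show up, as a sketch: reversing a string char by char splits surrogate pairs, whereas StringBuilder.reverse() is documented to treat a surrogate pair as a single character. The class and method names here are invented for the illustration.

```java
class ReverseDemo {

    // Naive reversal that treats every char as a full character.
    // It swaps the two halves of any surrogate pair, producing invalid UTF-16.
    static String naiveReverse(String s) {
        char[] in = s.toCharArray();
        char[] out = new char[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = in[in.length - 1 - i];
        }
        return new String(out);
    }

    // StringBuilder.reverse() keeps surrogate pairs in the right order.
    static String safeReverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    public static void main(String[] args) {
        // "a" followed by U+1F408 (CAT), built from its code point.
        String text = "a" + new String(Character.toChars(0x1F408));

        String broken = naiveReverse(text);
        String intact = safeReverse(text);

        System.out.println(Character.isLowSurrogate(broken.charAt(0))); // true: the pair was split
        System.out.println(intact.codePointAt(0) == 0x1F408);           // true: the pair survived
    }
}
```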

To add to the other answers, some points to remember:

  • A Java char always takes 16 bits.

  • A Unicode character, when encoded as UTF-16, takes "almost always" (not always) 16 bits: that's because there are more than 64K Unicode characters. Hence, a Java char is NOT a Unicode character (though "almost always" it is).

  • "Almost always", above, means the first 64K code points of Unicode, range 0x0000 to 0xFFFF (the BMP), which take 16 bits in the UTF-16 encoding.

  • A non-BMP ("rare") Unicode character is represented as two Java chars (surrogate representation). This also applies to the literal representation as a string: for example, the character U+20000 is written as "\uD840\uDC00".

  • Corollary: string.length() returns the number of Java chars, not of Unicode characters. A string that has just one "rare" Unicode character (e.g. U+20000) would return length() = 2. The same consideration applies to any method that deals with char sequences.

  • Java has little intelligence for dealing with non-BMP Unicode characters as a whole. There are some utility methods that treat characters as code points, represented as ints, e.g. Character.isLetter(int ch). Those are the real fully-Unicode methods.
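The length() corollary above can be checked with a short sketch. It compares String.length() with String.codePointCount() for the U+20000 example; the class and helper names are placeholders chosen for this illustration.

```java
class LengthVersusCodePoints {

    // Number of UTF-16 code units (Java chars) in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String rare = "\uD840\uDC00"; // the single character U+20000 as a surrogate pair

        System.out.println(charCount(rare));       // 2
        System.out.println(codePointCount(rare));  // 1
    }
}
```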

You said:

A Java char is 2 bytes (max size of 65,536) but there are 95,221 Unicode characters.

Unicode grows

Actually, the inventory of characters defined in Unicode has grown dramatically. Unicode continues to grow, and not just because of emojis.

  • 143,859 characters in Unicode 13 (Java 15; see its release notes)
  • 137,994 characters in Unicode 12.1 (Java 13 & 14)
  • 136,755 characters in Unicode 10 (Java 11 & 12)
  • 120,737 characters in Unicode 8 (Java 9)
  • 110,182 characters in Unicode 6.2 (Java 8)
  • 109,449 characters in Unicode 6.0 (Java 7)
  • 96,447 characters in Unicode 4.0 (Java 5 & 6)
  • 49,259 characters in Unicode 3.0 (Java 1.4)
  • 38,952 characters in Unicode 2.1 (Java 1.1.7)
  • 38,950 characters in Unicode 2.0 (Java 1.1)
  • 34,233 characters in Unicode 1.1.5 (Java 1.0)

char is legacy

The char type is long outmoded, now legacy.

Use code point numbers

Instead, you should be working with code point numbers.


You asked:

Does this mean that you can't handle certain Unicode characters in a Java application?

A single char can address less than half of today's Unicode characters.

To represent any Unicode character, use code point numbers. Never use char.

Every character in Unicode is assigned a code point number. These range over a million, from 0 to 1,114,111 (U+10FFFF), for 1,114,112 possible values. Comparing that to the counts listed above, most of the numbers in that range have not yet been assigned to a character. Some of those numbers are reserved as Private Use Areas and will never be assigned.
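The limits of that range are exposed on the Character class, so they can be checked rather than memorized. The sketch below uses Character.MAX_CODE_POINT, Character.isValidCodePoint, and Character.getType; the class name and the isPrivateUse helper are only illustrative.

```java
class CodePointRange {

    // True if the code point falls in a Private Use Area.
    // Hypothetical helper built on Character.getType.
    static boolean isPrivateUse(int codePoint) {
        return Character.getType(codePoint) == Character.PRIVATE_USE;
    }

    public static void main(String[] args) {
        System.out.println(Character.MAX_CODE_POINT);                 // 1114111, i.e. 0x10FFFF
        System.out.println(Character.isValidCodePoint(0x10FFFF));     // true: last valid code point
        System.out.println(Character.isValidCodePoint(0x110000));     // false: one past the end
        System.out.println(isPrivateUse(0xE000));                     // true: start of the BMP Private Use Area
    }
}
```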

The String class has gained methods for working with code point numbers, as did the Character class.

Get the code point number for any character in a string, by zero-based index number. (The index counts char values, so a supplementary character occupies two positions.) Here we get 97 for the letter a.

int codePoint = "Cat".codePointAt( 1 ) ; // 97 = 'a', hex U+0061, LATIN SMALL LETTER A.

For the more general CharSequence rather than String, use Character.codePointAt.

We can get the Unicode name for a code point number.
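A minimal sketch of that CharSequence variant, using a StringBuilder as the input. The wrapper method at() and the class name exist only for this example.

```java
class CodePointAtDemo {

    // Thin wrapper around Character.codePointAt for any CharSequence.
    static int at(CharSequence text, int index) {
        return Character.codePointAt(text, index);
    }

    public static void main(String[] args) {
        CharSequence text = new StringBuilder("Cat");
        System.out.println(at(text, 1)); // 97, the letter a
    }
}
```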

String name = Character.getName( 97 ) ; // letter `a`

LATIN SMALL LETTER A

We can get a stream of the code point numbers of all the characters in a string.

IntStream codePointsStream = "Cat".codePoints() ;

We can turn that into a List of Integer objects. See the question "How do I convert a Java 8 IntStream to a List?".

List< Integer > codePointsList = codePointsStream.boxed().collect( Collectors.toList() ) ;

Any code point number can be changed into a String of a single character by calling Character.toString (the overload taking an int, added in Java 11).

String s = Character.toString( 97 ) ; // 97 is `a`, LATIN SMALL LETTER A. 

a

We can produce a String object from an IntStream of code point numbers. See the question "Make a string from an IntStream of code point numbers?".

IntStream intStream = IntStream.of( 67 , 97 , 116 , 32 , 128_008 ); // 32 = SPACE, 128,008 = CAT (emoji).

String output =
        intStream
                .collect(                                     // Collect the results of processing each code point.
                        StringBuilder :: new ,                // Supplier<R> supplier
                        StringBuilder :: appendCodePoint ,    // ObjIntConsumer<R> accumulator
                        StringBuilder :: append               // BiConsumer<R, R> combiner
                )                                             // Returns the `StringBuilder`, which is a `CharSequence`.
                .toString();                                  // If you would rather have a `String` than `CharSequence`, call `toString`.

Cat 🐈


You asked:

Does this boil down to what character encoding you are using?

Internally, a String in Java always uses UTF-16.

You only use other character encodings when importing or exporting text in or out of Java strings.

So, to answer your question, no, character encoding is not directly related here. Once you get your text into a Java String, it is in UTF-16 encoding and can therefore contain any Unicode character. Of course, to see that character, you must be using a font with a glyph defined for that particular character.

When exporting text from Java strings, if you specify a legacy character encoding that cannot represent some of the Unicode characters used in your text, you will have a problem. So use a modern character encoding, which nowadays means UTF-8, as UTF-16 is now considered harmful.
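To make the export problem concrete, here is a small sketch that round-trips a string through two charsets. ISO-8859-1 stands in for "a legacy encoding" purely as an example; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExportEncodingDemo {

    // Encodes the text to bytes and decodes it again with the same charset.
    // String.getBytes(Charset) silently substitutes unmappable characters.
    static String roundTrip(String text, Charset charset) {
        byte[] exported = text.getBytes(charset);
        return new String(exported, charset);
    }

    // Asks the charset's encoder up front whether it can represent the text.
    static boolean canRepresent(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // "Cat " followed by U+1F408 (CAT), built from its code point.
        String text = "Cat " + new String(Character.toChars(0x1F408));

        System.out.println(canRepresent(text, StandardCharsets.UTF_8));             // true
        System.out.println(canRepresent(text, StandardCharsets.ISO_8859_1));        // false
        System.out.println(text.equals(roundTrip(text, StandardCharsets.UTF_8)));      // true: lossless
        System.out.println(text.equals(roundTrip(text, StandardCharsets.ISO_8859_1))); // false: emoji lost
    }
}
```

Checking canEncode before exporting is one option when a legacy charset is unavoidable; otherwise the substitution happens without any exception.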

Here's Oracle's documentation on Unicode Character Representations. Or, if you prefer, a more thorough documentation here.

The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value. (Refer to the definition of the U+n notation in the Unicode standard.)

The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java 2 platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).

A char value, therefore, represents Basic Multilingual Plane (BMP) code points, including the surrogate code points, or code units of the UTF-16 encoding. An int value represents all Unicode code points, including supplementary code points. The lower (least significant) 21 bits of int are used to represent Unicode code points and the upper (most significant) 11 bits must be zero. Unless otherwise specified, the behavior with respect to supplementary characters and surrogate char values is as follows:

  • The methods that only accept a char value cannot support supplementary characters. They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value if followed by any low-surrogate value in a string would represent a letter.
  • The methods that accept an int value support all Unicode characters, including supplementary characters. For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (a CJK ideograph).
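The difference between the char and int overloads can be seen by counting letters both ways. This is only a sketch; the class and the two counting methods are invented for the comparison.

```java
class IsLetterDemo {

    // Counts letters by inspecting each char with Character.isLetter(char).
    // Surrogate halves are never letters, so supplementary letters are missed.
    static int lettersByChar(String s) {
        int count = 0;
        for (char c : s.toCharArray()) {
            if (Character.isLetter(c)) {
                count++;
            }
        }
        return count;
    }

    // Counts letters by inspecting each code point with Character.isLetter(int).
    static int lettersByCodePoint(String s) {
        return (int) s.codePoints().filter(Character::isLetter).count();
    }

    public static void main(String[] args) {
        String ideograph = "\uD840\uDC00"; // U+20000, a CJK ideograph outside the BMP

        System.out.println(lettersByChar(ideograph));       // 0
        System.out.println(lettersByCodePoint(ideograph));  // 1
    }
}
```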

Have a look at the Unicode 4.0 support in J2SE 1.5 article to learn more about the tricks invented by Sun to provide support for all Unicode 4.0 code points.

In summary, you'll find the following changes for Unicode 4.0 in Java 1.5:

  • char is a UTF-16 code unit, not a code point
  • new low-level APIs use an int to represent a Unicode code point
  • high-level APIs have been updated to understand surrogate pairs
  • a preference towards char sequence APIs instead of char-based methods
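A few of those int-based low-level APIs, sketched in use: Character.isHighSurrogate, Character.isLowSurrogate, and Character.toCodePoint combine a surrogate pair back into one code point. The class name, the decode helper, and its error handling are assumptions for this example.

```java
class SurrogateDecodeDemo {

    // Combines a high and a low surrogate into the code point they encode.
    // Rejects anything that is not a well-formed pair.
    static int decode(char high, char low) {
        if (!Character.isHighSurrogate(high) || !Character.isLowSurrogate(low)) {
            throw new IllegalArgumentException("not a surrogate pair");
        }
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        int codePoint = decode('\uD840', '\uDC00');
        System.out.println(Integer.toHexString(codePoint));                 // 20000
        System.out.println(Character.isSupplementaryCodePoint(codePoint));  // true
    }
}
```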

Since Java doesn't have 32-bit chars, I'll let you judge if we can call this good Unicode support.

From the OpenJDK7 documentation for String:

A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.
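Because index values refer to char code units, looking up "the n-th character" needs a translation step. The sketch below does that with String.offsetByCodePoints; the class and the nthCodePoint helper are hypothetical names.

```java
class CodePointIndexing {

    // Returns the n-th code point (zero-based), counting code points, not chars.
    static int nthCodePoint(String s, int n) {
        int charIndex = s.offsetByCodePoints(0, n); // translate code point offset to char index
        return s.codePointAt(charIndex);
    }

    public static void main(String[] args) {
        // U+1F408 (CAT) followed by "at"; the emoji occupies char positions 0 and 1.
        String text = new String(Character.toChars(0x1F408)) + "at";

        System.out.println(Character.isLowSurrogate(text.charAt(1))); // true: index 1 is half of the emoji
        System.out.println(nthCodePoint(text, 1));                    // 97, the letter a
    }
}
```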
