
In Java, how are Unicode chars and Java UTF-16 codepoints handled?

I'm struggling with Unicode characters in Java 10. I'm using the java.text.BreakIterator class. For this output:

myString="a𝓞b"  hex=0061d835dcde0062
myString.length()=4 
myString.codePointCount(0,s.length())=3
BreakIterator output:
    a    hex=0061           
    𝓞    hex=d835dcde          
    b    hex=0062

Seems correct.

The same Java code gives this output for a second string:

myString="G̲íl"  hex=0047033200ed006c  
myString.length()=4 
myString.codePointCount(0,s.length())=4
BreakIterator output:   
    G̲    hex=00470332  
    í    hex=00ed  
    l    hex=006c  

This seems correct too, EXCEPT for codePointCount=4.
Why isn't it 3, and is there a way to get 3 without using BreakIterator?

My goal is to determine whether every displayed character of a string fits in a single 16-bit char, or whether surrogates or combining characters are present.

"G̲íl" is four code points: U+0047, U+0332, U+00ED, U+006C.

U+0332 is a combining character, but it is a separate code point. That's not the same as your first example, which requires a surrogate pair (2 UTF-16 code units) to represent U+1D4DE, but the latter is still a single code point.
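To see the difference between code units and code points directly, here is a minimal sketch. The class name CodePointDemo and the helper describe are made up for illustration; only String.length(), String.codePointCount() and String.codePoints() come from the JDK.

```java
class CodePointDemo {
    // Lists the code points of a string as "U+XXXX" values (hypothetical helper).
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // First example: a, U+1D4DE written as its surrogate pair, b.
        String first = "a\uD835\uDCDEb";
        // Second example: G, U+0332 (combining low line), U+00ED, l.
        String second = "G\u0332\u00EDl";

        for (String s : new String[] {first, second}) {
            System.out.println("length=" + s.length()
                    + " codePointCount=" + s.codePointCount(0, s.length())
                    + " codePoints=" + describe(s));
        }
    }
}
```

The first string reports 4 code units and 3 code points, the second reports 4 and 4, because a combining mark is a code point of its own.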

BreakIterator finds boundaries in text; the two code points that combine here don't have a boundary between them in that sense. From the documentation:

Character boundary analysis allows users to interact with characters as they expect to, for example, when moving the cursor through a text string. Character boundary analysis provides correct navigation through character strings, regardless of how the character is stored.

So I think everything is working correctly here.
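The question does not show the code that produced its output, so the following is only a sketch of how character boundaries can be walked with BreakIterator. The class GraphemeSplit and the method split are hypothetical names, and Locale.ROOT is an assumption made to keep the result independent of the default locale.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class GraphemeSplit {
    // Splits a string at the character boundaries reported by BreakIterator.
    static List<String> split(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        List<String> pieces = new ArrayList<>();
        int start = it.first();
        // next() returns BreakIterator.DONE once the end of the text is passed.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            pieces.add(s.substring(start, end));
        }
        return pieces;
    }

    public static void main(String[] args) {
        // G + combining low line stays in one piece, as in the question's output.
        System.out.println(split("G\u0332\u00EDl").size());
        // The surrogate pair for U+1D4DE also stays in one piece.
        System.out.println(split("a\uD835\uDCDEb").size());
    }
}
```

Counting the pieces gives the number of user-perceived characters, which is the 3 the question expected for both strings.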

A codepoint corresponds to one Unicode character.

Java represents Unicode in UTF-16, i.e., in 16-bit units. Characters with codepoint values larger than U+FFFF are represented by a pair of 'surrogate characters', as in your first example. That is why the first string has a length of 4 but a codepoint count of 3.
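As an illustration of the surrogate mechanism, this small sketch converts a codepoint to its UTF-16 code units with Character.toChars and prints them in hex. The class SurrogateDemo and the helper utf16Hex are names invented for this example.

```java
class SurrogateDemo {
    // Returns the UTF-16 code units of one code point as lowercase hex.
    static String utf16Hex(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("%04x", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Below U+10000: one code unit.
        System.out.println(utf16Hex('a'));                  // 0061
        // Above U+FFFF: a high surrogate followed by a low surrogate.
        System.out.println(utf16Hex(0x1D4DE));              // d835dcde
        System.out.println(Character.charCount(0x1D4DE));   // 2
        System.out.println(Character.isSupplementaryCodePoint(0x1D4DE)); // true
    }
}
```

The value d835dcde matches the hex shown for the middle character in the question's first output.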

In the second case, "G̲" is not a single Unicode character. It is one character, LATIN CAPITAL LETTER G, followed by another character, COMBINING LOW LINE. That is two codepoints by definition, which is why the second result is 4.

In general, Unicode defines character properties, such as the general category, and from them you can find out that one of your codepoints is a combining character.

Take a look at the Character class. Character.getType(int codePoint) returns the general category of a codepoint, which tells you whether it is a combining mark or a surrogate.
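A rough sketch of that idea follows. The class MarkScan and its methods are hypothetical, and treating the three mark categories as "combining" is a simplifying assumption: it is not full grapheme cluster segmentation, so it will miscount things like emoji sequences joined with U+200D.

```java
class MarkScan {
    // True for the three "mark" general categories (Mn, Mc, Me).
    static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    // True if any code point lies above U+FFFF and so needs a surrogate pair.
    static boolean hasSupplementary(String s) {
        return s.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    static boolean hasCombiningMark(String s) {
        return s.codePoints().anyMatch(MarkScan::isCombiningMark);
    }

    // Counts code points that are not combining marks (an approximation only).
    static long countBaseCodePoints(String s) {
        return s.codePoints().filter(cp -> !isCombiningMark(cp)).count();
    }

    public static void main(String[] args) {
        String second = "G\u0332\u00EDl";
        System.out.println(hasSupplementary(second));     // false
        System.out.println(hasCombiningMark(second));     // true
        System.out.println(countBaseCodePoints(second));  // 3
    }
}
```

Note that iterating with codePoints() only reports the SURROGATE category for an unpaired surrogate; a well-formed pair arrives as one supplementary codepoint, which is why the sketch checks Character.isSupplementaryCodePoint instead.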

