How to generate Unicode “Immutable Identifiers” chars in Java?

I am trying to validate whether a dependency can handle a specific set of Unicode characters called Immutable Identifiers: http://www.unicode.org/reports/tr31/#Immutable_Identifier_Syntax

The definition of "Immutable Identifier" characters is:

Immutable Identifiers: To meet this requirement, an implementation shall define identifiers to be any non-empty string of characters that contains no character having any of the following property values:

Pattern_White_Space=True
Pattern_Syntax=True
General_Category=Private_Use, Surrogate, or Control
Noncharacter_Code_Point=True

I was able to find the Surrogate, PRIVATE_USE and Control characters in https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html but could not find the rest. The Unicode document is also rather complex for me, so I failed to work out the code point ranges for these "immutable identifier" characters. Can anyone with some context shed some light?

Start with the javadoc of Pattern, especially the table of (Unicode) character classes. It also contains links to the Unicode references.

"\\p{Space}"   // Whitespace
"\\p{Punct}"   // Punctuation
"\\p{M}"       // Combining marks (diacritics, zero-width accents)

And more.
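A minimal sketch of how these classes behave in java.util.regex. The class name UnicodeClassDemo and the helper find are made up for illustration. Note that the POSIX-style classes such as \p{Space} and \p{Punct} are ASCII-only by default and only follow the Unicode properties when Pattern.UNICODE_CHARACTER_CLASS is set. They are also not the same sets as Pattern_White_Space and Pattern_Syntax from the question, so treat this as a starting point only.

```java
import java.util.regex.Pattern;

class UnicodeClassDemo {
    // Hypothetical helper: true if the regex matches anywhere in the input.
    static boolean find(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        int unicode = Pattern.UNICODE_CHARACTER_CLASS;

        // Plain ASCII space, default (ASCII-only) mode.
        System.out.println(find("\\p{Space}", 0, "a b"));
        // U+00A0 NO-BREAK SPACE, default mode versus Unicode mode.
        System.out.println(find("\\p{Space}", 0, "a\u00A0b"));
        System.out.println(find("\\p{Space}", unicode, "a\u00A0b"));
        // ASCII punctuation.
        System.out.println(find("\\p{Punct}", 0, "a.b"));
        // "e" followed by U+0301 COMBINING ACUTE ACCENT (general category M).
        System.out.println(find("\\p{M}", 0, "e\u0301"));
    }
}
```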

Furthermore, you might want to normalize the identifier. "é" can be written as one Unicode code point, or as two code points: a Latin e followed by a zero-width combining accent. java.text.Normalizer can convert between the two. The composed form (NFC, one code point) seems best.
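As a small illustration of the normalization step, here is a sketch using java.text.Normalizer with the NFC form. The class name NormalizeDemo and the helper toNfc are made up for this example.

```java
import java.text.Normalizer;

class NormalizeDemo {
    // Hypothetical helper: compose the input into NFC.
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points.
        String decomposed = "e\u0301";
        String composed = toNfc(decomposed);

        System.out.println(decomposed.length() + " -> " + composed.length());
        // True if the composed result is the single code point U+00E9.
        System.out.println(composed.equals("\u00E9"));
    }
}
```

Normalizing before validation means that the same visible identifier is always checked in the same form.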


Please take a look at the UAX.

"\\p{Pattern_Syntax}"

Not sure, but the Pattern_Syntax characters probably include []?+*. so I would think the punctuation class would cover them too.
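Putting the four exclusion rules from the question together, a rough sketch could look like the following. Everything named here (ImmutableIdentifierCheck, ASCII_PATTERN_SYNTAX and the helper methods) is made up for illustration. The Pattern_White_Space set is small and fixed, the noncharacters are U+FDD0..U+FDEF plus the last two code points of every plane, and the three general categories come from Character.getType. As far as I know the JDK does not expose Pattern_Syntax directly, so the sketch takes it as a pluggable predicate and ships only an ASCII-only placeholder. The full set would have to come from the Unicode PropList.txt data or from a library such as ICU4J (UProperty.PATTERN_SYNTAX).

```java
import java.util.function.IntPredicate;

class ImmutableIdentifierCheck {
    // ASSUMPTION: placeholder covering only the ASCII part of Pattern_Syntax.
    // It is incomplete on purpose; plug in the full property for real use.
    static final IntPredicate ASCII_PATTERN_SYNTAX = cp ->
            (cp >= 0x21 && cp <= 0x2F) || (cp >= 0x3A && cp <= 0x40)
         || (cp >= 0x5B && cp <= 0x5E) || cp == 0x60
         || (cp >= 0x7B && cp <= 0x7E);

    // Pattern_White_Space=True: a small, fixed set of code points.
    static boolean isPatternWhiteSpace(int cp) {
        return (cp >= 0x0009 && cp <= 0x000D) || cp == 0x0020 || cp == 0x0085
            || cp == 0x200E || cp == 0x200F || cp == 0x2028 || cp == 0x2029;
    }

    // Noncharacter_Code_Point=True: U+FDD0..U+FDEF and U+xxFFFE / U+xxFFFF.
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // General_Category=Private_Use, Surrogate, or Control.
    static boolean hasExcludedCategory(int cp) {
        int type = Character.getType(cp);
        return type == Character.PRIVATE_USE
            || type == Character.SURROGATE
            || type == Character.CONTROL;
    }

    // True if s is non-empty and contains none of the excluded code points.
    static boolean isImmutableIdentifier(String s, IntPredicate isPatternSyntax) {
        if (s.isEmpty()) {
            return false;
        }
        return s.codePoints().noneMatch(cp ->
                isPatternWhiteSpace(cp)
             || isPatternSyntax.test(cp)
             || hasExcludedCategory(cp)
             || isNoncharacter(cp));
    }

    public static void main(String[] args) {
        System.out.println(isImmutableIdentifier("a_b", ASCII_PATTERN_SYNTAX));
        System.out.println(isImmutableIdentifier("a+b", ASCII_PATTERN_SYNTAX));
        System.out.println(isImmutableIdentifier("a b", ASCII_PATTERN_SYNTAX));
    }
}
```

To generate test characters rather than validate them, the same predicates can be run over every code point from 0 to Character.MAX_CODE_POINT and the ones that pass collected.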
