I am trying to validate whether a dependency can handle a specific set of Unicode characters, those allowed in an "Immutable Identifier": http://www.unicode.org/reports/tr31/#Immutable_Identifier_Syntax
The definition of "Immutable Identifier" characters is:
Immutable Identifiers: To meet this requirement, an implementation shall define identifiers to be any non-empty string of characters that contains no character having any of the following property values:
Pattern_White_Space=True
Pattern_Syntax=True
General_Category=Private_Use, Surrogate, or Control
Noncharacter_Code_Point=True
I was able to find the SURROGATE, PRIVATE_USE, and CONTROL categories in https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html, but I could not find the rest. The Unicode document is also fairly complex for me, so I failed to work out the code point ranges for these "immutable identifier" characters. Can anyone with some context shed some light?
Start with the javadoc of Pattern (java.util.regex.Pattern), especially the table of (Unicode) character classes. It also contains links to the relevant Unicode references.
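"\\p{Space}" // Whitespace (ASCII only unless Pattern.UNICODE_CHARACTER_CLASS is set)
"\\p{Punct}" // Punctuation (ASCII only unless Pattern.UNICODE_CHARACTER_CLASS is set)
"\\p{M}" // Combining marks: diacritics, zero-width accents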
And more.
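As a rough illustration of the definition quoted in the question, the following sketch checks three of the four exclusions using only java.lang.Character. The class and method names are hypothetical. The Pattern_White_Space list and the noncharacter ranges are hardcoded from the Unicode property definitions rather than taken from a Java API. Pattern_Syntax is deliberately not checked here: it is a long list that would have to be loaded from the Unicode data (PropList.txt), so this is only a partial test.

```java
final class ImmutableIdSketch {

    // Pattern_White_Space=True: a small, fixed set of 11 code points.
    static boolean isPatternWhiteSpace(int cp) {
        switch (cp) {
            case 0x0009: case 0x000A: case 0x000B: case 0x000C: case 0x000D:
            case 0x0020: case 0x0085: case 0x200E: case 0x200F:
            case 0x2028: case 0x2029:
                return true;
            default:
                return false;
        }
    }

    // Noncharacter_Code_Point=True: U+FDD0..U+FDEF plus the last two
    // code points of every plane (U+xxFFFE and U+xxFFFF).
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // General_Category=Private_Use, Surrogate, or Control.
    static boolean isExcludedCategory(int cp) {
        int type = Character.getType(cp);
        return type == Character.PRIVATE_USE
            || type == Character.SURROGATE
            || type == Character.CONTROL;
    }

    // Partial check only: Pattern_Syntax is NOT tested (see the text above).
    static boolean passesPartialCheck(String s) {
        if (s.isEmpty()) {
            return false; // identifiers must be non-empty
        }
        return s.codePoints().noneMatch(cp ->
            isPatternWhiteSpace(cp) || isNoncharacter(cp) || isExcludedCategory(cp));
    }
}
```

A string such as "a+b" passes this partial check even though "+" has Pattern_Syntax=True, so a complete validator still needs that property's code point list.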
Furthermore, you might want to normalize the identifier. "é" can be written as one Unicode code point or as two: a Latin e followed by a zero-width (combining) accent. java.text.Normalizer can convert between the two forms. The composed form (one code point, NFC) seems best.
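A minimal sketch of that normalization step, assuming NFC is the form you want. The class and method names are hypothetical; Normalizer.normalize and Normalizer.Form.NFC are the standard java.text API.

```java
import java.text.Normalizer;

final class NormalizeSketch {

    // Compose base letter + combining mark into a single code point where possible.
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }
}
```

For example, toNfc("e\u0301") returns the single code point U+00E9, so two spellings of the same identifier compare equal after normalization.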
Please take a look at the UAX.
"\\p{Pattern_Syntax}"
I am not sure, but the Pattern_Syntax characters probably include []?+*., so I would think the punctuation class would cover them too.