Java support for non-BMP Unicode characters (i.e. code points > 0xFFFF) in its regular expression library?

I'm currently using Java 6 (moving to Java 7 is not an option) and I'm trying to use the java.util.regex package to do pattern matching on strings that contain Unicode characters.

I know that java.lang.String has supported supplementary characters (i.e. characters with code points > 0xFFFF) since Java 5, but I don't see a simple way to do pattern matching with these characters. java.util.regex.Pattern still only allows hexadecimal escapes with four digits (e.g. \uFFFF).

Does anyone know if I'm missing an API here?

I've never done pattern matching with supplementary characters, but I think it's as simple as encoding them (in patterns and strings) as two 16-bit numbers (a UTF-16 surrogate pair): \unnnn\ummmm. java.util.regex should be clever enough to interpret those two numbers (Java chars) as a single character in patterns and strings (though Java will still see them as two chars, as elements of the string).
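As a rough sketch of how the two escapes for a given code point can be obtained, Character.toChars (available since Java 5) returns the UTF-16 code units. The class and method names below are made up for illustration:

```java
class SurrogatePairDemo {
    // Hypothetical helper: formats each UTF-16 code unit of a code point
    // as a four-digit hexadecimal escape.
    static String toJavaEscapes(int codePoint) {
        StringBuilder sb = new StringBuilder();
        // Character.toChars returns one char for BMP code points
        // and a high/low surrogate pair for supplementary ones.
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toJavaEscapes(0x20000)); // supplementary: two escapes
        System.out.println(toJavaEscapes(0x41));    // BMP: one escape
    }
}
```

For U+20000 this yields the pair D840 DC00, which is the pair used in the example further down.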

Two links:

Java Unicode encoding

http://java.sun.com/developer/technicalArticles/Intl/Supplementary/

From the last link (referring to Java 5):

The java.util.regex package has been updated so that both pattern strings and target strings can contain supplementary characters, which will be handled as complete units.

Note also that, if you are using UTF-8 as the encoding for your source files, you can also write the characters directly (see the section "Representing Supplementary Characters in Source Files" in the last link).

For example:

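    // One supplementary character (U+20000), written as a surrogate pair, repeated twice
    String pat1 = ".*\uD840\uDC00{2}.*";
    String s1  = "HI \uD840\uDC00\uD840\uDC00 BYE";
    System.out.println(s1.matches(pat1) + " len=" + s1.length());

    // Two separate BMP characters ('@' and 'A'); {2} binds only to 'A'
    String pat2 = ".*\u0040\u0041{2}.*";
    String s2 = "HI \u0040\u0041\u0040\u0041 BYE";
    System.out.println(s2.matches(pat2) + " len=" + s2.length());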

This, compiled with Java 6, prints

    true len=11
    false len=11

which agrees with the above. In the first case, we have a single code point, represented as a pair of surrogate Java chars (two 16-bit chars, one supplementary Unicode character), and the {2} quantifier applies to the pair (= code point). In the second, we have two distinct BMP characters, and the quantifier applies only to the last one; hence, no match.

Notice, however, that the string length is the same (because Java measures string length in Java chars, not Unicode code points).
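If you need the number of code points rather than chars, String.codePointCount (also available since Java 5) gives it. A small sketch using the two strings from the example; the class and helper names are only illustrative:

```java
class LengthDemo {
    // Number of UTF-16 code units (what String.length() reports).
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s1 = "HI \uD840\uDC00\uD840\uDC00 BYE"; // two supplementary characters
        String s2 = "HI \u0040\u0041\u0040\u0041 BYE"; // four BMP characters
        System.out.println(codeUnits(s1) + " chars, " + codePoints(s1) + " code points");
        System.out.println(codeUnits(s2) + " chars, " + codePoints(s2) + " code points");
    }
}
```

The first string has 11 chars but 9 code points; the second has 11 of each.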

The easiest solution is to use a UTF-8 encoding for your source code. Then just put the characters in directly. You should never ever ever have to specify separate code units in any program.

There is still an issue with character classes, however, because Java's lamely exposed UTF-16 internal encoding messes them up. You can't use full Unicode until JDK7, and even then you will have to specify logical code points using an indirect \x{HHHHH} notation. You still won't be able to have any literal code point in a charclass, but you can dodge it with \x{H..H}.
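For readers who can use JDK7 or later, here is a minimal sketch of the \x{H..H} notation inside a character class. The range U+20000 to U+2A6DF (CJK Extension B) and the names below are chosen purely for illustration; this will not compile as a pattern on Java 6:

```java
import java.util.regex.Pattern;

class CodePointClassDemo {
    // Assumed example range: U+20000..U+2A6DF, written with the
    // \x{H..H} code point notation (requires JDK7 or later).
    static final Pattern EXT_B = Pattern.compile("[\\x{20000}-\\x{2A6DF}]+");

    // True if the whole string consists of code points in the range.
    static boolean isAllInRange(String s) {
        return EXT_B.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Build the test string from code points, so no surrogate
        // escapes or non-ASCII literals are needed in the source.
        String two = new String(Character.toChars(0x20000))
                   + new String(Character.toChars(0x20001));
        System.out.println(isAllInRange(two));  // both code points are in range
        System.out.println(isAllInRange("AB")); // BMP letters are not
    }
}
```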

Imperfect, but it's a lot better than it was. UTF-16 is always a compromise. Systems that use UTF-8 or UTF-32 internally don't have these restrictions. They also never make you specify code units that aren't identical to code points.
