Some punctuation characters are not matched with Pattern.UNICODE_CHARACTER_CLASS flag enabled

Question

I have an issue matching some punctuation characters when the Pattern.UNICODE_CHARACTER_CLASS flag is enabled.

Sample code is as follows:

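import java.util.regex.Matcher;
import java.util.regex.Pattern;

// \p{Punct} compiled with the Unicode flag, tested against a single '+'
final Pattern p = Pattern.compile("\\p{Punct}", Pattern.UNICODE_CHARACTER_CLASS);
final Matcher matcher = p.matcher("+");
System.out.println(matcher.find());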

The output is false, although the documentation explicitly states that \p{Punct} includes the characters !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

The '+' sign is not the only one affected; the same problem occurs for all of the following characters: $+<=>^`|~

When Pattern.UNICODE_CHARACTER_CLASS is removed, it works fine.

I would appreciate any hints on this problem.
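
The behaviour described above can be checked for every printable ASCII character at once. The following is a minimal sketch, not part of the original question; the class name PunctFlagComparison and the method lostUnderUnicodeFlag are hypothetical names chosen for illustration. It collects the characters that \p{Punct} matches by default but stops matching once the flag is set.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
class PunctFlagComparison {

    // Returns the printable ASCII characters that \p{Punct} matches in the
    // default (US-ASCII) mode but not with UNICODE_CHARACTER_CLASS enabled.
    static String lostUnderUnicodeFlag() {
        Pattern asciiPunct = Pattern.compile("\\p{Punct}");
        Pattern unicodePunct =
                Pattern.compile("\\p{Punct}", Pattern.UNICODE_CHARACTER_CLASS);

        StringBuilder lost = new StringBuilder();
        for (char c = 0x21; c <= 0x7E; c++) {
            String s = String.valueOf(c);
            if (asciiPunct.matcher(s).matches() && !unicodePunct.matcher(s).matches()) {
                lost.append(c);
            }
        }
        return lost.toString();
    }

    public static void main(String[] args) {
        System.out.println(lostUnderUnicodeFlag());
    }
}
```

If the flag behaves as described in the question, this should print the nine characters $+<=>^`|~ and nothing else.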

Answer 1

From the documentation:

When this flag is specified then the (US-ASCII only) Predefined character classes and POSIX character classes are in conformance with Unicode Technical Standard #18: Unicode Regular Expression Annex C: Compatibility Properties.

If you take a look at the General Category Property section of UTS #18 (Unicode Technical Standard #18), you'll see that its table distinguishes between symbols (S and its sub-categories) and punctuation (P and its sub-categories).

Quoting:

The most basic overall character property is the General Category, which is a basic categorization of Unicode characters into: Letters, Punctuation, Symbols, Marks, Numbers, Separators, and Other.

If you try your example with \\p{S}, with the flag on, it will match.

My guess is that + is not listed under punctuation as an arbitrary (yet semantically appropriate) choice, i.e. punctuation and symbols are literally different categories.
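
The punctuation/symbol split can also be inspected directly with Character.getType, which reports a character's Unicode general category. This is a minimal sketch; the class GeneralCategoryDemo and the method coarseCategory are hypothetical names introduced for illustration, and the three labels it returns are a simplification of the full category list.

```java
// Hypothetical helper class, named here only for illustration.
class GeneralCategoryDemo {

    // Collapses the general category reported by Character.getType into
    // "Symbol" (S*), "Punctuation" (P*) or "Other" (everything else).
    static String coarseCategory(char c) {
        switch (Character.getType(c)) {
            case Character.MATH_SYMBOL:
            case Character.CURRENCY_SYMBOL:
            case Character.MODIFIER_SYMBOL:
            case Character.OTHER_SYMBOL:
                return "Symbol";
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return "Punctuation";
            default:
                return "Other";
        }
    }

    public static void main(String[] args) {
        // The nine characters from the question, plus '!' for contrast.
        for (char c : "$+<=>^`|~!".toCharArray()) {
            System.out.println(c + " -> " + coarseCategory(c));
        }
    }
}
```

The characters from the question should all come out as "Symbol", while '!' comes out as "Punctuation", which is consistent with \p{S} matching '+' when the flag is on.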

Answer 2

The Javadoc lists what comes under \p{Punct} with the caveat that the list applies to
POSIX character classes (US-ASCII only)

If you take a look at the punctuation characters in Unicode, there is no + or $. See, for example, the Po ("other punctuation") category list at http://www.fileformat.info/info/unicode/category/Po/list.htm .
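
If the flag has to stay on and the full ASCII punctuation set is still needed, two possible workarounds are sketched below. Neither comes from the answers above, and the class PunctWorkaround, its constants and the method matchesEvery are hypothetical names for illustration. The first pattern combines the punctuation and symbol categories, so it also accepts non-ASCII punctuation and symbols; the second spells out the ASCII ranges explicitly, so it stays ASCII-only regardless of the flag.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
class PunctWorkaround {

    // The 32 characters the Javadoc lists for \p{Punct} in US-ASCII mode.
    static final String ASCII_PUNCT = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~";

    // Option 1: any character whose general category is P* or S*.
    static final Pattern UNICODE_PUNCT_OR_SYMBOL =
            Pattern.compile("[\\p{P}\\p{S}]", Pattern.UNICODE_CHARACTER_CLASS);

    // Option 2: explicit ASCII ranges; these are not affected by the flag.
    static final Pattern ASCII_ONLY =
            Pattern.compile("[!-/:-@\\[-`{-~]", Pattern.UNICODE_CHARACTER_CLASS);

    // True if the pattern matches each character of 'chars' on its own.
    static boolean matchesEvery(Pattern p, String chars) {
        for (char c : chars.toCharArray()) {
            if (!p.matcher(String.valueOf(c)).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matchesEvery(UNICODE_PUNCT_OR_SYMBOL, ASCII_PUNCT));
        System.out.println(matchesEvery(ASCII_ONLY, ASCII_PUNCT));
        // The euro sign (U+20AC) is a currency symbol outside ASCII.
        System.out.println(UNICODE_PUNCT_OR_SYMBOL.matcher("\u20AC").matches());
        System.out.println(ASCII_ONLY.matcher("\u20AC").matches());
    }
}
```

Which option fits depends on whether non-ASCII punctuation and symbols should count as matches; the explicit ranges reproduce the documented US-ASCII list, while the category-based class follows the Unicode definitions.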

2 answers

solution1
4 2015-08-18 09:35:46

solution2
4 ACCEPTED 2015-08-18 09:45:49