
The regex (?U)\p{Punct} misses some Unicode punctuation in Java

First of all, I want to remove all punctuation from a String. I wrote the following code:

Pattern pattern = Pattern.compile("\\p{Punct}");
// the input ends with "hello" wrapped in fullwidth parentheses (U+FF08, U+FF09)
Matcher matcher = pattern.matcher("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~（hello）");
if (matcher.find())
    System.out.println(matcher.replaceAll(""));

After the replacement I got the output: （hello）

So the pattern matches exactly the characters !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ which agrees with the official docs: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html

But I also want to remove "（" (Fullwidth Left Parenthesis, U+FF08) and "）" (Fullwidth Right Parenthesis, U+FF09), so I changed my code to this:

Pattern pattern = Pattern.compile("(?U)\\p{Punct}");
// the input ends with the fullwidth parentheses U+FF08 and U+FF09
Matcher matcher = pattern.matcher("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~（）");
if (matcher.find())
    System.out.println(matcher.replaceAll(""));

After the replacement I got the output: $+<=>^`|~

The matcher does match the Fullwidth Left Parenthesis (U+FF08) and the Fullwidth Right Parenthesis (U+FF09).

But it misses $+<=>^`|~

I am confused about why this happens. Can anyone help? Thanks in advance!

Unicode (when you use (?U)) and POSIX (when you don't use (?U)) disagree on what counts as punctuation.

When you don't use (?U), \p{Punct} matches the POSIX punctuation character class, which is just

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

When you use (?U), \p{Punct} matches the Unicode Punctuation category, which does not include some of the characters in the above list, namely:

$+<=>^`|~

For example, the Unicode general category of $ is "Symbol, Currency" (Sc), which is a symbol category rather than a punctuation category.
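To see which ASCII characters fall outside the Unicode punctuation categories, you can check each one with Character.getType. The following is only a sketch: the class name PunctCategories and the helpers isUnicodePunctuation and nonPunctuation are made up for illustration and are not part of any library.

```java
class PunctCategories {
    // True when the code point's general category is one of the
    // Unicode punctuation categories (Pc, Pd, Ps, Pe, Pi, Pf, Po).
    static boolean isUnicodePunctuation(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.CONNECTOR_PUNCTUATION:     // Pc
            case Character.DASH_PUNCTUATION:          // Pd
            case Character.START_PUNCTUATION:         // Ps
            case Character.END_PUNCTUATION:           // Pe
            case Character.INITIAL_QUOTE_PUNCTUATION: // Pi
            case Character.FINAL_QUOTE_PUNCTUATION:   // Pf
            case Character.OTHER_PUNCTUATION:         // Po
                return true;
            default:
                return false;
        }
    }

    // Keeps only the characters of s that are NOT Unicode punctuation.
    static String nonPunctuation(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints()
         .filter(cp -> !isUnicodePunctuation(cp))
         .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        // The full POSIX punctuation list from the Pattern documentation.
        String posixPunct = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~";
        // Prints the characters that POSIX counts as punctuation but Unicode does not.
        System.out.println(nonPunctuation(posixPunct));
    }
}
```

The characters that survive are the ones whose category is a symbol category (Sc, Sm or Sk), which is why (?U)\p{Punct} skips them.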

If you want to match $+<=>^`|~ plus all Unicode punctuation, you can put both in one character class. You can also use the Unicode category "P" directly (\p{P}), rather than turning on Unicode mode with (?U).

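Pattern pattern = Pattern.compile("[\\p{P}$+<=>^`|~]");
// the input ends with the fullwidth parentheses U+FF08 and U+FF09
Matcher matcher = pattern.matcher("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~（）");
// you don't need "find" first; replaceAll scans the whole input itself
System.out.println(matcher.replaceAll(""));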
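For a side-by-side comparison, here is a small self-contained sketch that runs the three patterns discussed above against the same input. The class name StripPunctuation and the helper strip are hypothetical names chosen for this example. The fullwidth parentheses are written as \uFF08 and \uFF09 escapes so that the source file encoding does not matter.

```java
import java.util.regex.Pattern;

class StripPunctuation {
    // POSIX-style class: ASCII punctuation only.
    static final Pattern POSIX = Pattern.compile("\\p{Punct}");
    // Unicode mode: \p{Punct} means the Unicode Punctuation category.
    static final Pattern UNICODE = Pattern.compile("(?U)\\p{Punct}");
    // Unicode category P plus the ASCII symbols that P leaves out.
    static final Pattern COMBINED = Pattern.compile("[\\p{P}$+<=>^`|~]");

    // Removes every match of p from input.
    static String strip(Pattern p, String input) {
        return p.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~\uFF08hello\uFF09";
        System.out.println(strip(POSIX, input));    // fullwidth parentheses remain around hello
        System.out.println(strip(UNICODE, input));  // $+<=>^`|~hello
        System.out.println(strip(COMBINED, input)); // hello
    }
}
```

If you would rather not list the leftover ASCII symbols by hand, a broader class such as [\p{P}\p{S}] would also cover them, but it removes every Unicode symbol as well (other currency signs, math operators and so on), so whether that fits depends on your data.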
