
Java regex for special character

I have a simple method for extracting #hashtags from text:

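import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Collects every substring that looks like "#word" from str.
private String[] buildHashTag(String str) {
    ArrayList<String> allMatches = new ArrayList<String>();
    Matcher m = Pattern.compile("(#\\w+)\\b").matcher(str);
    while (m.find()) {
        allMatches.add(m.group());
    }
    return allMatches.toArray(new String[0]);
}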

The problem appears when I pass a string containing a special (non-ASCII) character, for example "POMERANČ".

Test input:

#Orange in Czech language mean #pomeranč :-)

Actual output:

[#Orange]

But this is wrong: the output should be [#Orange, #pomeranč]. Can you tell me what is wrong with the code? Thank you.

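Add the Pattern.UNICODE_CHARACTER_CLASS flag, or use its embedded form (?U), as in Pattern.compile("(?U)(#\\w+)\\b"). Without it, \w matches only ASCII word characters ([a-zA-Z_0-9]), and \b and \w are not consistently Unicode-aware, so a tag containing č is not matched as a whole.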

When this flag is specified then the (US-ASCII only) Predefined character classes and POSIX character classes are in conformance with Unicode Technical Standard #18: Unicode Regular Expression Annex C: Compatibility Properties.

Here is a demo:

String str = "#Orange in Czech language mean #pomeranč :-)";
ArrayList<String> allMatches = new ArrayList<String>();
Matcher m = Pattern.compile("(?U)(#\\w+)\\b").matcher(str);
//                           ^^^^
while (m.find()) {
    allMatches.add(m.group());
}
System.out.println(Arrays.toString(allMatches.toArray()));

Output: [#Orange, #pomeranč]
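The same fix can also be written with the flag constant rather than the inline (?U). The following is only a sketch: the class name HashTagFlagDemo, the method name extractHashTags, and the constant HASH_TAG are made up for illustration and are not from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for illustration only.
class HashTagFlagDemo {
    // Compiled once and reused. UNICODE_CHARACTER_CLASS is the flag form of (?U):
    // it makes \w and \b follow the Unicode definitions instead of US-ASCII.
    private static final Pattern HASH_TAG =
            Pattern.compile("(#\\w+)\\b", Pattern.UNICODE_CHARACTER_CLASS);

    static String[] extractHashTags(String text) {
        List<String> allMatches = new ArrayList<>();
        Matcher m = HASH_TAG.matcher(text);
        while (m.find()) {
            allMatches.add(m.group());
        }
        return allMatches.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // U+010D (c with caron) is written as an escape so that the
        // source file encoding does not matter when compiling.
        String str = "#Orange in Czech language mean #pomeran\u010d :-)";
        System.out.println(Arrays.toString(extractHashTags(str)));
    }
}
```

Compiling the pattern once into a static field avoids recompiling it on every call, which may matter if the method is called often.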

Use a negated character class instead:

/#[^ ]+/
  • [^ ]+ is a negated character class: it matches any character other than a space, so in effect it matches everything up to the next space.
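In Java the slashes are dropped and the pattern is passed as a plain string. A minimal sketch, assuming the same test sentence as in the question; the class name HashTagNegatedClassDemo and the method extractHashTags are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for illustration only.
class HashTagNegatedClassDemo {
    // '#' followed by one or more characters that are not a space.
    // No \w or \b is involved, so no Unicode flag is needed.
    private static final Pattern HASH_TAG = Pattern.compile("#[^ ]+");

    static List<String> extractHashTags(String text) {
        List<String> allMatches = new ArrayList<>();
        Matcher m = HASH_TAG.matcher(text);
        while (m.find()) {
            allMatches.add(m.group());
        }
        return allMatches;
    }

    public static void main(String[] args) {
        // U+010D (c with caron) is written as an escape so that the
        // source file encoding does not matter when compiling.
        System.out.println(extractHashTags("#Orange in Czech language mean #pomeran\u010d :-)"));
        // Trailing punctuation is kept, because only a space ends the match.
        System.out.println(extractHashTags("see #tag, ok")); // prints [#tag,]
    }
}
```

Note the trade-off: because only a literal space ends the match, punctuation directly after a tag becomes part of it, and tabs or newlines do not terminate a tag.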

