I have a simple method for extracting #hashtags from text:
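import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

private String[] buildHashTag(String str) {
    ArrayList<String> allMatches = new ArrayList<String>();
    // Collect every '#' followed by one or more word characters
    Matcher m = Pattern.compile("(#\\w+)\\b").matcher(str);
    while (m.find()) {
        allMatches.add(m.group());
    }
    return allMatches.toArray(new String[0]);
}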
The problem appears when I pass a string containing a special character, for example "POMERANČ".
Test input:
#Orange in Czech language mean #pomeranč :-)
Actual output:
[#Orange]
This is wrong; the output should be [#Orange, #pomeranč]. Can you tell me what is wrong with the code? Thank you.
Add the Pattern.UNICODE_CHARACTER_CLASS flag, or use its embedded form (?U) as in Pattern.compile("(?U)(#\\w+)\\b"). Otherwise, \b and \w do not match all Unicode characters; by default \w covers only [a-zA-Z_0-9].
The Javadoc for this flag states: "When this flag is specified then the (US-ASCII only) Predefined character classes and POSIX character classes are in conformance with Unicode Technical Standard #18: Unicode Regular Expression Annex C: Compatibility Properties."
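Here is a demo:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str = "#Orange in Czech language mean #pomeranč :-)";
ArrayList<String> allMatches = new ArrayList<String>();
// The leading (?U) enables UNICODE_CHARACTER_CLASS for the whole pattern
Matcher m = Pattern.compile("(?U)(#\\w+)\\b").matcher(str);
while (m.find()) {
    allMatches.add(m.group());
}
System.out.println(Arrays.toString(allMatches.toArray()));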
Output: [#Orange, #pomeranč]
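The demo uses the embedded (?U) form; the flag constant mentioned at the top of this answer can be passed to Pattern.compile instead. Below is a minimal sketch of that variant. The class name UnicodeHashTags and the method name extract are made up for illustration and do not come from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
class UnicodeHashTags {
    // Compiled once; the flag makes \w and \b follow the Unicode definitions.
    private static final Pattern HASH_TAG =
            Pattern.compile("#\\w+\\b", Pattern.UNICODE_CHARACTER_CLASS);

    // Returns every hashtag found in the text, in order of appearance.
    static List<String> extract(String text) {
        List<String> tags = new ArrayList<>();
        Matcher m = HASH_TAG.matcher(text);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        // Prints [#Orange, #pomeranč]
        System.out.println(extract("#Orange in Czech language mean #pomeranč :-)"));
    }
}
```

Keeping the compiled Pattern in a static field avoids recompiling the expression on every call, which the original method does.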
Use a negated character class instead:
/#[^ ]+/
[^ ]+ is a negated character class: it matches any character other than a space, so in effect it matches everything up to the next space.
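In Java the slashes are dropped and the pattern is passed as a plain string. The following is a rough sketch of this approach; the class name NegatedClassHashTags and the method name extract are invented for the example. Note that, because only spaces end a match, punctuation directly after a tag is captured as part of it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
class NegatedClassHashTags {
    // '#' followed by one or more characters that are not a space.
    private static final Pattern HASH_TAG = Pattern.compile("#[^ ]+");

    // Returns every hashtag found in the text, in order of appearance.
    static List<String> extract(String text) {
        List<String> tags = new ArrayList<>();
        Matcher m = HASH_TAG.matcher(text);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        // Prints [#Orange, #pomeranč]
        System.out.println(extract("#Orange in Czech language mean #pomeranč :-)"));
        // Prints [#java,] because the comma is not a space
        System.out.println(extract("I like #java, really"));
    }
}
```

This needs no Unicode flag, since the class is defined by what it excludes, but the trailing-punctuation behaviour may or may not be acceptable depending on the input.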