Basically, I need to match words in a string that start with a given character (here, #). The following is an example:
I am trying to match #this_word but ignore the rest.
I also need the regex to match characters from different languages. I tried this:
#\\s*(\\w+)
but it only matches English words.
When I try a regex such as the following:
#(?>\\p{L}\\p{M}*+)+
I get an IndexOutOfBoundsException.
Apparently, I got that error because I wrote:
matcher.group(1);
instead of:
matcher.group(0);
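The group-index issue above can be sketched as follows. This is a minimal illustration, not the asker's actual code: the class name GroupIndexDemo and its helper methods are made up for the example. It relies on the fact that (?>...) is an atomic, non-capturing group, so the second pattern has no group 1 to return.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupIndexDemo {
    // Number of capturing groups in the pattern (group 0 is not counted).
    static int capturingGroups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // Returns the requested group of the first match, or null if nothing matches.
    // Throws IndexOutOfBoundsException when the pattern has no such group.
    static String firstGroup(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        String input = "I am trying to match #this_word but ignore the rest.";
        String withCapture = "#\\s*(\\w+)";          // one capturing group
        String atomicOnly = "#(?>\\p{L}\\p{M}*+)+";  // (?>...) does not capture

        System.out.println(capturingGroups(withCapture));      // 1
        System.out.println(capturingGroups(atomicOnly));       // 0
        System.out.println(firstGroup(withCapture, input, 1)); // this_word
        System.out.println(firstGroup(atomicOnly, input, 0));  // #this
        try {
            firstGroup(atomicOnly, input, 1);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("no group 1 in this pattern");
        }
    }
}
```

Note that the atomic pattern stops at the underscore, because _ is not a letter.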
If you do not mind matching digits as well, just add the embedded (?U) flag at the start of the pattern:
UNICODE_CHARACTER_CLASS
public static final int UNICODE_CHARACTER_CLASS
Enables the Unicode version of Predefined character classes and POSIX character classes.
When this flag is specified then the (US-ASCII only) Predefined character classes and POSIX character classes are in conformance with Unicode Technical Standard #18: Unicode Regular Expression Annex C: Compatibility Properties. The UNICODE_CHARACTER_CLASS mode can also be enabled via the embedded flag expression (?U). The flag implies UNICODE_CASE, that is, it enables Unicode-aware case folding.
Regex:
Pattern ptrn = Pattern.compile("(?U)#\\w+");
See IDEONE demo
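To make the effect of the flag concrete, here is a small sketch that runs the same \\w+ pattern with and without (?U). The class name UnicodeFlagDemo, the findAll helper, and the sample sentence are illustrative assumptions, not part of the linked demo.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeFlagDemo {
    // Collects every full match of the regex in the input.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String input = "I am trying to match #эту_строку and #this_word.";

        // Without the flag, \w is ASCII-only: [a-zA-Z_0-9].
        System.out.println(findAll("#\\w+", input));     // [#this_word]

        // With (?U), \w also covers letters from other scripts.
        System.out.println(findAll("(?U)#\\w+", input)); // [#эту_строку, #this_word]
    }
}
```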
You can actually subtract digits from \\w with [\\w&&[^\\d]] to only match underscores and Unicode letters:
Pattern ptrn = Pattern.compile("#[\\w&&[^\\d]]+", Pattern.UNICODE_CHARACTER_CLASS);
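The digit subtraction can be checked with a short sketch like the one below. The class name DigitSubtractionDemo, the findAll helper, and the sample input are assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DigitSubtractionDemo {
    // Collects every full match of the regex, compiled with the given flags.
    static List<String> findAll(String regex, int flags, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String input = "#слово123 and #tag_2";
        int unicode = Pattern.UNICODE_CHARACTER_CLASS;

        // Plain \w keeps the digits.
        System.out.println(findAll("#\\w+", unicode, input));           // [#слово123, #tag_2]

        // [\w&&[^\d]] is the intersection of \w with "not a digit".
        System.out.println(findAll("#[\\w&&[^\\d]]+", unicode, input)); // [#слово, #tag_]
    }
}
```

The match simply stops at the first digit, so a tag such as #tag_2 is truncated rather than rejected.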
As an alternative, to match any Unicode letter you may use the \\p{L}\\p{M}*+ subpattern (\\p{L} is a base letter and \\p{M} matches diacritics). So, to match only letters after #, you can use #(?>\\p{L}\\p{M}*+)+.
To also match an underscore, add it as an alternative: #(?>\\p{L}\\p{M}*+|_)+.
If you do not care about where the diacritic is, use just a character class: #[\\p{L}\\p{M}_]+.
See this IDEONE demo:
String str = "I am trying to match #эту_строку but ignore the rest.";
Pattern ptrn = Pattern.compile("#(?>\\p{L}\\p{M}*+|_)+");
Matcher matcher = ptrn.matcher(str);
while (matcher.find()) {
    System.out.println(matcher.group(0));
}
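The difference between the atomic-group pattern and the plain character class only shows up when a combining mark is not preceded by a base letter. The sketch below illustrates this with a contrived input where a combining acute accent (U+0301) directly follows the #. The class name MarkPositionDemo and the inputs are assumptions for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MarkPositionDemo {
    static final String STRICT = "#(?>\\p{L}\\p{M}*+|_)+"; // mark must follow a base letter
    static final String LOOSE = "#[\\p{L}\\p{M}_]+";       // mark may appear anywhere

    // Collects every full match of the regex in the input.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Decomposed "é" (e + U+0301): both patterns match the whole tag.
        String decomposed = "#cafe\u0301 x";
        System.out.println(findAll(STRICT, decomposed).equals(findAll(LOOSE, decomposed))); // true

        // A stray mark right after '#': only the character class matches.
        String stray = "#\u0301abc";
        System.out.println(findAll(STRICT, stray).size()); // 0
        System.out.println(findAll(LOOSE, stray).size());  // 1
    }
}
```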
You can use the following code to match runs of Unicode letters (matched by the \\p{L} class):
String ss = "I am trying to match #this_word but ignore the rest.";
Matcher m = Pattern.compile("#(\\p{L})+", Pattern.CASE_INSENSITIVE).matcher(ss);
while (m.find()) {
    System.out.println(m.group());
}
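Two details of this pattern are worth checking before relying on it. The match stops at the underscore, and the repeated capturing group (\\p{L})+ keeps only its last iteration. The sketch below shows both. The class name RepeatedGroupDemo and its helpers are made up for the illustration, and the last pattern, with the underscore added to the class, is a suggested variation rather than part of the answer above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedGroupDemo {
    // Returns the whole first match (group 0), or null if nothing matches.
    static String wholeMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    // Returns group 1 of the first match, or null if nothing matches.
    static String groupOne(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String ss = "I am trying to match #this_word but ignore the rest.";

        System.out.println(wholeMatch("#(\\p{L})+", ss));  // #this
        System.out.println(groupOne("#(\\p{L})+", ss));    // s (last repetition only)

        // Adding '_' to a character class keeps the whole tag.
        System.out.println(wholeMatch("#[\\p{L}_]+", ss)); // #this_word
    }
}
```

The Pattern.CASE_INSENSITIVE flag is left out here because \\p{L} already matches letters of either case.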
Use this pattern:
#[^\\s]+
This might work. It matches # followed by every non-whitespace character up to the next whitespace, so any punctuation attached to the word is included in the match.
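A quick way to see what the non-whitespace pattern picks up is to run it next to a letter-based pattern on a sentence with punctuation. In this sketch the class name NonWhitespaceDemo, the findAll helper, and the sample sentence are assumptions for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NonWhitespaceDemo {
    // Collects every full match of the regex in the input.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String input = "Tags: #эту_строку, #this_word.";

        // Everything up to the next whitespace, punctuation included.
        System.out.println(findAll("#[^\\s]+", input));         // [#эту_строку,, #this_word.]

        // Letters, marks and underscores only.
        System.out.println(findAll("#[\\p{L}\\p{M}_]+", input)); // [#эту_строку, #this_word]
    }
}
```

The non-whitespace pattern is script-agnostic, which is why it handles non-English words, but the trailing comma and period end up in the result.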