
Java regex pattern to match a word in any language that ends with whitespace

I need to match the words in a string that start with a given character (here, #). The following is an example:

I am trying to match #this_word but ignore the rest.

I also need the regex to match characters from different languages. I tried this:

#\\s*(\\w+)

but it only matches English (ASCII) words.

When I try a regex such as the following:

#(?>\\p{L}\\p{M}*+)+

I get an IndexOutOfBoundsException.

Edit

Apparently, I was getting that error because I wrote:

 matcher.group(1);

Instead of:

 matcher.group(0);
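
To illustrate the edit above, here is a minimal sketch (the class name GroupIndexDemo and its helper methods are made up for this example). The atomic group (?>...) is non-capturing, so the pattern defines no group 1; only group 0, the whole match, exists.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupIndexDemo {
    // Pattern from the question: the atomic group (?>...) does not capture.
    static final Pattern TAG = Pattern.compile("#(?>\\p{L}\\p{M}*+)+");

    // Returns the whole match (group 0), or null when nothing matches.
    static String wholeMatch(String input) {
        Matcher m = TAG.matcher(input);
        return m.find() ? m.group(0) : null;
    }

    // Returns true when asking for group 1 fails after a successful find().
    static boolean groupOneThrows(String input) {
        Matcher m = TAG.matcher(input);
        if (!m.find()) {
            return false;
        }
        try {
            m.group(1); // the pattern has no capturing group 1
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String s = "match #word here";
        System.out.println(wholeMatch(s));     // #word
        System.out.println(groupOneThrows(s)); // true
    }
}
```

If group 1 is really wanted, wrapping the part after # in ordinary parentheses would be one way to create it.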

If you do not mind digits being matched as well, just add the (?U) flag at the start of the pattern:

UNICODE_CHARACTER_CLASS
public static final int UNICODE_CHARACTER_CLASS

Enables the Unicode version of Predefined character classes and POSIX character classes.

When this flag is specified then the (US-ASCII only) Predefined character classes and POSIX character classes are in conformance with Unicode Technical Standard #18: Unicode Regular Expression Annex C: Compatibility Properties.

The UNICODE_CHARACTER_CLASS mode can also be enabled via the embedded flag expression (?U).

The flag implies UNICODE_CASE, that is, it enables Unicode-aware case folding.

Regex:

Pattern ptrn = Pattern.compile("(?U)#\\w+");

See IDEONE demo
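
A self-contained sketch of the difference the flag makes. The class name UnicodeWordDemo, the helper findAll, and the sample input are assumptions for illustration; the non-ASCII text is written with Unicode escapes so the source file stays ASCII-only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeWordDemo {
    // Collects every match of regex in input, in order of appearance.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Three hashtags: ASCII, Cyrillic with an underscore, and "cafe" with an accented e.
        String s = "tags: #word "
                + "#\u044d\u0442\u0443_\u0441\u0442\u0440\u043e\u043a\u0443 "
                + "#caf\u00e9";

        // Default mode: the predefined word class only covers [a-zA-Z_0-9].
        System.out.println(findAll("#\\w+", s)); // [#word, #caf]

        // With (?U) the word class is Unicode-aware, so all three hashtags match in full.
        System.out.println(findAll("(?U)#\\w+", s));
    }
}
```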

To exclude digits, you can subtract them from \\w with the class intersection [\\w&&[^\\d]], which leaves essentially Unicode letters (with their combining marks) and underscore-like connector punctuation:

Pattern ptrn = Pattern.compile("#[\\w&&[^\\d]]+", Pattern.UNICODE_CHARACTER_CLASS);

Another demo
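
A short sketch of the digit-free variant in action. The class name NoDigitsDemo, the helper findAll, and the inputs are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NoDigitsDemo {
    // Word characters minus digits, with Unicode-aware predefined classes.
    static final Pattern NO_DIGITS =
            Pattern.compile("#[\\w&&[^\\d]]+", Pattern.UNICODE_CHARACTER_CLASS);

    // Collects every hashtag found by NO_DIGITS.
    static List<String> findAll(String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = NO_DIGITS.matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // The match stops at the first digit; a tag made only of digits is skipped.
        System.out.println(findAll("#abc123 #x_y #42")); // [#abc, #x_y]
    }
}
```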

As an alternative, to match any Unicode letter you may use the \\p{L}\\p{M}*+ subpattern (\\p{L} matches a base letter and \\p{M} matches diacritics). So, to match only letters after #, you can use #(?>\\p{L}\\p{M}*+)+.

To also match an underscore, add it as an alternative: #(?>\\p{L}\\p{M}*+|_)+.

If you do not care where the diacritics appear, just use a character class: #[\\p{L}\\p{M}_]+.

See this IDEONE demo:

String str = "I am trying to match #эту_строку but ignore the rest.";
Pattern ptrn = Pattern.compile("#(?>\\p{L}\\p{M}*+|_)+");
Matcher matcher = ptrn.matcher(str);
while (matcher.find()) {
    System.out.println(matcher.group(0));
}
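
The two variants above (atomic group versus plain character class) only differ when a combining mark is not preceded by a letter. A minimal sketch of that difference, where the class name DiacriticPositionDemo, the constants STRICT and LOOSE, and the inputs are assumptions for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DiacriticPositionDemo {
    // Marks are only accepted directly after a base letter.
    static final Pattern STRICT = Pattern.compile("#(?>\\p{L}\\p{M}*+|_)+");
    // Letters, marks and underscores are accepted in any order.
    static final Pattern LOOSE = Pattern.compile("#[\\p{L}\\p{M}_]+");

    // Returns the first match of p in input, or null when there is none.
    static String firstMatch(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // "cafe" followed by U+0301 (combining acute accent): both patterns match the tag.
        String decomposed = "#cafe\u0301 ok";
        System.out.println(firstMatch(STRICT, decomposed).length()); // 6
        System.out.println(firstMatch(LOOSE, decomposed).length());  // 6

        // A stray combining mark right after '#': only the loose class accepts it.
        String stray = "#\u0301abc";
        System.out.println(firstMatch(STRICT, stray) == null); // true
        System.out.println(firstMatch(LOOSE, stray).length()); // 5
    }
}
```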

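You can use the following code to match runs of Unicode letters (the \\p{L} class). Note that the underscore is not a letter, so for this input the match stops before it:

String ss = "I am trying to match #this_word but ignore the rest.";
// \\p{L} already covers both cases, so no CASE_INSENSITIVE flag is needed.
Matcher m = Pattern.compile("#\\p{L}+").matcher(ss);
while (m.find()) {
    System.out.println(m.group()); // prints "#this": '_' is not in \\p{L}
}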

Use this pattern:

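 #[^\\s]+

This should work: it matches # followed by every non-whitespace character up to the next whitespace (equivalent to #\\S+). Note that it also picks up punctuation attached to the word, such as a trailing comma.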
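
A small sketch of the trade-off with the non-whitespace pattern, compared against the Unicode-aware word class. The class name NonWhitespaceDemo, the helper firstMatch, and the sample sentence are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonWhitespaceDemo {
    // Returns the first match of regex in input, or null when there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String s = "I like #this_word, but ignore the rest.";

        // Everything up to the next whitespace, punctuation included.
        System.out.println(firstMatch("#\\S+", s));     // #this_word,

        // Unicode-aware word characters only, so the comma is left out.
        System.out.println(firstMatch("(?U)#\\w+", s)); // #this_word
    }
}
```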
