Java regex pattern to match strings in all languages that end with a space

Question

Basically, I need to match the words in a string that start with a particular character (# here). For example:

I am trying to match #this_word but ignore the rest.

I also need the regex to match characters from different languages. I tried this:

#\\s*(\\w+)

but it only matches English words.

When I try a regex such as the following:

#(?>\\p{L}\\p{M}*+)+

I get an IndexOutOfBoundsException.

Edit

Apparently, I was getting that error because I wrote:

 matcher.group(1);

instead of:

 matcher.group(0);
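
The difference between the two calls can be illustrated with a minimal sketch. The class and method names below (GroupIndexDemo, wholeMatch, groupOneThrows) are hypothetical and only serve the illustration. The atomic group (?>...) does not capture, so the pattern has no group 1 and only group 0 (the whole match) can be read.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupIndexDemo {
    // Returns the whole match (group 0) of the first hit, or null if nothing matches.
    static String wholeMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group(0) : null;
    }

    // Returns true if reading group 1 throws because the pattern has no such group.
    static boolean groupOneThrows(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        if (!matcher.find()) {
            return false;
        }
        try {
            matcher.group(1);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String input = "I am trying to match #this_word but ignore the rest.";
        // (?>...) is an atomic, non-capturing group: only group 0 exists.
        String atomic = "#(?>\\p{L}\\p{M}*+)+";
        System.out.println(wholeMatch(atomic, input));
        System.out.println(groupOneThrows(atomic, input));
        // (\\w+) is a capturing group, so group 1 is available here.
        System.out.println(groupOneThrows("#\\s*(\\w+)", input));
    }
}
```

Note that the atomic pattern stops at the underscore, because an underscore is not a letter.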

Answer 1

If you do not mind digits also being matched, just add the (?U) flag at the start of the pattern:

UNICODE_CHARACTER_CLASS
public static final int UNICODE_CHARACTER_CLASS

Enables the Unicode version of Predefined character classes and POSIX character classes.

When this flag is specified then the (US-ASCII only) Predefined character classes and POSIX character classes are in conformance with Unicode Technical Standard #18: Unicode Regular Expression Annex C: Compatibility Properties.

The UNICODE_CHARACTER_CLASS mode can also be enabled via the embedded flag expression (?U).

The flag implies UNICODE_CASE, that is, it enables Unicode-aware case folding.

Regex:

Pattern ptrn = Pattern.compile("(?U)#\\w+");

See the IDEONE demo.
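
To make the effect of the flag concrete, here is a small sketch that runs the same \\w pattern with and without (?U) against a Cyrillic word. The class name UnicodeWordDemo and the helper firstMatch are hypothetical names chosen for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeWordDemo {
    // A sentence with a Cyrillic hashtag, written with Unicode escapes
    // so the file compiles regardless of the source encoding.
    static final String SAMPLE =
            "I am trying to match #\u044D\u0442\u0443_\u0441\u0442\u0440\u043E\u043A\u0443 but ignore the rest.";

    // Returns the first match of the regex in the input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // Without the flag, \w is ASCII-only, so nothing after '#' matches.
        System.out.println(firstMatch("#\\w+", SAMPLE));
        // With (?U), \w also covers non-ASCII letters, so the whole tag matches.
        System.out.println(firstMatch("(?U)#\\w+", SAMPLE));
    }
}
```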

You can also subtract digits from \\w with [\\w&&[^\\d]] to leave essentially just underscores and Unicode letters:

Pattern ptrn = Pattern.compile("#[\\w&&[^\\d]]+", Pattern.UNICODE_CHARACTER_CLASS);

Another demo
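
As a rough illustration of the subtraction, the sketch below compares plain \\w with the intersected class on a tag that ends in digits. The class name DigitSubtractionDemo, the helper firstMatch, and the sample input are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DigitSubtractionDemo {
    // Unicode-aware \w: letters, digits, underscore and more.
    static final Pattern WORD =
            Pattern.compile("#\\w+", Pattern.UNICODE_CHARACTER_CLASS);
    // The same class with digits removed via intersection (&&).
    static final Pattern WORD_NO_DIGITS =
            Pattern.compile("#[\\w&&[^\\d]]+", Pattern.UNICODE_CHARACTER_CLASS);

    // A Cyrillic tag followed directly by digits, written with Unicode escapes.
    static final String SAMPLE = "tag #\u0441\u043B\u043E\u0432\u043E2024 rest";

    // Returns the first match of the pattern in the input, or null if there is none.
    static String firstMatch(Pattern pattern, String input) {
        Matcher matcher = pattern.matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // Includes the trailing digits.
        System.out.println(firstMatch(WORD, SAMPLE));
        // Stops before the first digit.
        System.out.println(firstMatch(WORD_NO_DIGITS, SAMPLE));
    }
}
```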

As an alternative, to match any Unicode letter you may use the \\p{L}\\p{M}*+ subpattern (\\p{L} is a base letter and \\p{M} matches diacritics). So, to match only letters after # you can use #(?>\\p{L}\\p{M}*+)+.

To also match an underscore, add it as an alternative: #(?>\\p{L}\\p{M}*+|_)+.

If you do not care where the diacritic is, just use a character class: #[\\p{L}\\p{M}_]+.

See this IDEONE demo:

String str = "I am trying to match #эту_строку but ignore the rest.";
Pattern ptrn = Pattern.compile("#(?>\\p{L}\\p{M}*+|_)+");
Matcher matcher = ptrn.matcher(str);
while (matcher.find()) {
    System.out.println(matcher.group(0));
}
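
The difference between the atomic-group pattern and the plain character class only shows up when a diacritic is not preceded by a base letter. The following sketch, with the hypothetical class name DiacriticPlacementDemo and made-up inputs, tries both patterns on a well-formed tag and on one where a combining acute accent (U+0301) comes right after #.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DiacriticPlacementDemo {
    // Each letter may be followed by marks; a mark alone is not accepted.
    static final Pattern STRICT = Pattern.compile("#(?>\\p{L}\\p{M}*+|_)+");
    // Letters, marks and underscores in any order.
    static final Pattern LOOSE = Pattern.compile("#[\\p{L}\\p{M}_]+");

    // Base letters followed by combining accents, plus an underscore.
    static final String WELL_FORMED = "see #re\u0301sume\u0301_x here";
    // A combining accent directly after '#', with no base letter before it.
    static final String STRAY_MARK = "see #\u0301abc here";

    // Returns the first match of the pattern in the input, or null if there is none.
    static String firstMatch(Pattern pattern, String input) {
        Matcher matcher = pattern.matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // Both patterns agree on well-formed input.
        System.out.println(firstMatch(STRICT, WELL_FORMED).equals(firstMatch(LOOSE, WELL_FORMED)));
        // The strict pattern rejects the stray mark; the loose one accepts it.
        System.out.println(firstMatch(STRICT, STRAY_MARK) == null);
        System.out.println(firstMatch(LOOSE, STRAY_MARK) != null);
    }
}
```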

Answer 2

You can use the following code to capture all Unicode letters (matched by the \\p{L} class):

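String ss = "I am trying to match #this_word but ignore the rest.";
// \p{L} already covers both cases, so CASE_INSENSITIVE is not strictly needed
Matcher m = Pattern.compile("#(\\p{L})+", Pattern.CASE_INSENSITIVE).matcher(ss);
while (m.find()) {
    // Prints "#this": \p{L} matches letters only, so the match stops at the underscore
    System.out.println(m.group());
}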

Answer 3

Use this pattern:

 #[^\s]+

This might work. It matches # followed by every non-whitespace character, up to the next whitespace in the given string.
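
One consequence worth checking is that a non-whitespace class also picks up punctuation attached to the word. The sketch below is only an illustration with a hypothetical class name (NonWhitespaceDemo) and a made-up input; it compares #[^\s]+ with the (?U)#\w+ pattern on a tag followed by a comma.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NonWhitespaceDemo {
    // Sample input where the tag is immediately followed by a comma.
    static final String SAMPLE = "match #this_word, then ignore the rest.";

    // Returns the first match of the regex in the input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // Everything up to the next whitespace, including the comma.
        System.out.println(firstMatch("#[^\\s]+", SAMPLE));
        // Word characters only, so the comma is left out.
        System.out.println(firstMatch("(?U)#\\w+", SAMPLE));
    }
}
```

Whether the trailing punctuation is acceptable depends on the input; if tags are always delimited by spaces, the simpler pattern may be enough.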
