Java Regex pattern to match String from all languages that end with a whitespace
Basically, I need to match words in a string that start with a particular character (here, #). The following is an example:
I am trying to match #this_word but ignore the rest.
I also need the regex to match characters from different languages. I tried this:
#\\s*(\\w+)
but it only matches English words.
When I try a regex such as the following:
#(?>\\p{L}\\p{M}*+)+
I get an IndexOutOfBoundsException.
Apparently the reason I got that error was that I wrote:
matcher.group(1);
instead of:
matcher.group(0);
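The difference between the two calls can be checked with a small sketch. It assumes the pattern from the question and uses hypothetical helper names (`firstTag`, `groupOneThrows`). The `(?>...)` construct is an atomic group, which does not capture, so the pattern has no group 1 and only group 0 (the whole match) is available.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupIndexDemo {
    // Pattern from the question; (?>...) is atomic and non-capturing.
    static final Pattern TAG = Pattern.compile("#(?>\\p{L}\\p{M}*+)+");

    /** Returns the whole first match (group 0), or null if nothing matches. */
    static String firstTag(String text) {
        Matcher m = TAG.matcher(text);
        return m.find() ? m.group(0) : null;
    }

    /** Returns true if requesting group 1 fails after a successful find(). */
    static boolean groupOneThrows(String text) {
        Matcher m = TAG.matcher(text);
        if (!m.find()) {
            return false;
        }
        try {
            m.group(1); // the pattern defines no capturing group
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String text = "I am trying to match #this_word but ignore the rest.";
        System.out.println(TAG.matcher(text).groupCount()); // 0
        System.out.println(firstTag(text));                 // #this
        System.out.println(groupOneThrows(text));           // true
    }
}
```

Note that this pattern stops at the underscore, because `_` is not a letter.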
If you do not mind digits being matched as well, just add the (?U) flag before the pattern:
UNICODE_CHARACTER_CLASS
public static final int UNICODE_CHARACTER_CLASS
Enables the Unicode version of Predefined character classes and POSIX character classes.
When this flag is specified then the (US-ASCII only) Predefined character classes and POSIX character classes are in conformance with Unicode Technical Standard #18: Unicode Regular Expression Annex C: Compatibility Properties.
The UNICODE_CHARACTER_CLASS mode can also be enabled via the embedded flag expression (?U).
The flag implies UNICODE_CASE, that is, it enables Unicode-aware case folding.
Regex:
Pattern ptrn = Pattern.compile("(?U)#\\w+");
See IDEONE demo
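As a rough illustration of what the flag changes, the sketch below runs the same pattern with and without (?U) on a sample sentence. The helper `findAll` and the sample text are assumptions made for this example only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeWordDemo {
    /** Collects every match of regex in text (group 0 is the whole match). */
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group(0));
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "match #эту_строку and #this_word here";
        // Default mode: \w covers only ASCII letters, digits and underscore,
        // so the Cyrillic tag is skipped.
        System.out.println(findAll("#\\w+", text));     // [#this_word]
        // (?U) enables UNICODE_CHARACTER_CLASS for the whole pattern.
        System.out.println(findAll("(?U)#\\w+", text)); // [#эту_строку, #this_word]
    }
}
```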
You can actually subtract digits from \\w with [\\w&&[^\\d]] to stop matching digits, so that essentially only underscores and Unicode letters (with their combining marks) are matched:
Pattern ptrn = Pattern.compile("#[\\w&&[^\\d]]+", Pattern.UNICODE_CHARACTER_CLASS);
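A quick way to see the effect of the intersection is to run it on a tag that contains digits. This is a minimal sketch; the class name, the `tags` helper and the sample text are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NoDigitsDemo {
    // \w with digits removed via character class intersection (&&).
    static final Pattern TAG =
        Pattern.compile("#[\\w&&[^\\d]]+", Pattern.UNICODE_CHARACTER_CLASS);

    /** Returns every tag found in text, in order of appearance. */
    static List<String> tags(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // The first tag is cut off where its digits begin.
        System.out.println(tags("see #слово123 and #this_word"));
        // [#слово, #this_word]
    }
}
```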
As an alternative, to match any Unicode letter you may use the \\p{L}\\p{M}*+ subpattern (\\p{L} is a base letter and \\p{M} matches diacritics). So, to match only letters after # you can use #(?>\\p{L}\\p{M}*+)+.
To also match an underscore, add it as an alternative: #(?>\\p{L}\\p{M}*+|_)+.
If you do not care about where the diacritic is, use just a character class: #[\\p{L}\\p{M}_]+.
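The practical difference between the atomic-group pattern and the plain character class shows up only when a combining mark is not preceded by a base letter. The sketch below compares the two on a well-formed tag and on a tag that starts with a stray combining acute accent (U+0301). The class name, the `first` helper and the inputs are assumptions for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DiacriticPlacementDemo {
    /** Returns the first match of regex in text, or null if there is none. */
    static String first(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String strict = "#(?>\\p{L}\\p{M}*+|_)+"; // marks must follow a letter
        String loose = "#[\\p{L}\\p{M}_]+";       // marks allowed anywhere

        String wellFormed = "#cafe\u0301_bar";    // e followed by a combining acute
        String stray = "#\u0301abc";              // combining acute right after #

        System.out.println(wellFormed.equals(first(strict, wellFormed))); // true
        System.out.println(wellFormed.equals(first(loose, wellFormed)));  // true
        System.out.println(first(strict, stray));                         // null
        System.out.println(stray.equals(first(loose, stray)));            // true
    }
}
```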
See this IDEONE demo:
String str = "I am trying to match #эту_строку but ignore the rest.";
Pattern ptrn = Pattern.compile("#(?>\\p{L}\\p{M}*+|_)+");
Matcher matcher = ptrn.matcher(str);
while (matcher.find()) {
System.out.println(matcher.group(0));
}
You can use the following code to capture all Unicode letters (matched by the \\p{L} class):
String ss = "I am trying to match #this_word but ignore the rest.";
// \\p{L} matches letters only, so the match stops at the underscore.
Matcher m = Pattern.compile("#(\\p{L})+", Pattern.CASE_INSENSITIVE).matcher(ss);
while (m.find()) {
    System.out.println(m.group()); // prints "#this"
}
Use this pattern:
#[^\s]+
This might work. It matches # followed by every non-whitespace character up to the next whitespace in the given String.
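One consequence worth checking before adopting this pattern is that "non-whitespace" also includes punctuation. The following sketch, with a made-up class name and sample sentence, shows what the pattern returns when a tag is directly followed by a comma or a full stop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonWhitespaceDemo {
    // The pattern from this answer, written as a Java string literal.
    static final Pattern TAG = Pattern.compile("#[^\\s]+");

    /** Returns every match in text, in order of appearance. */
    static List<String> tags(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Works for any script, but trailing punctuation is part of the match.
        for (String tag : tags("Tag #this_word, then #эту_строку.")) {
            System.out.println(tag);
        }
        // prints "#this_word," and then "#эту_строку."
    }
}
```

If the trailing punctuation is unwanted, one of the letter-based patterns above is probably a better fit.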