String split, words including accented characters
I'm using this regex:
x.split("[^a-zA-Z0-9']+");
This returns an array of strings with letters and/or numbers.
If I use this:
String name = "CEN01_Automated_TestCase.java";
String[] names = name.split("[^a-zA-Z0-9']+");
I get:
CEN01
Automated
TestCase
java
But if I use this:
String name = "CEN01_Automação_Caso_Teste.java";
String[] names = name.split("[^a-zA-Z0-9']+");
I get:
CEN01
Automa
o
Caso
Teste
java
How can I modify this regex to include accented characters (á, ã, õ, etc.)?
From http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html:
Categories that behave like the java.lang.Character boolean ismethodname methods (except for the deprecated ones) are available through the same \p{prop} syntax where the specified property has the name javamethodname.
Since the Character class contains an isAlphabetic method, you can use
name.split("[^\\p{IsAlphabetic}0-9']+");
You can also use
name.split("(?U)[^\\p{Alpha}0-9']+");
but you will need the UNICODE_CHARACTER_CLASS flag, which can be enabled by adding (?U) to the regex.
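As a minimal sketch of the two variants above, the following applies both patterns to the file name from the question. The class and method names are hypothetical, and the accented letters are written as \u escapes so the result does not depend on the source file encoding. It assumes the input uses precomposed characters (ç as U+00E7, ã as U+00E3); a decomposed form with separate combining marks may still be split.

```java
import java.util.Arrays;

public class AccentedSplit {
    // Split on runs of anything that is not alphabetic (in the Unicode sense),
    // an ASCII digit, or an apostrophe.
    static String[] splitWithIsAlphabetic(String name) {
        return name.split("[^\\p{IsAlphabetic}0-9']+");
    }

    // Same idea with the POSIX class \p{Alpha}; the embedded (?U) flag
    // (UNICODE_CHARACTER_CLASS) makes it cover non-ASCII letters too.
    static String[] splitWithUnicodeFlag(String name) {
        return name.split("(?U)[^\\p{Alpha}0-9']+");
    }

    public static void main(String[] args) {
        // "CEN01_Automação_Caso_Teste.java" with the accents escaped
        String name = "CEN01_Automa\u00E7\u00E3o_Caso_Teste.java";
        System.out.println(Arrays.toString(splitWithIsAlphabetic(name)));
        System.out.println(Arrays.toString(splitWithUnicodeFlag(name)));
    }
}
```

Both calls should keep "Automação" as a single token instead of breaking it at the accented letters.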
I would check out the Java documentation on regular expressions. There is a Unicode section which I believe is what you may be looking for.
EDIT: Example
Another way would be to match on the character code you are looking for. For example
\uFFFF where FFFF is the hexadecimal number of the character you are trying to match.
Example: \u00E0 matches à
Note that the backslash needs to be escaped in Java if you are using it in a string literal.
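A rough sketch of the character-code approach follows. The class and method names are hypothetical, and the range U+00C0 to U+00FF is an assumption chosen for illustration: it covers the accented letters of the Latin-1 block but also includes the × and ÷ signs, so it is not a precise "letters only" set.

```java
import java.util.regex.Pattern;

public class CodePointMatch {
    // The Java literal "\\u00E0" reaches the regex engine as \u00E0,
    // which the engine itself interprets as the character U+00E0 (à).
    static boolean matchesAGrave(String s) {
        return Pattern.matches("\\u00E0", s);
    }

    // The question's regex extended with an assumed code point range
    // (U+00C0..U+00FF) so Latin-1 accented letters are kept in the tokens.
    static String[] splitLatin1(String name) {
        return name.split("[^a-zA-Z0-9'\\u00C0-\\u00FF]+");
    }

    public static void main(String[] args) {
        System.out.println(matchesAGrave("\u00E0")); // true
        System.out.println(matchesAGrave("a"));      // false
        String name = "CEN01_Automa\u00E7\u00E3o_Caso_Teste.java";
        System.out.println(String.join(" | ", splitLatin1(name)));
    }
}
```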
Why not split on the separators?
String[] names = name.split("[_.]");
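As an illustration of splitting on the separators, here is a small sketch with hypothetical class and method names. It assumes that underscore and dot are the only separators in the file names, which holds for the examples in the question but may not hold in general.

```java
import java.util.Arrays;

public class SeparatorSplit {
    // Split on each single underscore or dot; every other character,
    // accented or not, stays inside the tokens.
    static String[] splitOnSeparators(String name) {
        return name.split("[_.]");
    }

    public static void main(String[] args) {
        String name = "CEN01_Automa\u00E7\u00E3o_Caso_Teste.java";
        System.out.println(Arrays.toString(splitOnSeparators(name)));
        // Two separators in a row produce an empty token in between.
        System.out.println(Arrays.toString(splitOnSeparators("a__b")));
    }
}
```

If empty tokens from consecutive separators are unwanted, "[_.]+" would treat a run of separators as one.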
Instead of whitelisting all the characters you want, you could always blacklist the characters you don't want, like:
^[^<>%$]*$
The expression [^(many characters here)] just matches any character that is not listed.
But that is a personal opinion.
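To make the behaviour of ^[^<>%$]*$ concrete, here is a short sketch (the class and method names are hypothetical). The pattern accepts a string only when none of its characters is <, >, % or $, so it validates a whole string and does not split it.

```java
public class ExcludedChars {
    // True when the input contains none of the listed characters.
    // Inside a character class, $ is a literal dollar sign.
    static boolean hasNoExcludedChars(String s) {
        return s.matches("^[^<>%$]*$");
    }

    public static void main(String[] args) {
        System.out.println(hasNoExcludedChars("CEN01_Automa\u00E7\u00E3o")); // true
        System.out.println(hasNoExcludedChars("a<b"));                       // false
    }
}
```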