
String split, words including accented characters

I'm using this regex:

x.split("[^a-zA-Z0-9']+");

This returns an array of strings containing letters and/or numbers.

If I use this:

String name = "CEN01_Automated_TestCase.java";
String[] names = name.split("[^a-zA-Z0-9']+");

I get:

CEN01
Automated
TestCase
java

But if I use this:

String name = "CEN01_Automação_Caso_Teste.java";
String[] names = name.split("[^a-zA-Z0-9']+");

I get:

CEN01
Automa
o
Caso
Teste
java

How can I modify this regex to include accented characters (á, ã, õ, etc.)?

From http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html:

Categories that behave like the java.lang.Character boolean ismethodname methods (except for the deprecated ones) are available through the same \p{prop} syntax where the specified property has the name javamethodname.

Since the Character class has an isAlphabetic method, you can use

name.split("[^\\p{IsAlphabetic}0-9']+");

You can also use

name.split("(?U)[^\\p{Alpha}0-9']+");

but this needs the UNICODE_CHARACTER_CLASS flag, which you can enable by adding (?U) to the regex.
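As a minimal sketch of the two patterns above, the class below applies both to the file name from the question. The class and method names (AccentedSplit, splitIsAlphabetic, splitUnicodeAlpha) are made up for illustration, and the accented letters are written as Unicode escapes only so the source stays ASCII.

```java
import java.util.Arrays;

public class AccentedSplit {
    // Splits on runs of characters that are not alphabetic, ASCII digits, or apostrophes.
    static String[] splitIsAlphabetic(String s) {
        return s.split("[^\\p{IsAlphabetic}0-9']+");
    }

    // Same idea; (?U) enables UNICODE_CHARACTER_CLASS so \p{Alpha} also covers non-ASCII letters.
    static String[] splitUnicodeAlpha(String s) {
        return s.split("(?U)[^\\p{Alpha}0-9']+");
    }

    public static void main(String[] args) {
        // The escapes spell the c-cedilla and a-tilde of the Portuguese file name.
        String name = "CEN01_Automa\u00e7\u00e3o_Caso_Teste.java";
        System.out.println(Arrays.toString(splitIsAlphabetic(name)));
        System.out.println(Arrays.toString(splitUnicodeAlpha(name)));
    }
}
```

Both calls should return the five tokens CEN01, Automação, Caso, Teste and java, because the underscore and the dot are the only characters outside the allowed set.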

I would check out the Java documentation on regular expressions. There is a Unicode section which I believe is what you may be looking for.

EDIT: Example

Another way would be to match on the character code you are looking for. For example:

\uFFFF where FFFF is the hexadecimal number of the character you are trying to match.

Example: \u00E0 matches à

Note that the backslash must be escaped in Java if you write the regex as a string literal.

Read more about it here.
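To make the escaping point concrete, here is a small sketch. The class name UnicodeEscapeDemo and the helper containsAGrave are hypothetical. The doubled backslash in the pattern literal means the regex engine, not the Java compiler, resolves the escape.

```java
import java.util.regex.Pattern;

public class UnicodeEscapeDemo {
    // The regex engine receives a backslash, the letter u and four hex digits,
    // and treats them as the single character U+00E0 (a with grave accent).
    static final Pattern A_GRAVE = Pattern.compile("\\u00E0");

    // Returns true if the input contains the precomposed character U+00E0.
    static boolean containsAGrave(String s) {
        return A_GRAVE.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(containsAGrave("voil\u00e0")); // true
        System.out.println(containsAGrave("voila"));      // false
    }
}
```

The same escape can be used inside a character class, so individual accented letters could be added to the original [^a-zA-Z0-9']+ pattern one code point at a time, although the property-based classes scale better.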

You can use this:

String[] names = name.split("[^a-zA-Z0-9'\\p{L}]+");

System.out.println(Arrays.toString(names));

Will output:

[CEN01, Automação, Caso, Teste, java]

See this for more information.

Why not split on the separators?

String[] names = name.split("[_.]");
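A short sketch of this approach follows, assuming underscore and dot are the only separators that occur. SeparatorSplit and its methods are illustrative names. The second method adds a + quantifier, which is not in the line above, to show how runs of separators behave differently.

```java
import java.util.Arrays;

public class SeparatorSplit {
    // Splits at every single underscore or dot; adjacent separators yield empty strings.
    static String[] splitOnSeparators(String s) {
        return s.split("[_.]");
    }

    // Variant that treats a run of separators as one delimiter.
    static String[] splitOnSeparatorRuns(String s) {
        return s.split("[_.]+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnSeparators("a__b")));    // [a, , b]
        System.out.println(Arrays.toString(splitOnSeparatorRuns("a__b"))); // [a, b]
    }
}
```

Because the pattern names the delimiters rather than the allowed characters, accented letters pass through untouched without any Unicode property classes.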

Instead of blacklisting all the characters you don't want, you could always whitelist the characters you want, like:

^[^<>%$]*$

The expression [^(many characters here)] just matches any character that is not listed.

But that is a personal opinion.
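For reference, a minimal sketch of how the ^[^<>%$]*$ pattern behaves when used for whole-string matching. NegatedClassDemo and isClean are hypothetical names. Inside the brackets the $ is a literal dollar sign, not an anchor.

```java
import java.util.regex.Pattern;

public class NegatedClassDemo {
    // Matches only strings made up entirely of characters other than <, >, % and $.
    static final Pattern NO_LISTED_CHARS = Pattern.compile("^[^<>%$]*$");

    // Returns true when none of the listed characters appears in the input.
    static boolean isClean(String s) {
        return NO_LISTED_CHARS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isClean("Automa\u00e7\u00e3o")); // true
        System.out.println(isClean("50%"));                  // false
    }
}
```

Accented letters are accepted here simply because they are not among the listed characters.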
