How to filter a Java String to get only alphabet characters?
I'm generating an XML file to make payments and I have a constraint on users' full names. That parameter only accepts alphabetic characters (a-z, A-Z) plus whitespace to separate names and surnames.
I can't find an easy way to filter this. How can I build a regular expression or a filter to get my desired output?
Example:
'Carmen López-Delina Santos'
must be 'Carmen LopezDelina Santos'
I need to convert accented vowels into plain vowels, as follows: á > a, à > a, â > a, and so on; and also remove special characters such as dots, hyphens, etc.
Thanks!
You can first use a Normalizer and then remove the undesired characters:
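// Requires: import java.text.Normalizer;
String input = "Carmen López-Delina Santos";
// NFD splits each accented letter into a base letter plus combining marks
String withoutAccent = Normalizer.normalize(input, Normalizer.Form.NFD);
// Keep only ASCII letters and spaces; this also drops the combining marks
String output = withoutAccent.replaceAll("[^a-zA-Z ]", "");
System.out.println(output); // prints Carmen LopezDelina Santos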
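To see why the normalization step comes first, here is a minimal sketch comparing the filter alone against normalize-then-filter. The class and method names (`WhyNormalize`, `filterOnly`, `normalizeThenFilter`) are made up for illustration, and the accented letter is written as a Unicode escape so the source compiles regardless of file encoding.

```java
import java.text.Normalizer;

public class WhyNormalize {
    // Filter only: a precomposed accented letter is not in [a-zA-Z], so it is deleted.
    public static String filterOnly(String s) {
        return s.replaceAll("[^a-zA-Z ]", "");
    }

    // Decompose first, so only the combining mark is deleted and the base letter stays.
    public static String normalizeThenFilter(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("[^a-zA-Z ]", "");
    }

    public static void main(String[] args) {
        String name = "L\u00f3pez"; // "Lopez" with an acute accent on the o (U+00F3)
        System.out.println(filterOnly(name));          // prints Lpez
        System.out.println(normalizeThenFilter(name)); // prints Lopez
    }
}
```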
Note that this may not work for every non-ASCII letter in every language: a letter that has no decomposition into an ASCII base letter plus combining marks is simply deleted. One such example is the Turkish dotless ı (U+0131).
The alternative in that situation is probably to list all the possible letters and their replacements...
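A rough sketch of that alternative might look like the following. The class name `NameSanitizer` and the replacement table are hypothetical and far from exhaustive; they only illustrate mapping a few letters that NFD does not decompose before applying the same normalize-and-filter steps. It assumes Java 9+ for `Map.of`, and uses Unicode escapes so the source is encoding-independent.

```java
import java.text.Normalizer;
import java.util.Map;

public class NameSanitizer {
    // Hypothetical, non-exhaustive table of letters with no NFD decomposition.
    private static final Map<Character, String> EXTRA = Map.of(
            '\u0131', "i",   // Turkish dotless i
            '\u00f8', "o",   // o with stroke
            '\u00d8', "O",   // O with stroke
            '\u0142', "l",   // l with stroke
            '\u0141', "L",   // L with stroke
            '\u00df', "ss"); // German sharp s

    public static String sanitize(String input) {
        // Step 1: replace the letters listed in the table explicitly.
        StringBuilder mapped = new StringBuilder(input.length());
        for (char c : input.toCharArray()) {
            String replacement = EXTRA.get(c);
            if (replacement != null) {
                mapped.append(replacement);
            } else {
                mapped.append(c);
            }
        }
        // Step 2: decompose the rest and keep only ASCII letters and spaces.
        return Normalizer.normalize(mapped, Normalizer.Form.NFD)
                .replaceAll("[^a-zA-Z ]", "");
    }

    public static void main(String[] args) {
        // "Carmen Lopez-Delina Santos" with an accented o
        System.out.println(sanitize("Carmen L\u00f3pez-Delina Santos")); // prints Carmen LopezDelina Santos
        // Turkish and Danish letters: s with cedilla, dotless i, O with stroke
        System.out.println(sanitize("I\u015f\u0131k \u00d8rsted"));       // prints Isik Orsted
    }
}
```

Any letter missing from the table and lacking a decomposition is still deleted, so the table has to be extended for the languages you expect.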
You can use this removeAccents method, followed by a replaceAll with [^A-Za-z ]:
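import java.text.Normalizer;
import java.text.Normalizer.Form;

// Strips combining diacritical marks; returns null for null input
public static String removeAccents(String text) {
    return text == null ? null :
        Normalizer.normalize(text, Form.NFD)
            .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}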
The Normalizer decomposes the original characters into a combination of a base character and a diacritic sign (this could be multiple signs in different languages). á, é and í have the same sign: U+0301, the combining acute accent. The \\p{InCombiningDiacriticalMarks}+ regular expression matches all such diacritic codes (the Unicode block U+0300 to U+036F), and we replace them with an empty string.
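The decomposition can be inspected directly. The sketch below (the class name `DecompositionDemo` and the helper `codePoints` are illustrative, not part of the answer) prints the code points of the NFD form of a string, which shows the shared U+0301 mark after each base letter.

```java
import java.text.Normalizer;

public class DecompositionDemo {
    // Returns the code points of the NFD form, separated by spaces.
    public static String codePoints(String text) {
        String nfd = Normalizer.normalize(text, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder();
        nfd.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(codePoints("\u00e1")); // a with acute: prints U+0061 U+0301
        System.out.println(codePoints("\u00e9")); // e with acute: prints U+0065 U+0301
        System.out.println(codePoints("\u00ed")); // i with acute: prints U+0069 U+0301
    }
}
```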
And in the caller:
String original = "Carmen López-Delina Santos";
String res = removeAccents(original).replaceAll("[^A-Za-z ]", "");
System.out.println(res);
See the IDEONE demo.