
How to filter a Java String to get only alphabet characters?

I'm generating an XML file to make payments, and I have a constraint on users' full names. That parameter only accepts alphabetic characters (a-z, A-Z) plus whitespace to separate first names and surnames.

I can't find an easy way to filter this. How can I build a regular expression or a filter to get the desired output?

Example:

'Carmen López-Delina Santos' must become 'Carmen LopezDelina Santos'.

I need to turn accented vowels into plain vowels (á > a, à > a, â > a, and so on) and also remove special characters such as dots, hyphens, etc.

Thanks!

You can first use a Normalizer and then remove the undesired characters:

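import java.text.Normalizer;

String input = "Carmen López-Delina Santos";
// NFD splits each accented letter into a base letter plus combining marks
String withoutAccent = Normalizer.normalize(input, Normalizer.Form.NFD);
// Drop everything except ASCII letters and spaces (marks, hyphens, dots, ...)
String output = withoutAccent.replaceAll("[^a-zA-Z ]", "");
System.out.println(output); // prints Carmen LopezDelina Santos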

Note that this may not work for every non-ASCII letter in every language: a letter that has no decomposition into an ASCII base letter is simply deleted. One such example is the Turkish dotless ı.

The alternative in that situation is probably to list all the possible letters and their replacements.
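A minimal sketch of that fallback, assuming a small hand-picked replacement table applied before normalization. The class name NameSanitizer, the method toAsciiLetters, and the entries in REPLACEMENTS are illustrative and not part of the original answer; the table would need extending for whichever languages you expect. It uses Map.of, so it assumes Java 9 or later.

```java
import java.text.Normalizer;
import java.util.Map;

public class NameSanitizer {
    // Hypothetical table: letters that NFD does not split into base letter + mark.
    private static final Map<Character, String> REPLACEMENTS = Map.of(
        '\u0131', "i",   // Turkish dotless i
        '\u00f8', "o",   // o with stroke
        '\u00d8', "O",   // O with stroke
        '\u0142', "l",   // l with stroke
        '\u0141', "L",   // L with stroke
        '\u00df', "ss"   // German sharp s
    );

    public static String toAsciiLetters(String input) {
        // Step 1: apply the explicit replacements.
        StringBuilder mapped = new StringBuilder(input.length());
        for (char c : input.toCharArray()) {
            String replacement = REPLACEMENTS.get(c);
            if (replacement != null) {
                mapped.append(replacement);
            } else {
                mapped.append(c);
            }
        }
        // Step 2: decompose the remaining accented letters.
        String decomposed = Normalizer.normalize(mapped, Normalizer.Form.NFD);
        // Step 3: keep only ASCII letters and spaces.
        return decomposed.replaceAll("[^a-zA-Z ]", "");
    }

    public static void main(String[] args) {
        // "Carmen Lopez-Delina Santos" with an acute accent on the o
        System.out.println(toAsciiLetters("Carmen L\u00f3pez-Delina Santos"));
        // A name containing the Turkish dotless i
        System.out.println(toAsciiLetters("I\u015f\u0131k"));
    }
}
```

Letters missing from the table and without a decomposition are still deleted by the final replaceAll, so the table has to cover every such letter that can appear in your data.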

You can use this removeAccents method, followed by a replaceAll with [^A-Za-z ]:

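import java.text.Normalizer;
import java.text.Normalizer.Form;

// Decomposes the text (NFD), then strips the combining diacritical marks
public static String removeAccents(String text) {
  return text == null ? null :
    Normalizer.normalize(text, Form.NFD)
        .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}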

The Normalizer decomposes each original character into a base character followed by one or more combining diacritical marks (a single letter can carry several marks in some languages). á, é and í share the same mark: U+0301, the combining acute accent.

The \\p{InCombiningDiacriticalMarks}+ regular expression (written here with the doubled backslash of a Java string literal) matches any run of characters in the Combining Diacritical Marks block, U+0300 to U+036F, and we replace each match with an empty string.
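To see the decomposition for yourself, here is a small sketch that prints the code points of the NFD form. The class name DecompositionDemo and the helper nfdCodePoints are made up for this illustration; only Normalizer.normalize comes from the answer above.

```java
import java.text.Normalizer;

public class DecompositionDemo {
    // Returns the code points of the NFD form, separated by spaces.
    public static String nfdCodePoints(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder();
        decomposed.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(nfdCodePoints("\u00e1")); // a with acute: U+0061 U+0301
        System.out.println(nfdCodePoints("\u00e9")); // e with acute: U+0065 U+0301
        System.out.println(nfdCodePoints("\u00ed")); // i with acute: U+0069 U+0301
    }
}
```

Each result is the plain ASCII letter followed by U+0301, which is why removing the combining marks leaves a, e and i behind.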

And in the caller:

String original = "Carmen López-Delina Santos";
String res = removeAccents(original).replaceAll("[^A-Za-z ]", "");
System.out.println(res);

See IDEONE demo


 