
Regex to remove all non-alphanumeric characters with universal language support?

I would like to use Pattern's compile method to do this, for example:

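String text = "Where? What is that, an animal? No! It is a plane.";
// Pattern has no public constructor; use the static compile method.
// "*some regex here*" is a placeholder, not a valid regex.
Pattern p = Pattern.compile("*some regex here*");
String delim = p.matcher(text).replaceAll("");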

What regex would accomplish this?

Example strings:

English

Input: "Where? What is that, an animal? No! It is a plane."
Output: "Where What is that an animal No It is a plane"

Spanish

Input: "¿Dónde? ¿Qué es eso, un animal? ¡No! Es un avión."
Output: "Dónde Qué es eso un animal No Es un avión"

Portuguese

Input: "Onde? O que é isso, um animal? Não! É um avião."
Output: "Onde O que é isso um animal Não É um avião"

Hopefully the examples make it clear what I'm trying to accomplish. Thanks all!

I am not an expert in all the world's languages, but your requirements could be met by handling this on a language-specific basis:

Regex rgx = new Regex("[^a-zA-Z0-9 <put language specific characters to preserve here>]");
str = rgx.Replace(str, "");

I speak English and Korean, and can tell you that punctuation in Korean is identical to that used in English. As indicated above, you can add characters that should be preserved and not treated as punctuation for a particular language. For example, let's say the tilde should not be considered punctuation. Then use the regex:

[^a-zA-Z0-9 ~]
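
For readers working in Java rather than the Regex class used above, the same allow-list idea can be sketched as follows. The class and method names (AllowListDemo, clean) are hypothetical and chosen only for illustration; the pattern is the one given above. Note that an ASCII-only allow-list also removes accented letters unless they are added to the class explicitly.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: an ASCII allow-list that additionally keeps the tilde.
class AllowListDemo {
    // Anything that is NOT an ASCII letter, an ASCII digit, a space, or '~' is matched.
    private static final Pattern NOT_ALLOWED = Pattern.compile("[^a-zA-Z0-9 ~]");

    static String clean(String text) {
        return NOT_ALLOWED.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // ':' ',' and '?' are removed, '~' is preserved.
        System.out.println(clean("Cost: ~5 dollars, right?"));
        // "\u00BFD\u00F3nde?" is "¿Dónde?"; the accented letter is not in the
        // allow-list, so it is removed along with the punctuation.
        System.out.println(clean("\u00BFD\u00F3nde?"));
    }
}
```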

The Java Pattern class, which is Java's implementation of regex, supports Unicode categories, e.g. \\p{Lu} (written here as it appears in a Java string literal, with the backslash doubled). Since you want alphanumeric characters, that would be the categories L (Letter) and N (Number).

Since your example shows you also want to keep spaces, you need to include them. Let's use the predefined character class \\s, so you also keep newlines and tabs.

To match anything but the specified characters, use a negated character class: [^abc]

All in all, that means [^\\s\\p{L}\\p{N}]:

String output = input.replaceAll("[^\\s\\p{L}\\p{N}]+", "");
Where What is that an animal No It is a plane
Dónde Qué es eso un animal No Es un avión
Onde O que é isso um animal Não É um avião

Or see regex101.com for a demo.
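
Since the question asks specifically about Pattern's compile method, here is a minimal sketch of the same regex used through a precompiled Pattern. The class and method names (UnicodeCleaner, strip) are assumptions made for illustration; only Pattern.compile, matcher, and replaceAll come from the Java API.

```java
import java.util.regex.Pattern;

// Hypothetical helper class wrapping the regex from the answer above.
class UnicodeCleaner {
    // Matches runs of characters that are not whitespace, letters (L), or numbers (N).
    private static final Pattern NON_ALNUM = Pattern.compile("[^\\s\\p{L}\\p{N}]+");

    static String strip(String text) {
        return NON_ALNUM.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("Where? What is that, an animal? No! It is a plane."));
        // Unicode escapes spell out "¿Dónde? ¿Qué es eso, un animal? ¡No! Es un avión."
        System.out.println(strip(
            "\u00BFD\u00F3nde? \u00BFQu\u00E9 es eso, un animal? \u00A1No! Es un avi\u00F3n."));
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex on every call, which String.replaceAll does internally. One caveat: accents stored as separate combining marks (category M) are not covered by \\p{L}, so text in decomposed form would lose them unless \\p{M} is added to the class.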


Of course, there are multiple ways to do it.

Alternatively, you could use the POSIX character class \\p{Alnum} and enable UNICODE_CHARACTER_CLASS with the embedded flag (?U).

String output = input.replaceAll("(?U)[^\\s\\p{Alnum}]+", "");
Where What is that an animal No It is a plane
Dónde Qué es eso un animal No Es un avión
Onde O que é isso um animal Não É um avião
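
The embedded (?U) flag can also be passed as a compile-time flag, which may read more clearly when the pattern is precompiled. The following is a sketch under that assumption; the class and method names (UnicodeFlagDemo, strip) are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo: Pattern.UNICODE_CHARACTER_CLASS in place of the inline (?U) flag.
class UnicodeFlagDemo {
    // With the flag set, \p{Alnum} covers Unicode letters and digits, not just ASCII.
    private static final Pattern NON_ALNUM =
        Pattern.compile("[^\\s\\p{Alnum}]+", Pattern.UNICODE_CHARACTER_CLASS);

    static String strip(String text) {
        return NON_ALNUM.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Unicode escapes spell out "Onde? O que é isso, um animal? Não! É um avião."
        System.out.println(strip(
            "Onde? O que \u00E9 isso, um animal? N\u00E3o! \u00C9 um avi\u00E3o."));
    }
}
```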

Now, if you didn't want to keep spaces, that could be simplified by using the negated form \\P{xx} instead:

String output = input.replaceAll("(?U)\\P{Alnum}+", "");
WhereWhatisthatananimalNoItisaplane
DóndeQuéesesounanimalNoEsunavión
OndeOqueéissoumanimalNãoÉumavião
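
To show why the (?U) flag matters here, the sketch below compares \\P{Alnum} with and without it. Without the flag, \\p{Alnum} is ASCII-only, so accented letters are stripped as well. The class and method names (AsciiVsUnicodeDemo, asciiOnly, unicodeAware) are made up for this illustration.

```java
// Hypothetical demo contrasting ASCII-only and Unicode-aware \P{Alnum}.
class AsciiVsUnicodeDemo {
    // Without (?U): \p{Alnum} means [a-zA-Z0-9], so everything else is removed.
    static String asciiOnly(String text) {
        return text.replaceAll("\\P{Alnum}+", "");
    }

    // With (?U): \p{Alnum} also covers non-ASCII letters and digits.
    static String unicodeAware(String text) {
        return text.replaceAll("(?U)\\P{Alnum}+", "");
    }

    public static void main(String[] args) {
        String input = "\u00BFD\u00F3nde?"; // "¿Dónde?"
        System.out.println(asciiOnly(input));    // the accented letter is lost
        System.out.println(unicodeAware(input)); // the accented letter is kept
    }
}
```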
