
Java Regex - Remove Non-Alphanumeric characters except line breaks

I'm trying to remove all the non-alphanumeric characters from a String in Java but keep the carriage returns. I have the following regular expression, but it keeps joining the words before and after a line break.

[^\\p{Alnum}\\s]

How can I preserve the line breaks, or convert them into spaces, so that words don't get joined?

An example of this issue is shown below:

Original Text:

and refreshingly direct
when compared with the hand-waving of Swinburne.

After Replacement:

 and refreshingly directwhen compared with the hand-waving of Swinburne.

You may add these chars to the regex instead of \\s, as \\s matches any whitespace:

String reg = "[^\\p{Alnum}\n\r]";

Or, you may use character class subtraction:

String reg = "[\\P{Alnum}&&[^\n\r]]";

Here, \\P{Alnum} matches any non-alphanumeric character and &&[^\n\r] prevents LF and CR from matching.
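A small sketch to check that the two patterns above behave the same way. The class name `AlnumFilterDemo`, the helper `strip`, and the sample text are made up for illustration. Note that both patterns also remove spaces, since a space is neither alphanumeric nor a line break.

```java
public class AlnumFilterDemo {
    // Negated class: anything that is not alphanumeric, LF, or CR.
    static final String NEGATED = "[^\\p{Alnum}\n\r]";
    // Intersection: non-alphanumeric AND not LF/CR.
    static final String INTERSECTION = "[\\P{Alnum}&&[^\n\r]]";

    // Hypothetical helper: remove every match of the regex from the text.
    static String strip(String text, String regex) {
        return text.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        String text = "refreshingly direct,\nhand-waving!";
        // Both calls print the same two lines:
        // refreshinglydirect
        // handwaving
        System.out.println(strip(text, NEGATED));
        System.out.println(strip(text, INTERSECTION));
    }
}
```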

A Java test:

String s = "&&& Text\r\nNew line".replaceAll("[^\\p{Alnum}\n\r]+", "");
System.out.println(s);
// => Text
// Newline

Note that there are more line break chars than LF and CR. In Java 8, the \\R construct matches any style of line break; it is equivalent to \\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029].

So, to exclude matching any line breaks, you may use

String reg = "[^\\p{Alnum}\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]+";
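If the goal is instead to convert line breaks into spaces, as the question also asks, one possible approach is a two-step replacement: first turn every \\R match into a single space, then drop everything that is not alphanumeric or a space. This is only a sketch assuming Java 8 or later; the class name `LineBreakToSpaceDemo` and the method `flatten` are hypothetical.

```java
public class LineBreakToSpaceDemo {
    // Hypothetical helper: line breaks become spaces, other punctuation is dropped.
    static String flatten(String text) {
        return text
                // \R treats "\r\n" as one unit, so it yields one space, not two.
                .replaceAll("\\R", " ")
                // Keep only ASCII alphanumerics and the space character.
                .replaceAll("[^\\p{Alnum} ]", "");
    }

    public static void main(String[] args) {
        String text = "and refreshingly direct\r\nwhen compared with the hand-waving of Swinburne.";
        System.out.println(flatten(text));
        // => and refreshingly direct when compared with the handwaving of Swinburne
    }
}
```

Note that \\R cannot be placed inside a character class, which is why the line breaks are handled in a separate call.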

You can use the regex [^A-Za-z0-9\\n\\r], for example:

String result = str.replaceAll("[^a-zA-Z0-9\\n\\r]", "");

Example

Input

aaze03.aze1654aze987  */-a*azeaze\n hello *-*/zeaze+64\nqsdoi

Output

aaze03aze1654aze987aazeaze
hellozeaze64
qsdoi

I made a mistake in my code. I was reading in a file line by line and building the String, but didn't add a space at the end of each line. Therefore there were no actual line breaks to replace.
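The situation described here can be reproduced with a short sketch. BufferedReader.readLine() and BufferedReader.lines() return each line without its terminator, so lines concatenated with no separator run together before any regex is applied. The class name `JoinLinesDemo` and the method `joinLines` are hypothetical, and a StringReader stands in for the file so the example is self-contained.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.stream.Collectors;

public class JoinLinesDemo {
    // Hypothetical helper: read the text line by line and rejoin it with the given separator.
    static String joinLines(String raw, String separator) {
        BufferedReader reader = new BufferedReader(new StringReader(raw));
        // lines() yields each line with its terminator already stripped.
        return reader.lines().collect(Collectors.joining(separator));
    }

    public static void main(String[] args) {
        String raw = "and refreshingly direct\nwhen compared";
        System.out.println(joinLines(raw, ""));  // => and refreshingly directwhen compared
        System.out.println(joinLines(raw, " ")); // => and refreshingly direct when compared
    }
}
```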

That's a perfect case for Guava's CharMatcher:

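import com.google.common.base.CharMatcher;

String input = "and refreshingly direct\n\rwhen compared with the hand-waving of Swinburne.";
// Keep letters, digits, and whitespace; drop everything else.
String output = CharMatcher.javaLetterOrDigit().or(CharMatcher.whitespace()).retainFrom(input);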

Output will be:

and refreshingly direct
when compared with the handwaving of Swinburne
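For projects that do not use Guava, a roughly similar filter can be sketched with the JDK alone. This is an assumption-laden approximation: Character.isLetterOrDigit and Character.isWhitespace are not guaranteed to classify every character exactly as the CharMatcher methods do, and the names `RetainDemo` and `retainLettersDigitsWhitespace` are made up for illustration.

```java
public class RetainDemo {
    // Hypothetical helper: keep letters, digits, and whitespace; drop everything else.
    static String retainLettersDigitsWhitespace(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
             .filter(cp -> Character.isLetterOrDigit(cp) || Character.isWhitespace(cp))
             .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "and refreshingly direct\nwhen compared with the hand-waving of Swinburne.";
        // Prints the two lines with the hyphen and the final period removed.
        System.out.println(retainLettersDigitsWhitespace(input));
    }
}
```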
