
replaceAll(): remove all punctuation marks, with these exceptions

I am trying to build a String[] from a .txt file, and I need to remove all punctuation, with some exceptions. Here is my code:

replaceAll("[^a-zA-Z ]", "");

Exceptions:
1. Keep hyphens that are inside a word.
2. Remove words that contain digits.
3. Remove words that have two punctuation marks at the beginning or the end.

[^a-zA-Z ] is a character class. This means it matches exactly one character, in this case any character that is not a-z, A-Z, or a space.
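
To see why the negated class alone is not enough, here is a minimal sketch of what the pattern from the question does to a sample string. The class name CharClassDemo, the method stripNonLetters, and the input are made up for illustration.

```java
public class CharClassDemo {
    // Removes every single character that is not a letter or a space.
    static String stripNonLetters(String s) {
        return s.replaceAll("[^a-zA-Z ]", "");
    }

    public static void main(String[] args) {
        // The hyphen inside "well-known" is lost, and only the digits of "42!"
        // are removed one by one; the pattern has no notion of a whole word.
        System.out.println("[" + stripNonLetters("well-known, it's 42!") + "]");
        // prints [wellknown its ]
    }
}
```

Each match is a single character, so the hyphen rule and the "drop the whole word" rules cannot be expressed with this pattern alone.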

If you want to match words, you need to combine character classes with quantifiers, for example +. If you want to match several different patterns, use the alternation (logical OR) operator |.
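
A small sketch of the difference a quantifier and alternation make. The class QuantifierDemo, its method names, and the sample input are assumptions chosen for illustration, not part of the original answer.

```java
public class QuantifierDemo {
    // No quantifier: each digit is a separate match.
    static String maskEachDigit(String s) {
        return s.replaceAll("[0-9]", "#");
    }

    // With +: a whole run of digits is one match.
    static String maskDigitRuns(String s) {
        return s.replaceAll("[0-9]+", "#");
    }

    // With |: either a run of digits or a run of uppercase letters.
    static String maskDigitOrUpperRuns(String s) {
        return s.replaceAll("[0-9]+|[A-Z]+", "#");
    }

    public static void main(String[] args) {
        String s = "ab12 CD34";
        System.out.println(maskEachDigit(s));        // prints ab## CD##
        System.out.println(maskDigitRuns(s));        // prints ab# CD#
        System.out.println(maskDigitOrUpperRuns(s)); // prints ab# ##
    }
}
```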

Knowing this, you can now start matching sequences that end with one or more digits, or that have a digit in the middle: [^a-zA-Z ][0-9]+|[^a-zA-Z ]+[0-9] . I'll leave it to you as an exercise to apply this to your three cases, since this sounds like a school assignment.

I have a very complicated regex, but it works.

\S*\d+\S*|\p{Punct}{2,}\S*|\S*\p{Punct}{2,}|[\p{Punct}&&[^-]]+|(?<![a-z])\-(?![a-z])

Explanation:

Match this alternative «\S*\d+\S*»
   Match a single character that is NOT a “whitespace character” «\S*»
      Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
   Match a single character that is a “digit” «\d+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
   Match a single character that is NOT a “whitespace character” «\S*»
      Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Or match this alternative «\p{Punct}{2,}\S*»
   Match a character from the POSIX character class “punct” «\p{Punct}{2,}»
      Between 2 and unlimited times, as many times as possible, giving back as needed (greedy) «{2,}»
   Match a single character that is NOT a “whitespace character” «\S*»
      Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Or match this alternative «\S*\p{Punct}{2,}»
   Match a single character that is NOT a “whitespace character” «\S*»
      Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
   Match a character from the POSIX character class “punct” «\p{Punct}{2,}»
      Between 2 and unlimited times, as many times as possible, giving back as needed (greedy) «{2,}»
Or match this alternative «[\p{Punct}&&[^-]]+»
   Match a single character present in the list below «[\p{Punct}&&[^-]]+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
      A character from the POSIX character class “punct” «\p{Punct}»
      Except the literal character “-” «&&[^-]»
Or match this alternative «(?<![a-z])\-(?![a-z])»
   Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<![a-z])»
      Match a single character in the range between “a” and “z” «[a-z]»
   Match the character “-” literally «\-»
   Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?![a-z])»
      Match a single character in the range between “a” and “z” «[a-z]»
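
The last two alternatives are the least obvious ones, so here is a sketch that isolates them. The class HyphenRules, its method names, and the inputs are hypothetical; the (?i) flag is added so that [a-z] also covers uppercase letters.

```java
public class HyphenRules {
    // Character-class intersection: any ASCII punctuation except '-'.
    static String dropNonHyphenPunct(String s) {
        return s.replaceAll("[\\p{Punct}&&[^-]]+", "");
    }

    // A hyphen with no letter directly before it and no letter directly after it.
    static String dropLoneHyphens(String s) {
        return s.replaceAll("(?i)(?<![a-z])-(?![a-z])", "");
    }

    public static void main(String[] args) {
        System.out.println(dropNonHyphenPunct("a-b, c! (d)")); // prints a-b c d
        // The hyphen in "a-b" touches letters, so only the free-standing one goes.
        System.out.println("[" + dropLoneHyphens("a-b - c") + "]"); // prints [a-b  c]
    }
}
```

Note that the lookarounds only protect a hyphen that touches a letter; tokens such as ab--- are caught earlier by the alternatives that look for two or more punctuation characters.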

Example:

String text = "a-b ab--- - ---a --- , ++++ ?%# $22 43 4zzv";

String rx = "(?i)\\S*\\d+\\S*|\\p{Punct}{2,}\\S*|\\S*\\p{Punct}{2,}|[\\p{Punct}&&[^-]]+|(?<![a-z])\\-(?![a-z])";

String result = text.replaceAll(rx, " ").trim();

System.out.println(result);

The code above will print:

a-b
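
Since the question asks for a String[], the same regex can be wrapped in a helper that also splits the cleaned line. This is only a sketch: the class PunctuationCleaner, the method toWords, and the sample sentence are assumptions, and reading the .txt file is left out.

```java
import java.util.Arrays;

public class PunctuationCleaner {
    // Regex copied from the answer; (?i) makes [a-z] case-insensitive.
    private static final String RX =
        "(?i)\\S*\\d+\\S*|\\p{Punct}{2,}\\S*|\\S*\\p{Punct}{2,}"
        + "|[\\p{Punct}&&[^-]]+|(?<![a-z])\\-(?![a-z])";

    // Hypothetical helper: cleans one line and splits it into words.
    static String[] toWords(String line) {
        String cleaned = line.replaceAll(RX, " ").trim();
        // An empty string would otherwise split into one empty element.
        return cleaned.isEmpty() ? new String[0] : cleaned.split("\\s+");
    }

    public static void main(String[] args) {
        String line = "state-of-the-art design, v2 tools!! (draft) - done.";
        System.out.println(Arrays.toString(toWords(line)));
        // prints [state-of-the-art, design, draft, done]
    }
}
```

Replacing matches with a space rather than an empty string keeps neighbouring words from being glued together; splitting on \s+ then collapses the extra spaces.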
