
Regex: remove punctuation that is not part of a word inside a string

I have this code:

String s="  //wont won't won't ";
String[] w =  s.split("[\\s+\\/,\\.!_\\-?;:]++");

I don't want the ' to be removed from won't, as it is part of the word. In //wont, however, I do want the // to be removed. Help would be appreciated.

So my question is the following: how do I use a regex in Java so that certain punctuation is not removed when it is part of a word, such as the ' in "won't", while otherwise keeping this pattern:

"[\\s+\\/,\\.!_\\-?;:]++"

You can use

String[] w = s.split("[\\s+/,.!_\\-?;:]+|\\B'|'\\B");

See the regex demo. Details:

  • [\\s+/,.!_\\-?;:]+ - one or more characters that are whitespace or any of + / , . ! _ - ? ; : (inside a character class, + is a literal plus sign)
  • | - or
  • \\B' - a ' that is at the start of the string or immediately preceded by a non-word char
  • | - or
  • '\\B - a ' that is at the end of the string or immediately followed by a non-word char.
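To see the two apostrophe alternatives in isolation, here is a minimal sketch that applies only \\B'|'\\B to a test string. The class name, the method name and the sample input are made up for illustration; they are not part of the original answer.

```java
public class ApostropheBoundaryDemo {
    // Removes only apostrophes that are NOT between two word characters,
    // using the two apostrophe alternatives from the pattern above.
    static String stripOuterApostrophes(String s) {
        return s.replaceAll("\\B'|'\\B", "");
    }

    public static void main(String[] args) {
        // The quotes around "quoted" and the trailing ' in "dogs'" are removed;
        // the ' inside "don't" has a word character on both sides and is kept.
        System.out.println(stripOuterApostrophes("'quoted' don't dogs'"));
        // prints: quoted don't dogs
    }
}
```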

See the Java demo:

// Requires: import java.util.Arrays;
String s = "  //wont won't won't ";
String[] w = s.split("[\\s+/,.!_\\-?;:]+|\\B'|'\\B");
System.out.println(Arrays.toString(w));
// => [, wont, won't, won't]

You can get rid of the empty entry at the start by first removing all matches at the start of the string:

String regex = "[\\s+/,.!_\\-?;:]+|\\B'|'\\B";
String[] w2 = s.replaceFirst("^(?:"+regex+")+", "").split(regex);
System.out.println(Arrays.toString(w2));
// => [wont, won't, won't]
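One caveat with the pattern as written: an apostrophe that is stripped right next to another delimiter (as in  'quoted' ) is a separate match from the neighbouring whitespace, so split returns an empty string between the two matches. A possible variation, sketched below, is to group the alternatives and repeat the whole group, much as the replaceFirst call above already does. The class and method names here are hypothetical.

```java
import java.util.Arrays;

public class GroupedSplitDemo {
    // Same alternatives as in the answer, but grouped and repeated so that a
    // run of adjacent delimiters (e.g. a space followed by ') is one match.
    static final String DELIMS = "(?:[\\s+/,.!_\\-?;:]|\\B'|'\\B)+";

    static String[] words(String s) {
        // Strip a leading run of delimiters first to avoid an empty first entry.
        return s.replaceFirst("^" + DELIMS, "").split(DELIMS);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(words("  //wont 'quoted' won't ")));
        // => [wont, quoted, won't]
    }
}
```

For the original input "  //wont won't won't " this gives the same result as above, [wont, won't, won't].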


 