
Split a string and separate by punctuation and whitespace

I have some strings, for example: I: am a string, with "punctuation". I want to split the string like:

["I", ":", "am", "a", "string", ",", "with", "\"", "punctuation", "\"", "."]

I tried text.split("[\\p{Punct}\\s]+"), but that drops the punctuation: the result is I, am, a, string, with, punctuation ...

I found this solution, but Java doesn't allow me to split by \\w.

Use this regex:

"\\s+|(?=\\p{Punct})|(?<=\\p{Punct})"

The result on your string:

["I", ":", "am", "a", "string", ",", "with", "", "\"", "punctuation", "\"", "."]

Unfortunately, there is an extra element: the "" after "with". These extra elements occur only (and always) when a punctuation character follows a whitespace character, so you can avoid them by stripping that whitespace before splitting: use myString.replaceAll("\\s+(?=\\p{Punct})", "").split(regex) instead of myString.split(regex).
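As a minimal sketch of the behaviour described above, the following compares the plain split with the strip-then-split variant. The class name PunctSplitDemo and the helper names naiveSplit and cleanSplit are hypothetical and chosen only for illustration; the two regexes are the ones given in the answer.

```java
import java.util.Arrays;

class PunctSplitDemo {
    // Regex from the answer: whitespace runs, or the empty position
    // just before or just after a punctuation character.
    static final String REGEX = "\\s+|(?=\\p{Punct})|(?<=\\p{Punct})";

    // Plain split: leaves an empty string wherever whitespace precedes punctuation.
    static String[] naiveSplit(String text) {
        return text.split(REGEX);
    }

    // Remove whitespace that is directly followed by punctuation, then split.
    static String[] cleanSplit(String text) {
        return text.replaceAll("\\s+(?=\\p{Punct})", "").split(REGEX);
    }

    public static void main(String[] args) {
        String text = "I: am a string, with \"punctuation\".";
        System.out.println(Arrays.toString(naiveSplit(text)));
        System.out.println(Arrays.toString(cleanSplit(text)));
    }
}
```

On the example string, naiveSplit should return twelve elements with an empty string after "with", and cleanSplit the eleven tokens asked for in the question.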

How this works:

  • \\s+ splits on a run of whitespace: the whitespace characters are removed and the string is split at that location. (Note: I am assuming that a string such as hello  world, with several spaces between the words, should result in ["hello", "world"] rather than ["hello", "", "world"].)
  • (?=\\p{Punct}) is a lookahead that splits if the next character is a punctuation character, without removing that character.
  • (?<=\\p{Punct}) is a lookbehind that splits if the previous character is a punctuation character, again without removing it.
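To see what each alternative does in isolation, here is a small sketch that applies them one at a time to short inputs. The class SplitAlternativesDemo and its split helper are hypothetical names used only for this illustration, and the test strings are made up.

```java
import java.util.Arrays;

class SplitAlternativesDemo {
    // Thin wrapper so each alternative can be tried on its own.
    static String[] split(String text, String regex) {
        return text.split(regex);
    }

    public static void main(String[] args) {
        String spaced = "hello \tworld"; // a space followed by a tab

        // \s+ consumes the whole whitespace run as a single delimiter.
        System.out.println(Arrays.toString(split(spaced, "\\s+")));

        // \s treats each whitespace character as its own delimiter,
        // which leaves an empty string between the two words.
        System.out.println(Arrays.toString(split(spaced, "\\s")));

        // Lookahead: splits before the comma, which stays on the right-hand piece.
        System.out.println(Arrays.toString(split("a,b", "(?=\\p{Punct})")));

        // Lookbehind: splits after the comma, which stays on the left-hand piece.
        System.out.println(Arrays.toString(split("a,b", "(?<=\\p{Punct})")));
    }
}
```

Because the lookarounds match an empty position rather than a character, nothing is consumed; combining both is what isolates each punctuation character as its own element.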

EDIT:

In response to your comment, this regex should allow punctuation within words:

"\\s+|(?=\\W\\p{Punct}|\\p{Punct}\\W)|(?<=\\W\\p{Punct}|\\p{Punct}\\W})"

For this one, you don't need the replaceAll; simply call myString.split(regex).

How it works:

This regex is very similar, but the lookarounds have changed. \\W\\p{Punct} matches a non-word character followed by a punctuation character. \\p{Punct}\\W matches a punctuation character followed by a non-word character. So each lookaround matches if and only if there is a punctuation character that is not in the middle of a word.
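A short sketch of the word-internal case, using the pattern copied as posted in the edit above. The class name InWordPunctDemo, the helper splitKeepingInnerPunct, and the sample sentence are assumptions made for illustration only.

```java
import java.util.Arrays;

class InWordPunctDemo {
    // Pattern copied as posted in the answer's edit.
    static final String REGEX =
        "\\s+|(?=\\W\\p{Punct}|\\p{Punct}\\W)|(?<=\\W\\p{Punct}|\\p{Punct}\\W})";

    // No replaceAll step here: the answer says split alone is enough for this pattern.
    static String[] splitKeepingInnerPunct(String text) {
        return text.split(REGEX);
    }

    public static void main(String[] args) {
        // The apostrophe sits between two word characters, so "don't" stays whole;
        // the comma is followed by a space, so it becomes its own element.
        System.out.println(Arrays.toString(splitKeepingInnerPunct("don't stop, now")));
    }
}
```

For this input the result should be the four elements don't, stop, the comma, and now.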

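Or match the tokens instead of splitting. The loop below prints each token; to collect them, add mat.group() to an ArrayList inside the loop: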

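    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String s = "I: am a string, with \"punctuation\".";
    // A token is either a run of word characters or a single non-whitespace character.
    Pattern pat = Pattern.compile("\\w+|\\S");

    Matcher mat = pat.matcher(s);
    // find() advances to the next token on each call.
    while (mat.find()) {
        System.out.print(mat.group() + "/");
    }
    System.out.println();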

Output:

    I/:/am/a/string/,/with/"/punctuation/"/./
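A possible way to wrap the matching approach so that it returns a list rather than printing is sketched below. The class TokenCollector and the method tokenize are hypothetical names; the pattern is the one from the snippet above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenCollector {
    // Same pattern as above: a run of word characters, or one non-whitespace character.
    private static final Pattern TOKEN = Pattern.compile("\\w+|\\S");

    // Collects every match, in order, into a list.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("I: am a string, with \"punctuation\"."));
    }
}
```

Matching avoids the empty-element problem entirely because whitespace is never part of a token. Note that this pattern does split punctuation inside words: "don't" comes back as don, ', t.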


 