
How to split a string, keeping only certain delimiters?

I have a question similar to "How to split a string, but also keep the delimiters?". How would I split a String using a regex, keeping some types of delimiters but not others? Specifically, I want to keep the non-whitespace delimiters, but not the whitespace delimiters.

To make this concrete:

"a;b c"        | ["a", ";", "b", "c"]
"a; ; bb c ;d" | ["a", ";", ";", "bb", "c", ";", "d"]

Can this be done cleanly with a regex, and if so how?

Right now I'm working around this by splitting on the character to keep, and then again on the other one. I can stick with this approach if the regex cannot do so, or cannot do so cleanly:

Arrays.stream(input.split("((?<=;)|(?=;))"))
        .flatMap(s -> Arrays.stream(s.split("\\s+")))
        .filter(s -> !s.isEmpty())
        .toArray(String[]::new); // In practice, I would generally use .collect(Collectors.toList()) instead

I suggest capturing what you want instead of splitting, using this simple pattern:

([^; ]+|;)
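
As a rough sketch of how the capture approach might look in Java: instead of calling split, iterate over the matches of the token pattern with a Matcher. The class and method names (TokenCapture, tokenize) are made up for illustration, and the sketch assumes \s in place of the literal space so that tabs and newlines are skipped too.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
final class TokenCapture {
    // A token is either a run of characters that are neither ';' nor
    // whitespace, or a single ';'. (Assumption: \s replaces the literal space.)
    private static final Pattern TOKEN = Pattern.compile("[^;\\s]+|;");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group()); // whitespace never matches, so it is simply skipped
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a;b c"));        // [a, ;, b, c]
        System.out.println(tokenize("a; ; bb c ;d")); // [a, ;, ;, bb, c, ;, d]
    }
}
```

Because whitespace is never part of a match, leading or trailing whitespace needs no special handling here.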


You can do it this way:

System.out.println(String.join("-", "a; ; b c ;d".split("(?!\\G) *(?=;)|(?<=;) *| +")));

details:

(?!\G)   # not contiguous to a previous match and not at the start of the string
[ ]*     # optional spaces
(?=;)    # followed by a ;
|    # OR
(?<=;)   # preceded by a ;
[ ]*     # optional spaces
|    # OR
[ ]+     # several spaces 

Feel free to change the literal space to \\s. To avoid an empty item at the beginning of the resulting array when the string starts with whitespace, trim the string first.
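
A minimal sketch of both suggestions combined, assuming the \\s variant of the pattern above and a trim() before splitting. The class and method names (TrimThenSplit, tokenize) are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper illustrating "trim first, then split".
final class TrimThenSplit {
    // Same three alternatives as the pattern above, with the literal space
    // replaced by \s (an assumption for this sketch).
    private static final String DELIMITERS = "(?!\\G)\\s*(?=;)|(?<=;)\\s*|\\s+";

    static String[] tokenize(String input) {
        // trim() removes leading whitespace, which would otherwise produce
        // an empty first element.
        return input.trim().split(DELIMITERS);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("  a; ; bb c ;d ")));
    }
}
```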

Obviously, without the constraint of splitting, @alphabravo's approach is the simplest.

I found a regex that works:

(\\s+)|((?<=;)(?=\\S)|(?<=\\S)(?=;))
public static void main(String[] args) {
    System.out.println(Arrays.toString("a; ; b c ;d"
        .split("(\\s+)|((?<=;)(?=\\S)|(?<=\\S)(?=;))")));
}

Will print out:

[a, ;, ;, b, c, ;, d]

You want to split on whitespace, or between a word character and a non-word character:

str.split("\\s+|(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w)");
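
A hedged sketch of this approach in use. As far as I can tell, the zero-width alternatives can also match immediately after a whitespace match, which leaves empty strings in the result, so the sketch filters them out. Also note that two adjacent semicolons (";;") are both non-word characters, so this pattern would not separate them. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the original answer.
final class WordBoundarySplit {
    private static final String DELIMITERS = "\\s+|(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w)";

    static List<String> tokenize(String input) {
        return Arrays.stream(input.split(DELIMITERS))
                .filter(s -> !s.isEmpty()) // drop empty pieces left by adjacent matches
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a; ; bb c ;d"));
    }
}
```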

After realizing that Java's split doesn't add captured delimiter
characters to the resulting array, I thought I'd try a split solution
that doesn't rely on that capability.

Basically there are only 4 permutations involving whitespace and the semicolon.
Finally, there is just the whitespace.

Here is the regex.

Raw: \s+(?=;)|(?<=;)\s+|(?<!\s)(?=;)|(?<=;)(?!\s)|\s+

Stringed: "\\s+(?=;)|(?<=;)\\s+|(?<!\\s)(?=;)|(?<=;)(?!\\s)|\\s+"

And the expanded regex with the permutations explained.
Good luck!

    \s+                  # Required, suck up wsp before ;
    (?= ; )              # ;

 |                     # or,

    (?<= ; )             # ;
    \s+                  # Required, suck up wsp after ;

 |                     # or,

    (?<! \s )            # No wsp before ;
    (?= ; )              # ;

 |                     # or,

    (?<= ; )             # ;
    (?! \s )             # No wsp after ;

 |                     # or,

    \s+                  # Required wsp
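
As a rough sketch, the five alternatives above could be used from Java like this. Each regex backslash is written as two backslashes in the Java string literal. The class and method names (PermutationSplit, tokenize) are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper wrapping the five-alternative pattern.
final class PermutationSplit {
    private static final String DELIMITERS =
            "\\s+(?=;)"          // whitespace before ';'
            + "|(?<=;)\\s+"      // whitespace after ';'
            + "|(?<!\\s)(?=;)"   // zero-width: ';' not preceded by whitespace
            + "|(?<=;)(?!\\s)"   // zero-width: ';' not followed by whitespace
            + "|\\s+";           // plain whitespace

    static String[] tokenize(String input) {
        return input.split(DELIMITERS);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("a; ; bb c ;d")));
        // prints [a, ;, ;, bb, c, ;, d]
    }
}
```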

Edit

To stop a split on whitespace at BOS, use this regex.

Raw: \s+(?=;)|(?<=;)\s+|(?<!\s)(?=;)|(?<=;)(?!\s)|(?<!^)(?<!\s)\s+

Stringed: "\\s+(?=;)|(?<=;)\\s+|(?<!\\s)(?=;)|(?<=;)(?!\\s)|(?<!^)(?<!\\s)\\s+"

Explained:

    \s+                  # Required, suck up wsp before ;
    (?= ; )              # ;

 |                     # or,

    (?<= ; )             # ;
    \s+                  # Required, suck up wsp after ;

 |                     # or,

    (?<! \s )            # No wsp before ;
    (?= ; )              # ;

 |                     # or,

    (?<= ; )             # ;
    (?! \s )             # No wsp after ;

 |                     # or,

    (?<! ^ )             # No split of wsp at BOS   
    (?<! \s )
    \s+                  # Required wsp

Borrowing @CasimiretHippolyte's \G trick, you may want to split on:

\\s+|(?!\\G)()

Note: no delimiters are specified.

Update

Based on avoiding split on very first spaces:

(?m)(?<!^|\\s)(\\s+|)(?!$)
