
Java Regex Quantifiers in String Split

The code:

String s = "a12ij";

System.out.println(Arrays.toString(s.split("\\d?")));

The output is [a, , , i, j], which confuses me. If the expression is greedy, shouldn't it try to match as much as possible, thereby splitting on each digit? I would assume that the output should be [a, , i, j] instead. Where is that extra empty string coming from?

The pattern you're using only matches one digit at a time:

\d    match a digit [0-9]
 ?    matches between zero and one time (greedy)

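Since your string has two consecutive digits, split() treats each of them as a separate delimiter. To treat a run of digits as a single delimiter you need the + quantifier. Be careful with the lazy form, though: with nothing after it in the pattern, it stops as soon as it has one digit, so it still splits on each digit individually:

\d    match a digit [0-9]
+?    matches between one and unlimited times, as few as possible (lazy)

Or you could just do:

\d    match a digit [0-9]
 +    matches between one and unlimited times, as many as possible (greedy)

The greedy form consumes the whole run of digits and is probably closest to what you want, although that's unclear from the question.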
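To compare the patterns discussed so far, here is a minimal sketch that runs each one through split() on the string from the question. The class and method names are made up for illustration, and the output comments assume Java 8 or later (older versions may add a leading empty string for the first pattern).

```java
import java.util.Arrays;

class QuantifierSplitDemo {
    // Hypothetical helper: splits the input on the regex and renders the
    // result the same way the question does.
    static String splitToString(String input, String regex) {
        return Arrays.toString(input.split(regex));
    }

    public static void main(String[] args) {
        String s = "a12ij";
        // Zero or one digit: every position yields a delimiter.
        System.out.println(splitToString(s, "\\d?"));  // [a, , , i, j]
        // Lazy one-or-more: still takes one digit per match.
        System.out.println(splitToString(s, "\\d+?")); // [a, , ij]
        // Greedy one-or-more: the run "12" is a single delimiter.
        System.out.println(splitToString(s, "\\d+"));  // [a, ij]
    }
}
```

Only the patterns that can match the empty string split "ij" into separate characters; the + forms need at least one digit, so "ij" stays together.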

Explanation:

Because the token \\d carries the ? quantifier, the pattern matches either one digit or nothing at all. split() therefore finds a delimiter at every position: a zero-width match wherever no digit follows (zero), and a one-character match on each digit (once).

You can picture it something like this:

    a,1,2,i,j    // each character represents (zero) and is split
      | |
    a, , ,i,j    // digit 1 and 2 are each matched (once)

Digits 1 and 2 were matched as delimiters, so they are tossed out; split() never returns the text it splits on. The commas around them remain from the split, however, which produces the two empty strings.


If you're specifically looking to have your result as a, ,i,j then I'll give you a hint. You'll want to ( capture the \\d igits as a group between one and unlimited times + ) and follow the group with the greedy quantifier ? . I recommend visiting one of the popular regex sites that let you experiment with patterns and quantifiers; it's a great way to learn and can teach you a lot!

The solution can be found here
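Read literally, the hint above spells out the pattern (\d+)?, an optional group holding a greedy run of digits. The sketch below tries it on the string from the question; the class and method names are made up for illustration, and the output comment assumes Java 8 or later.

```java
import java.util.Arrays;

class OptionalGroupSplitDemo {
    // Hypothetical helper: splits on an optional run of digits, i.e. the
    // pattern assembled from the hint.
    static String[] splitOnOptionalDigitRun(String input) {
        return input.split("(\\d+)?");
    }

    public static void main(String[] args) {
        // "12" is consumed as one delimiter; the zero-width matches before
        // "i", before "j", and at the end still split the remaining text.
        System.out.println(Arrays.toString(splitOnOptionalDigitRun("a12ij"))); // [a, , i, j]
    }
}
```

The single empty string comes from the zero-width match that sits right after "12", before the i.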

The javadoc for split() is not clear on what happens when a pattern can match the empty string. My best guess is that the delimiters found by split() are the ones that successive find() calls on a Matcher would find. The javadoc for find() says:

This method starts at the beginning of this matcher's region, or, if a previous invocation of the method was successful and the matcher has not since been reset, at the first character not matched by the previous match.

So if the string is "a12ij" and the pattern matches either a single digit or an empty string, then find() should find the following:

  • Empty string starting at position 0 (before a)
  • The string "1"
  • The string "2"
  • Empty string starting at position 3 (before i). This is because "the first character not matched by the previous match" is the i.
  • Empty string starting at position 4 (before j).
  • Empty string starting at position 5 (at the end of the string).

Mark each match found with an x, where an x under a blank means the match is an empty string:

  a   1   2   i   j
x     x   x x   x   x

Now if we look at the substrings between the x's, they are "a", "", "", "i", "j", as you are seeing. (The substring before the first empty string is not returned, because the split() javadoc says "A zero-width match at the beginning however never produces such empty leading substring." [Note that this behavior is new in Java 8.] Also, split() with the default limit doesn't return empty trailing substrings.)
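One way to check the list of matches above is to call find() in a loop and record where each match starts and ends. This is a minimal sketch with made-up class and method names; it shows what Matcher reports for this pattern and input, and does not reproduce the internals of split().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindTraceDemo {
    // Hypothetical helper: records every match as start-end:"text".
    static List<String> traceMatches(String input, String regex) {
        List<String> trace = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            trace.add(m.start() + "-" + m.end() + ":\"" + m.group() + "\"");
        }
        return trace;
    }

    public static void main(String[] args) {
        // Prints one line per match:
        // 0-0:""
        // 1-2:"1"
        // 2-3:"2"
        // 3-3:""
        // 4-4:""
        // 5-5:""
        for (String match : traceMatches("a12ij", "\\d?")) {
            System.out.println(match);
        }
    }
}
```

The six matches line up with the six x marks in the diagram: four zero-width matches and the two digits.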

I'd have to look at the code for split() to confirm this behavior. But it makes sense looking at the Matcher javadoc and it is consistent with the behavior you're seeing.

I've confirmed from the source that split() does rely on Matcher and find(), except for an optimization for the common case of splitting on a single known character. So that explains the behavior.


 