Java's .split() method corner cases: what are they?

ANSWER GIVEN, SEE BELOW -- moral: never call .split() alone; if you want sane behaviour, always give it a limit argument of -1. But not 0!

The javadoc for Pattern.split() states the following:

The array returned by this method contains each substring of the input sequence that is terminated by another subsequence that matches this pattern or is terminated by the end of the input sequence.

Witness this code:

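import java.util.regex.Pattern;

// The class name is arbitrary; it only makes the snippet compile.
public final class SplitTest
{
    private static final Pattern UNDERSCORE = Pattern.compile("_");

    public static void main(final String... args)
    {
        System.out.println(UNDERSCORE.split("_").length);
    }
}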

Now, referring to the javadoc, the array should contain substrings of the input which are either (quoting):

  • "terminated by another subsequence that matches this pattern": well, there is one -- the empty string right before the underscore (which UNDERSCORE obviously matches);
  • or "is terminated by the end of the input sequence": there is one too: the empty string right after the underscore.

Yet, the above code prints 0. Why? Is this a known bug? (imnsho yes, see below) What are other cases where .split() does not obey its contract? (again, see below)

THE ANSWER (right below this explanatory text)

When using a Pattern, the single-argument .split() method is equivalent to calling the two-argument method with 0 as the limit.

And this is where the bug lies. With a limit of 0, all empty strings from the end of the array "down to" the first non-empty element are removed from the result.

If, prior to reading this, you didn't know what a braindead design decision was, now you know. And it is all the more dangerous that this is the default behaviour.

The solution is to always use the full form of the .split() method and give it a negative limit argument. Here, -1 is chosen. And in this case, .split() behaves sanely:

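import java.util.regex.Pattern;

// The class name is arbitrary; it only makes the snippet compile.
public final class SplitTest
{
    private static final Pattern UNDERSCORE = Pattern.compile("_");

    public static void main(final String... args)
    {
        // Single-argument form: same as a limit of 0, trailing empty strings dropped
        System.out.println(UNDERSCORE.split("_").length);
        System.out.println(UNDERSCORE.split("__").length);
        System.out.println(UNDERSCORE.split("_x_").length);
        // Negative limit: trailing empty strings kept
        System.out.println(UNDERSCORE.split("_", -1).length);
        System.out.println(UNDERSCORE.split("__", -1).length);
        System.out.println(UNDERSCORE.split("_x_", -1).length);
    }
}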

Output:

0 # BUG!
0 # BUG!
2 # BUG!
2 # OK
3 # OK
3 # OK

From the same documentation:

The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array.

If n [the limit] is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.
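To make the three cases of the limit concrete, here is a minimal sketch comparing a positive, a zero and a negative limit on the same input. The class name SplitLimitDemo, the helper pieces() and the input "a_b__" are made up for illustration; only Pattern.split(CharSequence, int) comes from the JDK.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Illustrative only: class, helper and input are assumptions, not from the javadoc.
public class SplitLimitDemo {
    private static final Pattern UNDERSCORE = Pattern.compile("_");

    // Thin wrapper around the two-argument split, so the limit is always explicit.
    static String[] pieces(String input, int limit) {
        return UNDERSCORE.split(input, limit);
    }

    public static void main(String[] args) {
        String input = "a_b__";

        // Positive limit: the pattern is applied at most limit - 1 times,
        // and the last element holds the rest of the input unsplit.
        System.out.println(pieces(input, 2).length);  // 2 elements: "a", "b__"

        // Zero limit: split everywhere, then drop trailing empty strings.
        System.out.println(pieces(input, 0).length);  // 2 elements: "a", "b"

        // Negative limit: split everywhere and keep everything.
        System.out.println(pieces(input, -1).length); // 4 elements: "a", "b", "", ""

        System.out.println(Arrays.toString(pieces(input, -1)));
    }
}
```

Note that the limits 2 and 0 happen to give arrays of the same length here, but with different contents; the length alone does not tell you which rule was applied.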

The default value for the limit is indeed 0:

 public String[] split(CharSequence input) 

...

This method works as if by invoking the two-argument split method with the given input sequence and a limit argument of zero.

Thus, the empty string is discarded.

If you want to keep the empty strings, use UNDERSCORE.split("_", -1), or any other negative limit.


EDIT: To clear up confusion: with a negative limit, the returned array would, according to your reasoning, be this:

[ "" , "" ]

With a limit of zero, all trailing empty strings are removed. The last element is an empty string, so it is removed. Then, you have:

[ "" ]

The last element is again an empty string, so it is removed as well.

In other words, trailing refers not to trailing in the initial string, but trailing in the final array.
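One way to check this reading is to do the removal by hand: split with a negative limit, strip empty strings from the end of the array, and compare with what a limit of 0 returns. The sketch below does that for a few inputs in which the pattern matches at least once. The class TrailingEmptyDemo and the helper stripTrailingEmpty() are hypothetical names, not part of the JDK.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Illustrative only: the class and the helper are assumptions made for this sketch.
public class TrailingEmptyDemo {
    private static final Pattern UNDERSCORE = Pattern.compile("_");

    // Hypothetical helper: removes empty strings from the END of the array only,
    // stopping at the first non-empty element (or when nothing is left).
    static String[] stripTrailingEmpty(String[] parts) {
        int end = parts.length;
        while (end > 0 && parts[end - 1].isEmpty()) {
            end--;
        }
        return Arrays.copyOf(parts, end);
    }

    public static void main(String[] args) {
        for (String input : new String[] {"_", "__", "_x_", "x_"}) {
            String[] everything = UNDERSCORE.split(input, -1);
            String[] withZeroLimit = UNDERSCORE.split(input, 0);
            // For these inputs, stripping by hand gives the same array as limit 0.
            System.out.println(input + " -> "
                    + Arrays.equals(stripTrailingEmpty(everything), withZeroLimit));
        }
    }
}
```

For "_" the full array is [ "" , "" ], both elements are trailing empty strings, and so nothing survives, which is why the length printed in the question is 0.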


See also:

"terminated by another subsequence that matches this pattern": well, there is one -- the empty string right before the underscore (which UNDERSCORE obviously matches);

No, it doesn't - the empty string before the pattern does not match '_'
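For what it is worth, the empty string before the underscore does not have to match the pattern itself; it only has to be terminated by a match, and .split() does return such a leading empty string when something non-empty follows it. A small sketch, with a made-up class name LeadingEmptyDemo and helper split():

```java
import java.util.regex.Pattern;

// Illustrative only: class and helper names are assumptions made for this sketch.
public class LeadingEmptyDemo {
    private static final Pattern UNDERSCORE = Pattern.compile("_");

    // Single-argument split, i.e. the default limit of 0.
    static String[] split(String input) {
        return UNDERSCORE.split(input);
    }

    public static void main(String[] args) {
        // Leading empty string is kept: the result is "", "x".
        System.out.println(split("_x").length); // 2

        // Trailing empty string is dropped: the result is just "x".
        System.out.println(split("x_").length); // 1
    }
}
```

So the asymmetry is between the start and the end of the array, not between matching and non-matching substrings.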
