
In Java, how do I split a String by the null char without using regex?

I have code of the form

 String[] splitValues = s.split("\\u0000");

that is called a lot. When I profiled it, I saw that each call compiles and runs a regex (Pattern), and this was causing a significant performance impact.

I can easily compile the pattern just once, but running split still takes up significant CPU.

I then looked at the code for String.split(). It has optimizations for a regex that is a single char, or a backslash followed by a single char, but that is not working for me because I specify the null char as \\u0000, and I can't see how else to do it:

public String[] split(String regex, int limit) {
        /* fastpath if the regex is a
         (1)one-char String and this character is not one of the
            RegEx's meta characters ".$|()[{^?*+\\", or
         (2)two-char String and the first char is the backslash and
            the second is not the ascii digit or ascii letter.
         */
        char ch = 0;
        if (((regex.length() == 1 &&
             ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
             (regex.length() == 2 &&
              regex.charAt(0) == '\\' &&
              (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
              ((ch-'a')|('z'-ch)) < 0 &&
              ((ch-'A')|('Z'-ch)) < 0)) &&
            (ch < Character.MIN_HIGH_SURROGATE ||
             ch > Character.MAX_LOW_SURROGATE))
        {

How can I split by a null separator without needing to use a regular expression?

The "simple" way would be to precompile the regex:

static final Pattern NULL_SEPARATOR = Pattern.compile("\\u0000");

Then do what the last line of String.split() does, but with the precompiled pattern and your own string in place of this:

String[] parts = NULL_SEPARATOR.split(s);
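Put together, the precompiled approach might look like the sketch below. The class name PrecompiledNullSplit and the helper splitOnNull are made up for illustration; only Pattern.compile and Pattern.split come from the JDK.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class PrecompiledNullSplit {
    // Compiled once when the class loads, then shared by every call.
    private static final Pattern NULL_SEPARATOR = Pattern.compile("\\u0000");

    // Hypothetical helper: same result as the original split call,
    // minus the per-call pattern compilation.
    static String[] splitOnNull(String s) {
        return NULL_SEPARATOR.split(s);
    }

    public static void main(String[] args) {
        String[] parts = splitOnNull("alpha\0beta\0gamma");
        System.out.println(Arrays.toString(parts)); // [alpha, beta, gamma]
    }
}
```

This only removes the compile step. The matching itself still goes through the regex engine, which is the remaining cost mentioned in the question.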

Or you could skip regex entirely and collect the pieces into a List using indexOf:

List<String> parts = new ArrayList<>();
for (int i = 0; i < input.length();) {
  int start = i;
  i = input.indexOf('\0', start);
  if (i < 0) i = input.length();

  parts.add(input.substring(start, i));

  if (i < input.length()) {
    ++i;
  }
}

Of course, this gives you a List<String> , rather than a String[] ; this may or may not work for you. This has the convenience of growing the collection for you, but you can do that yourself with a String[] too.

Depending on what profiling shows, you might want to consider pre-sizing the list, for example by iterating over the characters and counting the \0 characters in a first pass.
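As a rough sketch of that two-pass idea, the version below counts the separators first and then fills an exactly sized String[]. The names NullSplitter and splitOnNull are hypothetical. Note one assumption that differs from String.split(): trailing empty strings are kept here rather than dropped.

```java
import java.util.Arrays;

public class NullSplitter {
    // Splits on the NUL character without any regex.
    // Unlike String.split(regex), trailing empty strings are kept.
    static String[] splitOnNull(String input) {
        // First pass: count separators so the array can be sized exactly.
        int count = 1;
        for (int i = 0; i < input.length(); i++) {
            if (input.charAt(i) == '\0') {
                count++;
            }
        }

        // Second pass: cut out each piece between consecutive separators.
        String[] parts = new String[count];
        int start = 0;
        for (int k = 0; k < count - 1; k++) {
            int end = input.indexOf('\0', start);
            parts[k] = input.substring(start, end);
            start = end + 1;
        }
        parts[count - 1] = input.substring(start); // remainder after the last separator
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnNull("alpha\0beta\0gamma"))); // [alpha, beta, gamma]
    }
}
```

Whether the extra pass pays off depends on the input sizes, so it is worth measuring before committing to it.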

Replacing

String[] splitValues = s.split("\\u0000");

with

String[] splitValues = s.split("\0");

continues to work, but importantly it allows String.split() to use its fastpath, so the split no longer requires a regular expression.
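To see why the one-char form qualifies, here is a small sketch that copies the fastpath condition from the split() excerpt quoted in the question into a standalone method. The class FastPathCheck and the method qualifiesForFastPath are hypothetical helpers, not JDK API, and other JDK versions may use a different condition.

```java
public class FastPathCheck {
    // Mirrors the condition from the quoted String.split() source.
    // Illustrative only; the real check lives inside the JDK.
    static boolean qualifiesForFastPath(String regex) {
        char ch = 0;
        return ((regex.length() == 1
                    && ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1)
                || (regex.length() == 2
                    && regex.charAt(0) == '\\'
                    && (((ch = regex.charAt(1)) - '0') | ('9' - ch)) < 0
                    && ((ch - 'a') | ('z' - ch)) < 0
                    && ((ch - 'A') | ('Z' - ch)) < 0))
            && (ch < Character.MIN_HIGH_SURROGATE
                || ch > Character.MAX_LOW_SURROGATE);
    }

    public static void main(String[] args) {
        System.out.println(qualifiesForFastPath("\0"));      // true: one char, not a metacharacter
        System.out.println(qualifiesForFastPath("\\u0000")); // false: six chars long
        System.out.println(qualifiesForFastPath("|"));       // false: regex metacharacter
        System.out.println(qualifiesForFastPath("\\|"));     // true: backslash plus non-alphanumeric
    }
}
```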

What I am finding slightly confusing is why I had \\ originally: doesn't that mean the \ is treated as a literal backslash, so that u0000 would not be treated as a Unicode char?
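One way to look at that confusion: with the doubled backslash, the Java compiler does produce a literal backslash, so the string holds six characters, and it is the regex engine (not the compiler) that then reads the backslash-u sequence as the NUL character. The sketch below compares the two forms; the class EscapeLevels and the helper sameSplit are made-up names for illustration.

```java
import java.util.Arrays;

public class EscapeLevels {
    // Hypothetical helper: true when both separator forms split s identically.
    static boolean sameSplit(String s) {
        return Arrays.equals(s.split("\\u0000"), s.split("\0"));
    }

    public static void main(String[] args) {
        String regexEscape = "\\u0000"; // six chars: backslash, 'u', four zeros
        String nulLiteral = "\0";       // one char: the NUL character itself

        System.out.println(regexEscape.length()); // 6
        System.out.println(nulLiteral.length());  // 1

        // The regex engine decodes the six-char form, so both find the same separator.
        System.out.println(sameSplit("alpha\0beta")); // true
    }
}
```

So both versions split on the same character; the difference is only whether String.split() has to involve the regex engine to work that out.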

There is an open source Java library, MgntUtils, that has a utility that converts Strings to Unicode sequences and vice versa:

String result = "Hello World";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);

The output of this code is:

\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
Hello World

So I suggest writing something simple like this:

public static final String DELIMITER = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString("\\u0000");
...
String[] splitValues = s.split(DELIMITER);

This will allow you to run the split() method without regex, as DELIMITER will hold the null character as a one-char String.
The library can be found at Maven Central or on GitHub. It comes as a Maven artifact, with sources and javadoc.

Here is javadoc for the class StringUnicodeEncoderDecoder


 