
Performance of StringTokenizer class vs. String.split method in Java

In my software I need to split a string into words. I currently have more than 19,000,000 documents, each with more than 30 words.

Which of the following two approaches is better in terms of performance?

StringTokenizer sTokenize = new StringTokenizer(s, " ");
while (sTokenize.hasMoreTokens()) {
    String word = sTokenize.nextToken();
}

or

String[] splitS = s.split(" ");
for (int i = 0; i < splitS.length; i++) {
    String word = splitS[i];
}

If your data is already in a database and you need to parse the strings of words, I would suggest using indexOf repeatedly. It's many times faster than either solution.
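A minimal sketch of that indexOf approach, assuming a single-space delimiter (the class and method names here are mine, not from the original post):

import java.util.ArrayList;
import java.util.List;

public class IndexOfSplitter {

    // Splits on a single space by scanning with indexOf, avoiding the regex engine entirely.
    static List<String> splitOnSpaces(String s) {
        List<String> words = new ArrayList<String>();
        int pos = 0, end;
        while ((end = s.indexOf(' ', pos)) >= 0) {
            words.add(s.substring(pos, end));
            pos = end + 1;
        }
        words.add(s.substring(pos)); // the trailing word (empty if s ends with a space)
        return words;
    }

    public static void main(String[] args) {
        System.out.println(splitOnSpaces("one two three")); // prints [one, two, three]
    }
}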

However, getting the data from a database is still likely to be much more expensive than the splitting itself.

// Build a sample string of 60 space-separated six-digit "words".
StringBuilder sb = new StringBuilder();
for (int i = 100000; i < 100000 + 60; i++)
    sb.append(i).append(' ');
String sample = sb.toString();

int runs = 100000;
for (int i = 0; i < 5; i++) {
    {
        long start = System.nanoTime();
        for (int r = 0; r < runs; r++) {
            StringTokenizer st = new StringTokenizer(sample);
            List<String> list = new ArrayList<String>();
            while (st.hasMoreTokens())
                list.add(st.nextToken());
        }
        long time = System.nanoTime() - start;
        System.out.printf("StringTokenizer took an average of %.1f us%n", time / runs / 1000.0);
    }
    {
        long start = System.nanoTime();
        Pattern spacePattern = Pattern.compile(" ");
        for (int r = 0; r < runs; r++) {
            List<String> list = Arrays.asList(spacePattern.split(sample, 0));
        }
        long time = System.nanoTime() - start;
        System.out.printf("Pattern.split took an average of %.1f us%n", time / runs / 1000.0);
    }
    {
        long start = System.nanoTime();
        for (int r = 0; r < runs; r++) {
            List<String> list = new ArrayList<String>();
            int pos = 0, end;
            while ((end = sample.indexOf(' ', pos)) >= 0) {
                list.add(sample.substring(pos, end));
                pos = end + 1;
            }
        }
        long time = System.nanoTime() - start;
        System.out.printf("indexOf loop took an average of %.1f us%n", time / runs / 1000.0);
    }
}

prints

StringTokenizer took an average of 5.8 us
Pattern.split took an average of 4.8 us
indexOf loop took an average of 1.8 us
StringTokenizer took an average of 4.9 us
Pattern.split took an average of 3.7 us
indexOf loop took an average of 1.7 us
StringTokenizer took an average of 5.2 us
Pattern.split took an average of 3.9 us
indexOf loop took an average of 1.8 us
StringTokenizer took an average of 5.1 us
Pattern.split took an average of 4.1 us
indexOf loop took an average of 1.6 us
StringTokenizer took an average of 5.0 us
Pattern.split took an average of 3.8 us
indexOf loop took an average of 1.6 us

The cost of opening a file will be about 8 ms. As the files are so small, your cache may improve performance by a factor of 2-5x. Even so, it's going to spend ~10 hours opening files (19 million files x 8 ms is about 42 hours, or roughly 10 hours with a 4x cache speed-up). The cost of using split vs StringTokenizer is far less than 0.01 ms each. Parsing 19 million x 30 words x 8 letters per word (about 4.5 GB) should take about 10 seconds at roughly 1 GB per 2 seconds.

If you want to improve performance, I suggest you use far fewer files, e.g. use a database. If you don't want to use an SQL database, I suggest using one of these: http://nosql-database.org/

Split in Java 7 just calls indexOf for this input; see the source. Split should be very fast, close to repeated calls of indexOf.
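To illustrate (a small sketch of mine, assuming Java 7+ behaviour; the class name is hypothetical):

import java.util.regex.Pattern;

public class SplitFastPathDemo {

    // Compiled once and reused, so the regex cost is not paid on every call.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    public static void main(String[] args) {
        String line = "100000 100001 100002";

        // Single-character literal delimiter: String.split takes the
        // indexOf-based fast path and never builds a Pattern.
        String[] fast = line.split(" ");

        // Regex delimiter: String.split compiles a new Pattern on each call.
        String[] perCall = line.split("\\s+");

        // Reusing a precompiled Pattern avoids the repeated compilation.
        String[] reused = WHITESPACE.split(line);

        System.out.println(fast.length + " " + perCall.length + " " + reused.length);
    }
}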

The Java API specification recommends using split; see the documentation of StringTokenizer.

Another important thing, undocumented as far as I have noticed, is that asking the StringTokenizer to return the delimiters along with the tokens (by using the constructor StringTokenizer(String str, String delim, boolean returnDelims)) also reduces processing time. So, if you're looking for performance, I would recommend using something like:

private static final String DELIM = "#";

public void splitIt(String input) {
    StringTokenizer st = new StringTokenizer(input, DELIM, true);
    while (st.hasMoreTokens()) {
        String next = getNext(st);
        System.out.println(next);
    }
}

// Returns the next real token and consumes the delimiter that follows it,
// or null if the current token is itself a delimiter (an empty field).
private String getNext(StringTokenizer st) {
    String value = st.nextToken();
    if (DELIM.equals(value))
        value = null;
    else if (st.hasMoreTokens())
        st.nextToken(); // skip the trailing delimiter
    return value;
}

Despite the overhead introduced by the getNext() method, which discards the delimiters for you, it's still 50% faster according to my benchmarks.

Use split.

StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method instead.
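For example, replacing a StringTokenizer loop with split looks roughly like this (a minimal sketch; the variable names are mine):

import java.util.StringTokenizer;

public class TokenizerToSplit {
    public static void main(String[] args) {
        String line = "alpha beta gamma";

        // Legacy approach, discouraged by the StringTokenizer javadoc:
        StringTokenizer st = new StringTokenizer(line, " ");
        while (st.hasMoreTokens()) {
            System.out.println(st.nextToken());
        }

        // Recommended replacement using String.split:
        for (String word : line.split(" ")) {
            System.out.println(word);
        }
    }
}

Note that the two are not exactly equivalent: StringTokenizer silently skips runs of delimiters, while split(" ") returns empty strings for consecutive spaces.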

While running micro (and in this case, even nano) benchmarks, there is a lot that affects your results: JIT optimizations and garbage collection, to name just a few.

In order to get meaningful results out of microbenchmarks, check out the JMH library. It comes bundled with excellent samples showing how to write good benchmarks.
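For instance, a minimal JMH version of the comparison above might look like the sketch below (it assumes the jmh-core and jmh-generator-annprocess dependencies are on the classpath; the class and method names are mine):

import java.util.StringTokenizer;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
public class SplitBenchmark {

    private String sample;

    @Setup
    public void setUp() {
        // Same 60-word sample as the hand-rolled benchmark above.
        StringBuilder sb = new StringBuilder();
        for (int i = 100000; i < 100000 + 60; i++) {
            sb.append(i).append(' ');
        }
        sample = sb.toString();
    }

    @Benchmark
    public void split(Blackhole bh) {
        // Blackhole.consume prevents the JIT from optimising the work away.
        bh.consume(sample.split(" "));
    }

    @Benchmark
    public void tokenizer(Blackhole bh) {
        StringTokenizer st = new StringTokenizer(sample, " ");
        while (st.hasMoreTokens()) {
            bh.consume(st.nextToken());
        }
    }
}

By default JMH forks a fresh JVM and runs warm-up iterations before measuring, which addresses the JIT and GC effects mentioned above; with the Maven archetype you run the benchmarks via java -jar target/benchmarks.jar.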

What do the 19,000,000 documents have to do with it? Do you have to split the words in all of the documents on a regular basis, or is it a one-shot problem?

If you display/request one document at a time, with only 30 words, this is such a tiny problem that any method would work.

If you have to process all the documents at once, with only 30 words each, this is still such a tiny problem that you are more likely to be IO bound anyway.

Regardless of its legacy status, I would expect StringTokenizer to be significantly quicker than String.split() for this task, because it doesn't use regular expressions: it just scans the input directly, much as you would yourself via indexOf(). In general, String.split() has to compile the regex every time you call it (although, as noted above, Java 7+ takes an indexOf-style fast path for single-character literal delimiters), so it isn't even as efficient as using a regular expression directly yourself.

This may be a reasonable benchmark, using JDK 1.6.0:

http://www.javamex.com/tutorials/regular_expressions/splitting_tokenisation_performance.shtml#.V6-CZvnhCM8

Performance-wise, StringTokenizer is way better than split. Check the benchmark code below:

[benchmark code screenshot]

But according to the Java docs its use is discouraged; see the StringTokenizer documentation.
