Efficient parsing of integers from substrings in Java

AFAIK the standard Java libraries offer no efficient way to parse an integer from a substring without first allocating a new string that contains the substring.

I'm in a situation where I'm parsing millions of integers from strings, and I don't particularly want to create new strings for every substring. The copying is overhead I don't need.

Given a string s, I'd like a method like:

parseInteger(s, startOffset, endOffset)

with semantics like:

Integer.parseInt(s.substring(startOffset, endOffset))

Now, I know I can write this reasonably trivially like this:

public static int parse(String s, int start, int end) {
    long result = 0;
    boolean foundMinus = false;

    while (start < end) {
        char ch = s.charAt(start);
        if (ch == ' ')
            /* ok */;
        else if (ch == '-') {
            if (foundMinus)
                throw new NumberFormatException();
            foundMinus = true;
        } else if (ch < '0' || ch > '9')
            throw new NumberFormatException();
        else
            break;
        ++start;
    }

    if (start == end)
        throw new NumberFormatException();

    while (start < end) {
        char ch = s.charAt(start);
        if (ch < '0' || ch > '9')
            break;
        result = result * 10 + (int) ch - (int) '0';
        ++start;
    }

    while (start < end) {
        char ch = s.charAt(start);
        if (ch != ' ')
            throw new NumberFormatException();
        ++start;
    }
    if (foundMinus)
        result *= -1;
    if (result < Integer.MIN_VALUE || result > Integer.MAX_VALUE)
        throw new NumberFormatException();
    return (int) result;
}

But that's not the point. I'd rather get this from a tested, supported third-party library. For example, parsing longs and dealing properly with Long.MIN_VALUE is slightly subtle, and I cheat above by parsing ints into a long. Even so, the code above still overflows silently if the digit string represents a value bigger than Long.MAX_VALUE.
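The Long.MIN_VALUE subtlety mentioned above is usually handled by accumulating the value in the negative range, since the negative range of a long is one larger than the positive range. The following is only a minimal sketch of that idea: the class name SubstringLongParser and its parseLong method are hypothetical, not part of any library, and unlike the code above it does not tolerate surrounding spaces.

```java
final class SubstringLongParser {
    private SubstringLongParser() {}

    /**
     * Parses the decimal long in s[start, end) without allocating a substring.
     * Hypothetical helper for illustration; accepts an optional leading '+' or '-'.
     */
    static long parseLong(CharSequence s, int start, int end) {
        if (start < 0 || end > s.length() || start >= end) {
            throw new NumberFormatException("empty or invalid range");
        }
        int i = start;
        boolean negative = false;
        char first = s.charAt(i);
        if (first == '-' || first == '+') {
            negative = (first == '-');
            i++;
            if (i == end) {
                throw new NumberFormatException("sign without digits");
            }
        }
        // Accumulate as a negative number so that Long.MIN_VALUE is representable.
        long limit = negative ? Long.MIN_VALUE : -Long.MAX_VALUE;
        long multMin = limit / 10;
        long result = 0;
        while (i < end) {
            int digit = s.charAt(i++) - '0';
            if (digit < 0 || digit > 9) {
                throw new NumberFormatException("not a digit at index " + (i - 1));
            }
            // Check before multiplying and before subtracting, so nothing wraps around.
            if (result < multMin) {
                throw new NumberFormatException("overflow");
            }
            result *= 10;
            if (result < limit + digit) {
                throw new NumberFormatException("overflow");
            }
            result -= digit;
        }
        return negative ? result : -result;
    }
}
```

For example, SubstringLongParser.parseLong("x=-42;", 2, 5) reads the characters "-42" in place, and a digit string one past Long.MAX_VALUE is rejected with a NumberFormatException instead of wrapping around.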

Is there any such library?

My searching has turned up little.

Have you profiled your app? Have you located the source of your problem?

Since Strings are immutable, there is a good chance that very little memory is required and very few operations are performed to create a substring.

Unless you are really experiencing problems with memory, garbage collection, etc. just use the substring method. Don't seek complex solutions to problems you do not have.

Besides, if you implement something on your own, you may lose more than you gain in terms of efficiency. Your code does a lot and is quite complex. The default implementation, on the other hand, is almost certainly relatively fast and well tested.

I could not resist measuring the improvement of your method:

package test;

public class TestIntParse {

    static final int MAX_NUMBERS = 10000000;
    static final int MAX_ITERATIONS = 100;

    public static void main(String[] args) {
        long timeAvoidNewStrings = 0;
        long timeCreateNewStrings = 0;

        for (int i = 0; i < MAX_ITERATIONS; i++) {
            timeAvoidNewStrings += test(true);
            timeCreateNewStrings += test(false);
        }

        System.out.println("Average time method 'AVOID new strings': " + (timeAvoidNewStrings / MAX_ITERATIONS) + " ms");
        System.out.println("Average time method 'CREATE new strings': " + (timeCreateNewStrings / MAX_ITERATIONS) + " ms");
    }

    static long test(boolean avoidStringCreation) {
        long start = System.currentTimeMillis();

        for (int i = 0; i < MAX_NUMBERS; i++) {
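            // Parenthesize the product so the cast applies to the scaled value, not to Math.random() alone (which would always yield 0).
            String value = Integer.toString((int) (Math.random() * 100000));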
            int intValue = avoidStringCreation ? parse(value, 0, value.length()) : parse2(value, 0, value.length());
            String value2 = Integer.toString(intValue);
            if (!value2.equals(value)) {
                System.err.println("Error at iteration " + i + (avoidStringCreation ? " without" : " with") + " string creation: " + value + " != " + value2);
            }
        }

        return System.currentTimeMillis() - start;
    }

    public static int parse2(String s, int start, int end) {
        // parseInt returns a primitive int directly and avoids the boxing done by valueOf.
        return Integer.parseInt(s.substring(start, end));
    }

    public static int parse(String s, int start, int end) {
        long result = 0;
        boolean foundMinus = false;

        while (start < end) {
            char ch = s.charAt(start);
            if (ch == ' ')
                /* ok */;
            else if (ch == '-') {
                if (foundMinus)
                    throw new NumberFormatException();
                foundMinus = true;
            } else if (ch < '0' || ch > '9')
                throw new NumberFormatException();
            else
                break;
            ++start;
        }

        if (start == end)
            throw new NumberFormatException();

        while (start < end) {
            char ch = s.charAt(start);
            if (ch < '0' || ch > '9')
                break;
            result = result * 10 + ch - '0';
            ++start;
        }

        while (start < end) {
            char ch = s.charAt(start);
            if (ch != ' ')
                throw new NumberFormatException();
            ++start;
        }
        if (foundMinus)
            result *= -1;
        if (result < Integer.MIN_VALUE || result > Integer.MAX_VALUE)
            throw new NumberFormatException();
        return (int) result;
    }

}

The results:

Average time method 'AVOID new strings': 432 ms
Average time method 'CREATE new strings': 500 ms

Your method is roughly 14% faster and presumably uses less memory, but it is considerably more complex (and error-prone). From my point of view your approach does not pay off, though it might in your case.

Don't worry too much about the objects unless you experience actual performance problems. Use a current JVM; performance and memory overhead keep improving from release to release.
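On the point about using a current JVM: since Java 9, the standard library has an overload Integer.parseInt(CharSequence s, int beginIndex, int endIndex, int radix) that parses a range in place, which is close to the parseInteger(s, startOffset, endOffset) asked for. A small sketch, assuming Java 9 or later; the class RangeParseDemo and the method parseField are hypothetical names used only for illustration. Note that, unlike the hand-written parse method, this overload does not accept leading or trailing spaces.

```java
final class RangeParseDemo {
    private RangeParseDemo() {}

    /** Parses the decimal int in s[start, end) without creating a substring (Java 9+). */
    static int parseField(CharSequence s, int start, int end) {
        return Integer.parseInt(s, start, end, 10);
    }

    public static void main(String[] args) {
        String line = "id=12345;qty=-678";
        int id = parseField(line, 3, 8);    // characters "12345"
        int qty = parseField(line, 13, 17); // characters "-678"
        System.out.println(id + " " + qty);
    }
}
```

Long.parseLong has a matching overload with the same parameter order, so the same pattern applies to longs.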

You can have a look at the "ByteString" from Google protocol buffers if you want to have a substring sharing the underlying string:

https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/ByteString#substring%28int,%20int%29
