Efficient parsing of integers from substrings in Java

AFAIK there is no efficient way in the standard Java libraries to parse an integer from a substring without actually creating a new string containing the substring.

I'm in a situation where I'm parsing millions of integers from strings, and I don't particularly want to create a new string for every substring. The copying is overhead I don't need.

Given a string s, I'd like a method like:

parseInteger(s, startOffset, endOffset)

with semantics like:

Integer.parseInt(s.substring(startOffset, endOffset))

Now, I know I can write this fairly trivially myself, like this:

public static int parse(String s, int start, int end) {
    long result = 0;
    boolean foundMinus = false;

    while (start < end) {
        char ch = s.charAt(start);
        if (ch == ' ')
            /* ok */;
        else if (ch == '-') {
            if (foundMinus)
                throw new NumberFormatException();
            foundMinus = true;
        } else if (ch < '0' || ch > '9')
            throw new NumberFormatException();
        else
            break;
        ++start;
    }

    if (start == end)
        throw new NumberFormatException();

    while (start < end) {
        char ch = s.charAt(start);
        if (ch < '0' || ch > '9')
            break;
        result = result * 10 + (int) ch - (int) '0';
        ++start;
    }

    while (start < end) {
        char ch = s.charAt(start);
        if (ch != ' ')
            throw new NumberFormatException();
        ++start;
    }
    if (foundMinus)
        result *= -1;
    if (result < Integer.MIN_VALUE || result > Integer.MAX_VALUE)
        throw new NumberFormatException();
    return (int) result;
}

But that's not the point. I'd rather get this from a tested, supported third-party library. For example, parsing longs and dealing properly with Long.MIN_VALUE is slightly subtle, and I cheat above by parsing ints into longs. And the above still has an overflow issue if the parsed integer is bigger than Long.MAX_VALUE.
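The Long.MIN_VALUE subtlety can be sidestepped by accumulating the value as a negative number, because the negative range of a long is one larger than the positive range (the JDK's own Long.parseLong takes the same approach). Below is a minimal sketch of that idea for ASCII digits, not a tested library replacement. The names RangeParse and parseLong are hypothetical, and unlike the code above it rejects spaces, to match the semantics of Long.parseLong(s.substring(start, end)).

```java
final class RangeParse {
    /** Hypothetical helper: parses s[start, end) as a decimal long without allocating. */
    static long parseLong(CharSequence s, int start, int end) {
        if (start < 0 || end > s.length() || start >= end)
            throw new NumberFormatException("empty or invalid range");

        int i = start;
        boolean negative = false;
        long limit = -Long.MAX_VALUE;          // most negative value allowed for a positive number
        char first = s.charAt(i);
        if (first == '-') {
            negative = true;
            limit = Long.MIN_VALUE;            // a negative number may reach MIN_VALUE itself
            i++;
        } else if (first == '+') {
            i++;
        }
        if (i == end)
            throw new NumberFormatException("sign without digits");

        long multMin = limit / 10;             // below this, multiplying by 10 would overflow
        long result = 0;                       // accumulated as a NEGATIVE number
        while (i < end) {
            int digit = s.charAt(i++) - '0';
            if (digit < 0 || digit > 9)
                throw new NumberFormatException("not a digit at index " + (i - 1));
            if (result < multMin)
                throw new NumberFormatException("overflow");
            result *= 10;
            if (result < limit + digit)
                throw new NumberFormatException("overflow");
            result -= digit;
        }
        return negative ? result : -result;
    }
}
```

Because the checks happen before each multiplication and subtraction, the accumulator never wraps around, so values outside the long range are rejected rather than silently truncated.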

Is there any such library?

My searching has turned up little.

Have you profiled your app? Have you located the source of your problem?

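Since Strings are immutable, there is a good chance that very little memory is required and very few operations are performed to create a substring. (Older JVMs shared the parent string's character array; since Java 7u6, substring copies the characters, which is still cheap for strings as short as a number.)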

Unless you are really experiencing problems with memory, garbage collection, etc., just use the substring method. Don't seek complex solutions to problems you do not have.
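If the only goal is to skip the intermediate String, it is worth knowing that Java 9 and later include an overload, Integer.parseInt(CharSequence, int, int, int), that parses a range in place. The snippet below is a small sketch assuming a Java 9+ runtime; the class name SubrangeParseDemo, the helper parseField and the sample line are made up for illustration.

```java
final class SubrangeParseDemo {
    // Hypothetical wrapper around the JDK overload (available since Java 9):
    // Integer.parseInt(CharSequence s, int beginIndex, int endIndex, int radix)
    static int parseField(String s, int start, int end) {
        return Integer.parseInt(s, start, end, 10);
    }

    public static void main(String[] args) {
        String line = "id=12345;delta=-42";             // made-up sample input
        System.out.println(parseField(line, 3, 8));     // prints 12345
        System.out.println(parseField(line, 15, 18));   // prints -42
    }
}
```

This keeps the validation and overflow handling of Integer.parseInt, so a malformed or out-of-range field still throws NumberFormatException.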

Besides: if you implement something on your own, you may lose more than you gain in terms of efficiency. Your code does a lot and is quite complex; as for the default implementation, however, you may be quite certain that it is relatively fast. And error-free.

I could not resist measuring the improvement of your method:

package test;

public class TestIntParse {

    static final int MAX_NUMBERS = 10000000;
    static final int MAX_ITERATIONS = 100;

    public static void main(String[] args) {
        long timeAvoidNewStrings = 0;
        long timeCreateNewStrings = 0;

        for (int i = 0; i < MAX_ITERATIONS; i++) {
            timeAvoidNewStrings += test(true);
            timeCreateNewStrings += test(false);
        }

        System.out.println("Average time method 'AVOID new strings': " + (timeAvoidNewStrings / MAX_ITERATIONS) + " ms");
        System.out.println("Average time method 'CREATE new strings': " + (timeCreateNewStrings / MAX_ITERATIONS) + " ms");
    }

    static long test(boolean avoidStringCreation) {
        long start = System.currentTimeMillis();

        for (int i = 0; i < MAX_NUMBERS; i++) {
            // Parenthesized so the cast applies to the product; "(int) Math.random() * 100000" is always 0.
            String value = Integer.toString((int) (Math.random() * 100000));
            int intValue = avoidStringCreation ? parse(value, 0, value.length()) : parse2(value, 0, value.length());
            String value2 = Integer.toString(intValue);
            if (!value2.equals(value)) {
                System.err.println("Error at iteration " + i + (avoidStringCreation ? " without" : " with") + " string creation: " + value + " != " + value2);
            }
        }

        return System.currentTimeMillis() - start;
    }

    public static int parse2(String s, int start, int end) {
        return Integer.valueOf(s.substring(start, end));
    }

    public static int parse(String s, int start, int end) {
        long result = 0;
        boolean foundMinus = false;

        while (start < end) {
            char ch = s.charAt(start);
            if (ch == ' ')
                /* ok */;
            else if (ch == '-') {
                if (foundMinus)
                    throw new NumberFormatException();
                foundMinus = true;
            } else if (ch < '0' || ch > '9')
                throw new NumberFormatException();
            else
                break;
            ++start;
        }

        if (start == end)
            throw new NumberFormatException();

        while (start < end) {
            char ch = s.charAt(start);
            if (ch < '0' || ch > '9')
                break;
            result = result * 10 + ch - '0';
            ++start;
        }

        while (start < end) {
            char ch = s.charAt(start);
            if (ch != ' ')
                throw new NumberFormatException();
            ++start;
        }
        if (foundMinus)
            result *= -1;
        if (result < Integer.MIN_VALUE || result > Integer.MAX_VALUE)
            throw new NumberFormatException();
        return (int) result;
    }

}

The results:

Average time method 'AVOID new strings': 432 ms
Average time method 'CREATE new strings': 500 ms

Your method is roughly 14% faster and presumably uses less memory, though it is considerably more complex (and error-prone). From my point of view your approach does not pay off, though it might in your case.
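Two caveats about the measurement are worth keeping in mind. The timed loop also includes generating the random number and converting it to a string, and String.substring(0, s.length()) returns the string itself, so the "create new strings" branch may not copy anything at all. A rough sketch of a setup that avoids both follows; it is a hedged illustration rather than a rigorous benchmark (a harness such as JMH would be the more careful choice). The class ParseTiming, its methods and the "v=<digits>;" input shape are all hypothetical.

```java
import java.util.Random;

final class ParseTiming {
    // Hypothetical input shape "v=<digits>;": the number sits strictly inside
    // the string, so substring(2, length - 1) really has to create a new String.
    static String[] makeInputs(int n, long seed) {
        Random rnd = new Random(seed);
        String[] inputs = new String[n];
        for (int i = 0; i < n; i++)
            inputs[i] = "v=" + rnd.nextInt(100000) + ";";
        return inputs;
    }

    static long sumViaSubstring(String[] inputs) {
        long sum = 0;
        for (String s : inputs)
            sum += Integer.parseInt(s.substring(2, s.length() - 1));
        return sum;
    }

    // No validation here: the inputs are known to hold at most five ASCII digits.
    static long sumInPlace(String[] inputs) {
        long sum = 0;
        for (String s : inputs) {
            int value = 0;
            for (int i = 2, end = s.length() - 1; i < end; i++)
                value = value * 10 + (s.charAt(i) - '0');
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        String[] inputs = makeInputs(1_000_000, 42L);   // generated outside the timed region
        for (int round = 0; round < 5; round++) {       // early rounds mostly warm up the JIT
            long t0 = System.nanoTime();
            long a = sumViaSubstring(inputs);
            long t1 = System.nanoTime();
            long b = sumInPlace(inputs);
            long t2 = System.nanoTime();
            System.out.printf("round %d: substring %d ms, in place %d ms, sums match: %b%n",
                    round, (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, a == b);
        }
    }
}
```

Summing the parsed values and printing the comparison gives the JIT a reason to keep both loops, and the later rounds are the ones worth reading.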

Don't worry too much about the objects if you do not experience actual performance problems. Use a current JVM; performance and memory overhead are improved continually.

You can have a look at the "ByteString" class from Google Protocol Buffers if you want a substring that shares the underlying data:

https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/ByteString#substring%28int,%20int%29
