Java String memory performance improvement for a text file parser

Question

I'm trying to optimize my memory usage when dealing with many (millions) of strings. I have a file of ~1.5 million lines that I'm reading through that contains decimal numbers in columns. For example, a file may look like

16916576.643 4 -12312674.246 4        39.785 4  16916584.123 3  -5937726.325 3
    36.794 3
16399226.418 6  -4129008.232 6        43.280 6  16399225.374 4  -1891751.787 4
    39.885 4
12415561.671 9 -33057782.339 9        52.412 9  12415567.518 8 -25595925.487 8
    49.950 8
15523362.628 5 -12597312.619 5        40.579 5  15523369.553 5  -9739990.371 5
    42.003 5
12369614.129 8 -28797729.913 8        50.068 8         0.000           0.000  
     0.000  
....

Currently I'm using String.split("\\s+") to separate these numbers, then calling Double.parseDouble() on each part of the String[], which looks something like:

String[] data = line.split("\\s+");
double firstValue = Double.parseDouble(data[0]);
double secondValue = Double.parseDouble(data[1]);
double thirdValue = Double.parseDouble(data[2]);

This ends up creating a lot of String objects. I may also have whitespace at the beginning or end of the line, so I have to call trim() on the line before I split it, which creates yet another String object. The garbage collector disposes of these String objects, which results in slowdowns. Are there more memory-efficient constructs in Java to do this? I was thinking about using a char[] instead of a String, but I'm not sure whether there would be a substantial improvement from that.

Answer 1

If you are really certain that this is a severe bottleneck, you could always parse your string directly into Doubles.

import java.util.ArrayList;
import java.util.List;

// Keep track of my state.
private static class AsDoublesState {

    // null means no number waiting for add.
    Double d = null;
    // null means not seen '.' yet.
    Double div = null;
    // Is it negative.
    boolean neg = false;

    void consume(List<Double> doubles, char ch) {
        // Digit?
        if ('0' <= ch && ch <= '9') {
            double v = ch - '0';
            if (d == null) {
                d = v;
            } else {
                d = d * 10 + v;
            }
            // Count digits after the dot.
            if (div != null) {
                div *= 10;
            }
        } else if (ch == '.') {
            // Decimal point - start tracking how much to divide by.
            div = 1.0;
        } else if (ch == '-') {
            // Negate!
            neg = true;
        } else {
            // Everything else completes the number.
            if (d != null) {
                if (neg) {
                    d = -d;
                }
                if (div != null) {
                    doubles.add(d / div);
                } else {
                    doubles.add(d);
                }
                // Clear down.
                d = null;
                div = null;
                neg = false;
            }
        }
    }
}

private static List<Double> asDoubles(String s) {
    // Grow the list.
    List<Double> doubles = new ArrayList<>();
    // Track my state.
    AsDoublesState state = new AsDoublesState();

    for (int i = 0; i < s.length(); i++) {
        state.consume(doubles, s.charAt(i));
    }
    // Pretend a space on the end to flush any remaining number.
    state.consume(doubles, ' ');
    return doubles;
}

public void test() {
    String s = "16916576.643 4 -12312674.246 4        39.785 4  16916584.123 3  -5937726.325 3    36.794 3";
    List<Double> doubles = asDoubles(s);
    System.out.println(doubles);
}

This will break badly if given bad data. E.g. 123--56...392.86 would be a perfectly valid number to it, and 6.0221413e+23 would be two numbers.

Here's an improved State using AtomicDouble (available, for example, in Guava as com.google.common.util.concurrent.AtomicDouble) to avoid creating all of those intermediate Double objects.

// Keep track of my state.
private static class AsDoublesState {

    // Mutable double value.
    AtomicDouble d = new AtomicDouble();
    // Mutable double value.
    AtomicDouble div = new AtomicDouble();
    // Is there a number.
    boolean number = false;
    // Is it negative.
    boolean negative = false;

    void consume(List<Double> doubles, char ch) {
        // Digit?
        if ('0' <= ch && ch <= '9') {
            double v = ch - '0';
            d.set(d.get() * 10 + v);
            number = true;
            // Count digits after the dot.
            div.set(div.get() * 10);
        } else if (ch == '.') {
            // Decimal point - start tracking how much to divide by.
            div.set(1.0);
        } else if (ch == '-') {
            // Negate!
            negative = true;
        } else {
            // Everything else completes the number.
            if (number) {
                double v = d.get();
                if (negative) {
                    v = -v;
                }
                if (div.get() != 0) {
                    v = v / div.get();
                }
                doubles.add(v);
                // Clear down.
                d.set(0);
                div.set(0);
                number = false;
                negative = false;
            }
        }
    }
}
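
Note that the version above still boxes each result when it is added to the List<Double>. As a rough sketch of how the same state machine could avoid boxing altogether, the variant below keeps its state in plain double locals and writes into a caller-supplied double[]. The names PrimitiveDoubleParser and parseInto are made up for illustration, and it assumes the caller knows an upper bound on the number of values per line.

```java
// Hypothetical helper (not from the original answer): parses whitespace-separated
// decimals from a line into a caller-supplied double[] without boxing.
final class PrimitiveDoubleParser {

    // Parses numbers from s into out and returns how many values were written.
    // Like the answer's code, it does no validation and ignores exponents.
    // Assumption: out is large enough, otherwise ArrayIndexOutOfBoundsException.
    static int parseInto(CharSequence s, double[] out) {
        int count = 0;
        double value = 0;          // digits accumulated so far, ignoring the dot
        double divisor = 0;        // 0 means no '.' has been seen yet
        boolean inNumber = false;  // at least one digit has been seen
        boolean negative = false;

        // Iterate one position past the end so the last number is flushed.
        for (int i = 0; i <= s.length(); i++) {
            char ch = i < s.length() ? s.charAt(i) : ' ';
            if ('0' <= ch && ch <= '9') {
                value = value * 10 + (ch - '0');
                divisor *= 10;     // stays 0 until a '.' has been seen
                inNumber = true;
            } else if (ch == '.') {
                divisor = 1.0;
            } else if (ch == '-') {
                negative = true;
            } else {
                // Any other character completes the current number, if any.
                if (inNumber) {
                    double v = negative ? -value : value;
                    if (divisor != 0) {
                        v /= divisor;
                    }
                    out[count++] = v;
                }
                value = 0;
                divisor = 0;
                inNumber = false;
                negative = false;
            }
        }
        return count;
    }
}
```

A caller would allocate the double[] once, outside the read loop, and reuse it for every line, so the only per-line allocation left is the line String itself.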

Answer 2

Try using Pattern and Matcher to split the string with a compiled regular expression:

double[][] out = new double[2][2];
String[] data = new String[2];
data[0] = "1 2";
data[1] = "3 2";

Pattern pat = Pattern.compile("\\s*(\\d+\\.?\\d*)?\\s+?(\\d+\\.?\\d*)?\\s*");
Matcher mat = pat.matcher(data[0]);
mat.find();

out[0][0] = Double.parseDouble(mat.group(1));
out[0][1] = Double.parseDouble(mat.group(2));

mat = pat.matcher(data[1]);
mat.find();
out[1][0] = Double.parseDouble(mat.group(1));
out[1][1] = Double.parseDouble(mat.group(2));
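
The pattern above captures exactly two unsigned numbers per line. A possible generalization, sketched here under the assumption that values are plain decimals with an optional leading minus sign, is to compile a single-number pattern once and loop with find(). The names RegexDoubleReader and readDoubles are illustrative only. This removes the trim() call and the String[] from split(), but group() still creates one String per number.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from the original answer).
final class RegexDoubleReader {

    // Compiled once and reused for every line: optional sign, digits, optional fraction.
    private static final Pattern NUMBER = Pattern.compile("-?\\d+(?:\\.\\d*)?");

    // Writes the numbers found in line into out and returns how many were written.
    // Stops early if out is full.
    static int readDoubles(CharSequence line, double[] out) {
        Matcher m = NUMBER.matcher(line);
        int count = 0;
        while (count < out.length && m.find()) {
            // group() allocates a String for the matched token.
            out[count++] = Double.parseDouble(m.group());
        }
        return count;
    }
}
```

Because find() skips anything that does not match, leading and trailing whitespace needs no special handling.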

Answer 3

We had a similar problem in our application, where a lot of strings were being created, and we did a few things that helped us fix the issue.

Give more memory to Java if available, e.g. -Xmx2G for 2 GB.
If you're on a 32-bit JVM, you can only assign up to 4 GB (the theoretical limit), so move to 64-bit.
Profile your application.

You need to do it step by step:

Start with VisualVM and measure how many String and Double objects are being created, their size, the time spent, etc.
Use the Flyweight pattern and intern the String objects. The Guava library has Interner. You can do the same for Double values. This avoids duplicates and caches the objects using weak references, e.g.:
Interner<String> interner = Interners.newWeakInterner();
String a = interner.intern(getStringFromCsv());
String b = interner.intern(getStringFromCsv());

Copied from here

Profile your application again.

You can also use Scanner to read doubles from the file. The code will be cleaner, although whether it allocates less is something to measure: Scanner tokenizes with regular expressions internally, and Double.valueOf is not guaranteed to cache values.

Here's the code:

File file = new File("double_file.txt");
Scanner scan = new Scanner(file);
while (scan.hasNextDouble()) {
    System.out.println(scan.nextDouble());
}
scan.close();

You can use this code and see whether there is any gain in performance.

Answer 4

You don't have to produce any garbage at all:

Use BufferedInputStream in order to avoid char[] and String creation. There are no non-ASCII characters, so you can deal with bytes directly.
Write a parser similar to this solution, but avoiding any garbage.
Let your class provide a method like double nextDouble(), which reads characters until the next double is assembled.
If you need line-wise processing, watch for \n (ignore \r, as it's just a needless addendum to \n; a lone \r was used as a line separator a long time ago).
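
The steps above might look roughly like the following sketch. The class name ByteDoubleReader, the lineOfLastNumber() accessor, and the choice of Double.NaN as the end-of-input signal are all assumptions made for illustration. It assumes ASCII input containing plain decimals only.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical class (name and API are illustrative, not from the answer).
final class ByteDoubleReader {
    private final InputStream in;
    private int newlines;   // number of '\n' bytes consumed so far
    private int line;       // zero-based line on which the last number started

    ByteDoubleReader(InputStream source) {
        this.in = new BufferedInputStream(source);
    }

    // Returns the next number, or Double.NaN once the input is exhausted.
    // No validation and no exponent support, as in Answer 1's parser.
    double nextDouble() throws IOException {
        double value = 0;
        double divisor = 0;         // 0 means no '.' has been seen yet
        boolean started = false;    // a sign, dot, or digit has been seen
        boolean hasDigits = false;
        boolean negative = false;
        while (true) {
            int b = in.read();
            if (b >= '0' && b <= '9') {
                if (!started) { started = true; line = newlines; }
                value = value * 10 + (b - '0');
                divisor *= 10;      // stays 0 until a '.' has been seen
                hasDigits = true;
            } else if (b == '.' || b == '-') {
                if (!started) { started = true; line = newlines; }
                if (b == '.') divisor = 1; else negative = true;
            } else {
                // Separator, '\r', '\n', or -1 for end of input.
                if (b == '\n') newlines++;
                if (hasDigits) {
                    if (divisor != 0) value /= divisor;
                    return negative ? -value : value;
                }
                if (b < 0) return Double.NaN;
                // A separator (or a stray sign/dot): reset and keep scanning.
                started = false;
                negative = false;
                divisor = 0;
            }
        }
    }

    // Line index of the most recently returned number, for line-wise processing.
    int lineOfLastNumber() {
        return line;
    }
}
```

Apart from the reader itself and the stream's internal buffer, nothing is allocated per number or per line. A caller that needs to group values by line can compare lineOfLastNumber() between successive calls.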

4 answers

Answer 1: score 3, accepted, 2015-07-23 16:02:26
Answer 2: score 1, 2015-07-23 15:35:27
Answer 3: score 1, 2015-07-23 15:41:11
Answer 4: score 1, 2015-07-23 17:11:16
