
Huge memory overhead when reading a large data file in Java

I'm doing deep learning neural net development, using the MNIST dataset for testing. The training set is composed of 60,000 sequences, each with 784 double precision input values. The process of reading this data from the file into an array in Java is somehow incurring an approximately 4GB memory overhead, which remains allocated throughout the run of the program. This overhead is in addition to the 60000 * 784 * 8 = 376MB which is allocated for the double precision array itself. It seems likely that this overhead occurs because Java is storing a complete copy of the file in addition to the numerical array, but perhaps this is Scanner overhead.

According to a source, reading the file as a stream avoids storing the entire file in memory. However, I still have this problem with a stream read. I'm using Java 8 with IntelliJ 2016.2.4. This is the stream reading code:

FileInputStream inputStream = null;
Scanner fileScan = null;
String line;
String[] numbersAsStrings;

totalTrainingSequenceArray = new double[60000][784];

try {
    inputStream = new FileInputStream(m_sequenceFile);
    fileScan = new Scanner(inputStream, "UTF-8");
    int sequenceNum = 0;
    line = fileScan.nextLine();//Read and discard the first line.
    while (fileScan.hasNextLine()) {
        line = fileScan.nextLine();
        numbersAsStrings = line.split("\\s+"); //Split the line into an array of strings using any whitespace delimiter.
        for (int inputPosition = 0; inputPosition < m_numInputs; inputPosition++) {
            totalTrainingSequenceArray[sequenceNum][inputPosition] = Double.parseDouble(numbersAsStrings[inputPosition]);
        }
        sequenceNum++;
    }
    if (fileScan.ioException() != null) { // Rethrow any IOException the Scanner swallowed.
        throw fileScan.ioException();
    }
} catch (IOException e) { // Handle the input stream exception.
    e.printStackTrace();
} finally {
    if (inputStream != null)  {
        try {
            inputStream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    if (fileScan != null) {
        fileScan.close();
    }
}

I've tried setting the stream and the scanner to null after the read and calling System.gc(), but this does nothing. Is this a Scanner overhead issue? What would be the simplest way to read this large data file without incurring a large permanent overhead? Thank you for any input.

Your code works just fine. 380MB of heap will actually be used after a full GC.

Java is eager to allocate memory to minimize GC overhead. You could limit the size of allocated memory by using the -Xmx512m parameter, by using a different GC (e.g. -XX:+UseConcMarkSweepGC), or by setting -XX:MaxHeapFreeRatio=40.
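As a rough way to check that figure yourself, here is a minimal sketch that requests a full GC and then reports the live heap. Note that System.gc() is only a hint to the VM (though it is usually honored for a measurement like this), and the class name HeapCheck and the elided reading step are placeholders, not code from the question:

public class HeapCheck {
    public static void main(String[] args) {
        // Roughly the same amount of live data as the question's array (~376MB).
        double[][] data = new double[60000][784];

        // ... read the file into data as in the question ...

        System.gc(); // only a hint, but usually sufficient for a rough measurement
        Runtime rt = Runtime.getRuntime();
        long usedBytes = rt.totalMemory() - rt.freeMemory();
        System.out.printf("Live heap after GC: %d MB%n", usedBytes / (1024 * 1024));

        System.out.println(data.length); // keep a live reference past the measurement
    }
}

Launched with e.g. java -Xmx512m HeapCheck, this should still run and report something close to the size of the array, since the live data fits under the cap even though parsing produces lots of temporary garbage along the way.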

Define "overhead". 定义“开销”。 The VM uses the alloted heap to balance between garbage collection time and execution speed (there are some screws you can turn to influence its decisions). VM使用分配的堆在垃圾回收时间和执行速度之间进行平衡(您可以使用一些螺钉来影响其决策)。

The norm is for the VM to let the heap fill until the GC threshold is reached, then collect whatever garbage can be collected, then continue execution (that's simplified a lot). This leads to a "sawtooth" pattern in heap usage (gradual filling, then a sudden drop in heap usage). This is completely normal for code that produces garbage at a steady rate.

The points you can influence are how high the "teeth" can build (by adjusting the allowed heap and/or when the GC should kick in). The speed of garbage creation (how sharply heap usage rises) depends on the code executed; it can range anywhere from zero to the maximum attainable allocation rate.

Your reading code is of the type that creates a lot of small garbage objects: the line from the scanner, and the parts you split the line into. If your heap is large enough, the entire file can be read without collecting any of that garbage (most likely that's the case with your 4GB heap setting).

If you make the heap smaller, the VM will collect garbage sooner, reducing the memory usage (likewise, you can play with the GC parameters to force collection at a smaller percentage of heap used).

It's unreasonable, though, to expect the code to run with just the amount of memory you calculated for your array. What you see in the task manager is the accumulation of all memory used by the VM. That includes the stack, any resources needed for the JRE, native libraries, and the heap.

Memory outside the heap can vary wildly, depending on how many threads, files, and other resources your program uses. As a very rough rule of thumb, at least 20-50MB is used by the JRE itself, even for just running something as simple as a "Hello world".

The problem with VM tuning, regardless of whether you just adjust the heap size or fine-tune GC parameters, is that it has to be redone whenever the problem set changes (e.g. you could probably get away with -Xmx512m for your current file, but you would need to adjust the value for the next file).

Alternatively, you could attempt to reduce the amount of garbage created, ideally to zero. Instead of the scanner, reading line by line, you could read character by character and do the parsing with a state machine, as in the sketch below. This will greatly reduce garbage creation, but make the code much more complex.
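As a rough illustration of that approach, here is a minimal sketch of such a state machine. It assumes the file contains only non-negative decimal numbers separated by whitespace (no signs or exponents, which is enough for MNIST pixel data), and the class name and hard-coded dimensions are placeholders:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class StreamingParser {
    public static double[][] read(String path) throws IOException {
        double[][] data = new double[60000][784];
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            int c;
            while ((c = in.read()) != -1 && c != '\n') { } // skip the header line
            int row = 0, col = 0;
            double value = 0.0, scale = 0.0;
            boolean inNumber = false, inFraction = false;
            while ((c = in.read()) != -1) {
                if (c >= '0' && c <= '9') { // accumulate a digit
                    inNumber = true;
                    if (inFraction) {
                        scale *= 0.1;
                        value += (c - '0') * scale;
                    } else {
                        value = value * 10 + (c - '0');
                    }
                } else if (c == '.') { // switch to the fractional part
                    inFraction = true;
                    scale = 1.0;
                } else if (inNumber) { // whitespace ends the current token
                    data[row][col++] = value;
                    if (col == 784) { col = 0; row++; }
                    value = 0.0;
                    inNumber = false;
                    inFraction = false;
                }
            }
            if (inNumber) { // flush a trailing token if the file lacks a final newline
                data[row][col] = value;
            }
        }
        return data;
    }
}

Accumulating digits manually like this can differ from Double.parseDouble by a unit or two in the last place, which rarely matters for training data; the payoff is that the loop allocates no per-line or per-token objects at all.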

In many cases the most "efficient" solution is simply not to worry about memory usage: the time spent optimizing VM parameters or code would probably be better spent making progress on your program. As long as the "overhead" doesn't hinder you, why bother?
