
Huge memory overhead when reading a large data file in Java

I'm doing deep learning neural net development, using the MNIST dataset for testing. The training set is composed of 60,000 sequences, each with 784 double precision input values. The process of reading this data from the file into an array in Java somehow incurs an approximately 4GB memory overhead, which remains allocated throughout the run of the program. This overhead is in addition to the 60000*784*8 = 376MB allocated for the double precision array itself. It seems likely that this overhead arises because Java is storing a complete copy of the file in addition to the numerical array, but perhaps it is Scanner overhead.

According to a source, reading the file as a stream avoids storing the entire file in memory. However, I still have this problem with a stream read. I'm using Java 8 with IntelliJ 2016.2.4. This is the stream-reading code:

FileInputStream inputStream = null;
Scanner fileScan = null;
String line;
String[] numbersAsStrings;

totalTrainingSequenceArray = new double[60000][784];

try {
    inputStream = new FileInputStream(m_sequenceFile);
    fileScan = new Scanner(inputStream, "UTF-8");
    int sequenceNum = 0;
    line = fileScan.nextLine();//Read and discard the first line.
    while (fileScan.hasNextLine()) {
        line = fileScan.nextLine();
        numbersAsStrings = line.split("\\s+"); //Split the line into an array of strings using any whitespace delimiter.
        for (int inputPosition = 0; inputPosition < m_numInputs; inputPosition++) {
            totalTrainingSequenceArray[sequenceNum][inputPosition] = Double.parseDouble(numbersAsStrings[inputPosition]);
        }
        sequenceNum++;
    }
    if (fileScan.ioException() != null) {//Handle fileScan exception
        throw fileScan.ioException();
    }
} catch (IOException e) {//Handle the inputstream exception
    e.printStackTrace();
} finally {
    if (inputStream != null)  {
        try {
            inputStream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    if (fileScan != null) {
        fileScan.close();
    }
}

I've tried setting the stream and the scanner to null after the read and calling System.gc(), but this has no effect. Is this a Scanner overhead issue? What would be the simplest way to read this large data file without incurring a large permanent overhead? Thank you for any input.

Your code works just fine. Roughly 380MB of heap will actually be in use after a full GC.
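As an illustration (this snippet is not from the original code), you can check how much heap is actually live by querying the Runtime right after the file has been read:

// Rough check of live heap after loading the data (illustrative snippet).
Runtime rt = Runtime.getRuntime();
System.gc(); // request a full collection; the VM may ignore it, so treat the number as approximate
long usedBytes = rt.totalMemory() - rt.freeMemory();
System.out.printf("Live heap after GC: %.1f MB%n", usedBytes / (1024.0 * 1024.0));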

Java is eager to allocate memory in order to minimize GC overhead. You can limit the amount of allocated memory with the -Xmx512m parameter, by using a different GC (e.g. -XX:+UseConcMarkSweepGC), or with -XX:MaxHeapFreeRatio=40.
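For example (the main class name here is a placeholder), the flags are passed on the command line, and the ceiling that actually applies can be confirmed from inside the program:

// Hypothetical launch with a capped heap and an alternative collector:
//   java -Xmx512m -XX:+UseConcMarkSweepGC MnistTrainer
// Confirm the effective heap ceiling at runtime:
long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
System.out.println("Heap ceiling: " + maxHeapMb + " MB");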

Define "overhead". The VM uses the allotted heap to balance between garbage collection time and execution speed (there are some screws you can turn to influence its decisions).

The norm is for the VM to let the heap fill until the GC threshold is reached, then collect whatever garbage can be collected, then continue execution (that's simplified a lot). This leads to a "sawtooth" pattern in heap usage (gradual filling, then a sudden drop). This is completely normal for code that produces garbage at a steady rate.

The points you can influence are how high the "teeth" can build (by adjusting the allowed heap and/or when the GC should kick in). The speed of garbage creation (how sharply heap usage rises) depends on the code executed; it can range anywhere from zero to the maximum attainable allocation rate.

Your reading code is of the kind that creates a lot of small garbage objects: the line from the scanner, the parts you split the line into. If your heap is large enough, the entire file can be read without collecting any of that garbage (most likely that's the case with your 4GB heap setting).

If you make the heap smaller, the VM will collect garbage sooner, reducing the memory usage (likewise, you can play with the GC parameters to force collection at a smaller percentage of used heap).

It's unreasonable, though, to expect the code to run with just the amount of memory you calculated for your array. What you see in the task manager is the sum of all memory used by the VM. That includes the stacks, any resources needed by the JRE, native libraries, and the heap.

Memory outside the heap can vary wildly, depending on how many threads, files and other resources your program uses. As a very rough rule of thumb, at least 20-50 MB are used by the JRE itself, even for running something as simple as a "Hello world".

The problem with VM tuning, regardless of whether you just adjust the heap size or fine-tune the GC parameters, is that it has to be redone whenever the problem set changes (e.g. you could probably get away with -Xmx512m for your current file, but you would need to adjust the value for the next file).

Alternatively, you could attempt to reduce the amount of garbage created, ideally to zero. Instead of a Scanner reading line by line, you could read character by character and do the parsing with a state machine. This will greatly reduce garbage creation, but it makes the code much more complex; a sketch follows.
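A minimal sketch of that idea, assuming the same whitespace-separated layout with one header line as in the question (the class and method names are placeholders, and Double.parseDouble still creates one short-lived String per number):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Low-garbage reader: no per-line String and no String[] from split().
// A single StringBuilder is reused for every token.
class LowGarbageReader {

    static double[][] readSequences(String sequenceFile, int numSequences, int numInputs)
            throws IOException {
        double[][] data = new double[numSequences][numInputs];
        StringBuilder token = new StringBuilder(32); // reused for every number
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(sequenceFile), StandardCharsets.UTF_8))) {
            int c;
            // Read and discard the header line, character by character.
            while ((c = reader.read()) != -1 && c != '\n') { /* skip */ }

            int row = 0, col = 0;
            while (row < numSequences && (c = reader.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (token.length() > 0) {               // end of a number
                        data[row][col++] = Double.parseDouble(token.toString());
                        token.setLength(0);                 // reuse the builder
                        if (col == numInputs) { col = 0; row++; }
                    }
                } else {
                    token.append((char) c);
                }
            }
            if (token.length() > 0 && row < numSequences) { // trailing number without a newline
                data[row][col] = Double.parseDouble(token.toString());
            }
        }
        return data;
    }
}

In the question's code, this would replace the whole Scanner loop with something like totalTrainingSequenceArray = LowGarbageReader.readSequences(m_sequenceFile, 60000, m_numInputs);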

In many cases the most "efficient" solution is simply not to worry about memory usage: the time spent optimizing VM parameters or code would probably be better spent making progress with your program. As long as the "overhead" doesn't hinder you, why bother?
