
Java program processing a file fails with java.lang.OutOfMemoryError: GC overhead limit exceeded

I have the following Java class to read from a file containing many lines of tab-delimited strings. An example line looks like this:

GO:0085044      GO:0085044      GO:0085044

The code reads each line, uses split() to put the three substrings into an array, and then puts them into a two-level hash.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class LCAReader {
    public static void main(String[] args) {
        Map<String, Map<String, String>> termPairLCA = new HashMap<String, Map<String, String>>();
        File ifile = new File("LCA1.txt");
        try {
            BufferedReader reader = new BufferedReader(new FileReader(ifile));
            String line = null;
            while( (line=reader.readLine()) != null ) {
                String[] arr = line.split("\t");
                if( termPairLCA.containsKey(arr[0]) ) {
                    if( termPairLCA.get(arr[0]).containsKey(arr[1]) ) {
                        System.out.println("Error: Duplicate term in LCACache");
                    } else {
                        termPairLCA.get(arr[0]).put(new String(arr[1]), new String(arr[2]));
                    }
                } else {
                    Map<String, String> tempMap = new HashMap<String, String>();
                    tempMap.put( new String(arr[1]), new String(arr[2]) );
                    termPairLCA.put( new String(arr[0]), tempMap );
                }
            }
            reader.close();
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}

When I ran the program, I got the following runtime error after it had been running for a while. I noticed that memory usage kept increasing.

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.regex.Pattern.compile(Pattern.java:1469)
    at java.util.regex.Pattern.<init>(Pattern.java:1150)
    at java.util.regex.Pattern.compile(Pattern.java:840)
    at java.lang.String.split(String.java:2304)
    at java.lang.String.split(String.java:2346)
    at LCAReader.main(LCAReader.java:17)

The input file is almost 2G, and the machine I ran the program on has 8G of memory. I also tried running with the -Xmx4096m parameter, but that did not help. So I guess there is some memory leak in my code, but I cannot find it.

Can anyone help me on this? Thanks in advance!

There's no memory leak; you're just trying to store too much data. 2GB of text will take 4GB of RAM as Java characters; plus there's about 48 bytes per String object overhead. Assuming the text is in 100 character lines, there's about another GB, for a total of 5GB -- and we haven't even counted the Map.Entry objects yet! You'd need a Java heap of at least, conservatively, 6GB to run this program on your data, and maybe more.
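
A rough back-of-envelope version of that estimate (the 100-character line length is an assumption, as above):

2 GB of text               ->  ~4 GB as UTF-16 char data on the Java heap
2 GB / ~100 bytes per line ->  ~20 million lines
~20 million String objects x ~48 bytes overhead  ->  ~1 GB
~4 GB char data + ~1 GB overhead  ->  ~5 GB, before any Map.Entry objects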

There are a couple of easy things you can do to improve this. First, lose the new String() constructors -- they're useless and just make the garbage collector work harder. Strings are immutable so you never need to copy them. Second, you could use the intern pool to share duplicate strings -- this may or may not help, depending on what the data actually looks like. But you could try, for example,

tempMap.put(arr[1].intern(), arr[2].intern());

These simple steps might help a lot.
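
For example, the loop body with the copies removed and both stored strings interned might look like this (a sketch of the same logic, not a full rewrite):

String[] arr = line.split("\t");
Map<String, String> inner = termPairLCA.get(arr[0]);
if (inner == null) {
    inner = new HashMap<String, String>();
    termPairLCA.put(arr[0].intern(), inner);  // intern the outer key as well
}
if (inner.containsKey(arr[1])) {
    System.out.println("Error: Duplicate term in LCACache");
} else {
    inner.put(arr[1].intern(), arr[2].intern());  // share duplicate strings
}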

I don't see any leak either; you simply need a very large amount of memory to store your map. There is a very good way to verify this: make a heap dump with the -XX:+HeapDumpOnOutOfMemoryError option and import it into Eclipse Memory Analyzer, which comes as a standalone version. It can show you the biggest retained objects and the reference trees that could be preventing the garbage collector from doing its job. In addition, a profiler such as the NetBeans Profiler can give you a lot of interesting real-time information (for instance, the number of String and char[] instances).
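
For example, to capture the dump when the error occurs (the dump path, and reusing -Xmx4096m, are just illustrative):

java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/lca.hprof -Xmx4096m LCAReader

Then open /tmp/lca.hprof in Eclipse Memory Analyzer.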

Also, it is good practice to split your code into different classes, each with a single responsibility: the "two keys map" class (TreeMap) on one side and a "parser" class on the other. It should make debugging easier; a sketch follows.
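
A sketch of that separation (the class name and method signatures are just illustrative, and it wraps HashMaps as in the question):

import java.util.HashMap;
import java.util.Map;

// "Two keys map": hides the nested-map bookkeeping from the parser code.
public class TwoKeyMap {
    private final Map<String, Map<String, String>> data =
            new HashMap<String, Map<String, String>>();

    // Returns the previous value for (k1, k2), or null if the pair was absent.
    public String put(String k1, String k2, String value) {
        Map<String, String> inner = data.get(k1);
        if (inner == null) {
            inner = new HashMap<String, String>();
            data.put(k1, inner);
        }
        return inner.put(k2, value);
    }

    public String get(String k1, String k2) {
        Map<String, String> inner = data.get(k1);
        return inner == null ? null : inner.get(k2);
    }
}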

It is definitely not a good idea to store this huge map in RAM... or else you need to benchmark with some smaller files and extrapolate to estimate the RAM your system would need in the worst case, and set -Xmx to the proper value. Why don't you use a key-value store such as Berkeley DB? It is simpler than a relational DB and should fit your need for two-level indexing exactly. Check this post for the choice of the store: key-value store suggestion
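
A minimal sketch with Berkeley DB Java Edition, assuming the two GO terms are flattened into a single composite key (the environment directory, database name, and key encoding are all assumptions):

import java.io.File;
import com.sleepycat.je.*;

public class LCAStore {
    public static void main(String[] args) throws Exception {
        EnvironmentConfig envCfg = new EnvironmentConfig();
        envCfg.setAllowCreate(true);
        // The "lca-env" directory must already exist.
        Environment env = new Environment(new File("lca-env"), envCfg);

        DatabaseConfig dbCfg = new DatabaseConfig();
        dbCfg.setAllowCreate(true);
        Database db = env.openDatabase(null, "lca", dbCfg);

        // Two-level index flattened into one key: "term1\tterm2" -> lca
        DatabaseEntry key = new DatabaseEntry("GO:0085044\tGO:0085044".getBytes("UTF-8"));
        DatabaseEntry value = new DatabaseEntry("GO:0085044".getBytes("UTF-8"));
        db.put(null, key, value);

        DatabaseEntry found = new DatabaseEntry();
        if (db.get(null, key, found, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
            System.out.println(new String(found.getData(), "UTF-8"));
        }

        db.close();
        env.close();
    }
}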

Good luck

You probably shouldn't use String.split and store the information as plain Strings, as this generates lots of String objects on the fly.

Try using a char-based approach, since your format seems rather fixed and you know the exact indices of the different data points on a line.
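
For example, finding the tab positions with indexOf avoids compiling a regex for every line (a sketch, using the same three-column layout as the input above):

// Split-free parsing: locate the two tabs and substring around them.
int tab1 = line.indexOf('\t');
int tab2 = line.indexOf('\t', tab1 + 1);
String term1 = line.substring(0, tab1);
String term2 = line.substring(tab1 + 1, tab2);
String lca   = line.substring(tab2 + 1);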

If you're a bit more into experimenting, you could try an NIO-backed approach with a memory-mapped DirectByteBuffer, or a CharBuffer used to traverse the file. There you could just record the offsets of the different data points in marker objects and only load the real String data later in the process, when needed.
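
A minimal sketch of the memory-mapped variant (note that a single MappedByteBuffer is limited to 2 GB, so a file of this size may need to be mapped in chunks):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedScan {
    public static void main(String[] args) throws Exception {
        RandomAccessFile raf = new RandomAccessFile("LCA1.txt", "r");
        FileChannel ch = raf.getChannel();
        long size = Math.min(ch.size(), Integer.MAX_VALUE); // first chunk only
        MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);

        // Scan for line boundaries; a real reader would store the offsets
        // as markers instead of materializing a String per line.
        long lines = 0;
        for (int i = 0; i < buf.limit(); i++) {
            if (buf.get(i) == '\n') {
                lines++;
            }
        }
        System.out.println("lines: " + lines);
        ch.close();
        raf.close();
    }
}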
