Java - how to efficiently store a large amount of String arrays

Question

I'm trying to load large CSV files (typically 200-600 MB) efficiently with Java, meaning low memory use and the fastest possible access. Currently, the program uses a List of String arrays. This was previously handled by a Lua program that used one table per CSV row and another table to hold the row tables.

Below is an example of the memory differences and load times:

CSV file - 232 MB
Lua - 549 MB in memory - 157 seconds to load
Java - 1,378 MB in memory - 12 seconds to load

If I remember correctly, duplicate items in a Lua table exist as a reference to the actual value. I suspect in the Java example, the List is holding separate copies of each duplicate value and that may be related to the larger memory usage.

Below is some background on the data within the CSV files:

Each field consists of a String
Specific fields within each row may hold one of a fixed set of Strings (e.g. field 3 could be "red", "green", or "blue").
There are many duplicate Strings within the content.

Below are some examples of what may be required of the loaded data:

Search through all Strings for matches with a given String and return the matching Strings.
Display matches in a GUI table (sortable by field).
Alter or replace Strings.

My question - Is there a collection that will require less memory to hold the data yet still offer features to easily and quickly search/sort the data?

Answer 1

One easy solution: keep a HashMap that holds a reference to every unique string, and have the ArrayList store only references to those existing unique strings in the HashMap.

Something like :

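// Maps each distinct string to its single canonical instance.
private final HashMap<String, String> hashMap = new HashMap<String, String>();

// Returns the canonical instance equal to ns, registering ns if it is new.
public String getUniqueString(String ns) {
    String oldValue = hashMap.get(ns);
    if (oldValue != null) { // assumes there are no null strings in the CSV
        return oldValue;
    }
    hashMap.put(ns, ns);
    return ns;
}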

Simple usage:

List<String> s = Arrays.asList("Pera", "Zdera", "Pera", "Kobac", "Pera", "Zdera", "rus");
List<String> finS = new ArrayList<String>();
for (String er : s) {
    // Called from within the class that defines getUniqueString
    String ns = getUniqueString(er);
    finS.add(ns);
}
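
Applied to the CSV case from the question, the same idea might look like the sketch below. The class name DedupCsvLoader and its methods are made up for illustration, and the comma split is deliberately naive: it assumes no quoted fields or embedded commas, so a real CSV parser may be needed instead.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical loader: stores rows as String[] but shares equal field values.
final class DedupCsvLoader {
    // Maps each distinct field value to the one instance that is kept.
    private final Map<String, String> unique = new HashMap<>();

    // Returns the already stored instance equal to s, or stores and returns s.
    private String canonical(String s) {
        String existing = unique.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    // Reads all lines and replaces every field with its canonical instance.
    List<String[]> load(Reader source) {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                // Assumption: plain comma-separated values without quoting.
                String[] fields = line.split(",", -1);
                for (int i = 0; i < fields.length; i++) {
                    fields[i] = canonical(fields[i]);
                }
                rows.add(fields);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    // Number of distinct field values seen so far.
    int uniqueCount() {
        return unique.size();
    }
}
```

A caller could pass a FileReader (or any other Reader) to load. How much memory this saves depends on how many field values actually repeat in the file.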

Answer 2

To reduce memory usage, I advise using the Flyweight pattern, especially for fields that have many duplicates.

For the collection, you can use a TreeSet or TreeMap.

If you give your LineItem class a good implementation (equals, hashCode and Comparable), you can reduce memory use considerably.
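
As a rough sketch of what such a LineItem could look like: the answer only names the class, so the field layout, the static pool, and the ordering below are all assumptions made for illustration. The pool plays the flyweight role by handing out one shared String per distinct value.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical row type; assumes no field is null.
final class LineItem implements Comparable<LineItem> {
    // Flyweight pool: one shared String instance per distinct value.
    // Not thread-safe; assumes rows are built on a single thread.
    private static final Map<String, String> POOL = new HashMap<>();

    private final String[] fields;

    LineItem(String[] rawFields) {
        fields = new String[rawFields.length];
        for (int i = 0; i < rawFields.length; i++) {
            fields[i] = canonical(rawFields[i]);
        }
    }

    // Returns the pooled instance equal to s, adding s to the pool if absent.
    static String canonical(String s) {
        String existing = POOL.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    String field(int index) {
        return fields[index];
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof LineItem && Arrays.equals(fields, ((LineItem) o).fields);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(fields);
    }

    // Compares field by field; on a common prefix the shorter row sorts first.
    @Override
    public int compareTo(LineItem other) {
        int n = Math.min(fields.length, other.fields.length);
        for (int i = 0; i < n; i++) {
            int c = fields[i].compareTo(other.fields[i]);
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(fields.length, other.fields.length);
    }
}
```

Because compareTo is consistent with equals here, instances can be placed in a TreeSet or used as TreeMap keys. Sorting a GUI table by an arbitrary column would need a separate Comparator per field.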

Answer 3

DAWG

A directed acyclic word graph is a very compact way to store a set of words (in terms of memory consumption, anyway).

It is probably overkill here, though. As others have said, don't create duplicates; just keep multiple references to the same instance.

Answer 4

Maybe this article can be of some help:

http://www.javamex.com/tutorials/memory/string_saving_memory.shtml

Answer 5

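Just a side note.

Regarding the duplicate strings you suspect: Java shares String instances automatically only for compile-time literals, which are interned in the string pool. Strings created at runtime, such as fields parsed from a CSV file, are separate objects even when their contents are equal, unless you call String.intern() on them or deduplicate them yourself.

I'm not sure how Lua does the job, but once duplicates are shared, Java should be quite efficient as well.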
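
Whether two equal strings really are the same object can be checked by comparing references with ==. The small sketch below (the class and method names are invented for this example) shows that literals share one instance, that a string built at runtime does not, and that String.intern() returns the pooled instance.

```java
// Hypothetical demo class; only standard java.lang.String calls are used.
final class StringIdentityDemo {
    // Builds a new String at runtime, as a parser would for each CSV field.
    static String runtimeCopy(String s) {
        return new String(s.toCharArray());
    }

    public static void main(String[] args) {
        String literal = "red";
        String parsed = runtimeCopy("red");

        System.out.println(literal == "red");           // true: both are the same pooled literal
        System.out.println(literal == parsed);          // false: equal contents, different objects
        System.out.println(literal.equals(parsed));     // true: contents match
        System.out.println(literal == parsed.intern()); // true: intern() returns the pooled instance
    }
}
```

Calling intern() on every parsed field is one way to share duplicates. A private HashMap, as in Answer 1, achieves the same sharing without adding the values to the JVM-wide string pool.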

solution1 1 2012-11-11 16:32:34
solution2 0 2012-11-11 15:50:01
solution3 0 2012-11-11 15:51:33
solution4 0 ACCEPTED 2012-11-11 15:52:31
solution5 0 2012-11-11 16:12:22
