I'm trying to load large CSV files (typically 200-600 MB) efficiently in Java, using as little memory as possible while keeping access as fast as possible. Currently, the program uses a List of String arrays. This was previously handled by a Lua program that used a table for each CSV row and a table to hold each "row" table.
Below is an example of the memory differences and load times:
If I remember correctly, duplicate items in a Lua table exist as a reference to the actual value. I suspect in the Java example, the List is holding separate copies of each duplicate value and that may be related to the larger memory usage.
Below is some background on the data within the CSV files:
Below are some examples of what may be required of the loaded data:
My question - Is there a collection that will require less memory to hold the data yet still offer features to easily and quickly search/sort the data?
One easy solution: keep a HashMap in which you store a reference to every unique string, and have the ArrayList hold only references to the unique strings already in that HashMap.
Something like:
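import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StringPool {
    // Maps each distinct string to the single instance that is kept.
    private final Map<String, String> hashMap = new HashMap<String, String>();

    public String getUniqueString(String ns) {
        String oldValue = hashMap.get(ns);
        if (oldValue != null) { // I suppose there will be no null strings inside the CSV
            return oldValue;
        }
        hashMap.put(ns, ns);
        return ns;
    }
}
Simple usage:
StringPool a = new StringPool();
List<String> s = Arrays.asList("Pera", "Zdera", "Pera", "Kobac", "Pera", "Zdera", "rus");
List<String> finS = new ArrayList<String>();
for (String er : s) {
    String ns = a.getUniqueString(er); // the shared instance for this value
    finS.add(ns);
}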
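To show how the same idea could be applied while reading a file, here is a minimal sketch. The class name PooledCsvLoader, its methods, and the naive comma split (no quoted-field handling) are all assumptions made for illustration, not something taken from the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical loader: the names and the naive comma split are illustrative assumptions.
class PooledCsvLoader {
    private final Map<String, String> pool = new HashMap<>();

    // Returns the first instance seen for this value, so duplicates share one object.
    String canonical(String value) {
        String existing = pool.putIfAbsent(value, value);
        return existing != null ? existing : value;
    }

    // Reads every line, splits on commas, and replaces each field with its pooled instance.
    List<String[]> load(Reader source) {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",", -1); // -1 keeps trailing empty fields
                for (int i = 0; i < fields.length; i++) {
                    fields[i] = canonical(fields[i]);
                }
                rows.add(fields);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    // Number of distinct field values currently held in the pool.
    int distinctValues() {
        return pool.size();
    }
}

class PooledCsvLoaderDemo {
    public static void main(String[] args) {
        PooledCsvLoader loader = new PooledCsvLoader();
        List<String[]> rows = loader.load(new StringReader("Pera,Zdera\nPera,Kobac\n"));
        // Both "Pera" fields refer to the same String object after pooling.
        System.out.println(rows.get(0)[0] == rows.get(1)[0]); // true
        System.out.println(loader.distinctValues());          // 3
    }
}
```

Once loading is finished the pool itself can be dropped, since the rows keep the shared instances alive. How much this saves depends on how many field values actually repeat.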
To address your memory problem, I advise using the Flyweight pattern, especially for fields that have a lot of duplicates.
As a collection you can use a TreeSet or TreeMap.
If you give your LineItem class a good implementation (implement equals, hashCode and Comparable), you can reduce memory use a lot.
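As a rough sketch of what such a LineItem could look like: the two fields (category and name) are invented for illustration, since the real columns are not shown in the question.

```java
import java.util.Objects;
import java.util.TreeSet;

// Hypothetical row type: the two fields are assumptions, not taken from the question's data.
final class LineItem implements Comparable<LineItem> {
    private final String category;
    private final String name;

    LineItem(String category, String name) {
        this.category = category;
        this.name = name;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof LineItem)) return false;
        LineItem that = (LineItem) other;
        return category.equals(that.category) && name.equals(that.name);
    }

    @Override
    public int hashCode() {
        return Objects.hash(category, name);
    }

    // Orders by category, then by name, which keeps compareTo consistent with equals.
    @Override
    public int compareTo(LineItem other) {
        int byCategory = category.compareTo(other.category);
        return byCategory != 0 ? byCategory : name.compareTo(other.name);
    }

    @Override
    public String toString() {
        return category + "/" + name;
    }
}

class LineItemDemo {
    public static void main(String[] args) {
        TreeSet<LineItem> items = new TreeSet<>();
        items.add(new LineItem("fruit", "pear"));
        items.add(new LineItem("fruit", "apple"));
        items.add(new LineItem("fruit", "pear")); // equal to the first item, so not added again
        System.out.println(items); // [fruit/apple, fruit/pear]
    }
}
```

Note that a TreeSet keeps items sorted but silently drops rows that compare as equal, so it only fits if fully identical CSV rows do not need to be preserved. The memory saving comes from sharing repeated field values between LineItem instances, not from the tree itself.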
A directed acyclic word graph (DAWG) is one of the most memory-efficient ways to store a set of words.
But it is probably overkill here. As others have said, don't create duplicates; just keep multiple references to the same instance.
Maybe this article can be of some help:
http://www.javamex.com/tutorials/memory/string_saving_memory.shtml
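Just as a side note:
For the duplicate string data you are worried about: Java strings are immutable, and string literals in source code are interned, so all references to the same literal target the same object in memory. Strings created at runtime, such as fields parsed from a CSV file, are separate objects unless you call String.intern() or pool them yourself.
I'm not sure how Lua does the job, but with interning or pooling in place Java should also be quite efficient.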
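A quick way to check how this behaves for strings built at runtime, which is what a CSV parser produces, is sketched below. The class InternDemo and its helper runtimeCopy are made up for the demonstration; String.intern() is the standard JDK method.

```java
class InternDemo {
    // Builds a fresh String object at runtime, much as parsing a line of text would.
    static String runtimeCopy(String value) {
        return new String(value.toCharArray());
    }

    public static void main(String[] args) {
        String literalA = "Pera";
        String literalB = "Pera";
        String runtimeA = runtimeCopy("Pera");
        String runtimeB = runtimeCopy("Pera");

        System.out.println(literalA == literalB);                   // true: literals are interned
        System.out.println(runtimeA == runtimeB);                   // false: two separate objects
        System.out.println(runtimeA.equals(runtimeB));              // true: same contents
        System.out.println(runtimeA.intern() == runtimeB.intern()); // true: one canonical instance
    }
}
```

Whether to use String.intern() or a private HashMap pool is a design choice: the interned pool is JVM-wide, while a private map can be discarded once loading is done.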