
Java in-memory size optimization

I'm writing some "big data" software that needs to hold a lot of data in memory. I wrote a prototype in C++ that works great. However, the actual end-users typically code in Java, so they've asked me to also write a Java prototype.

I've done background reading on memory footprint in Java and some preliminary tests. For example, let's say I have this object:

public class DataPoint {

    int cents, time, product_id, store_id;

    public DataPoint(int cents, int time, int product_id, int store_id) {
        this.cents = cents;
        this.time = time;
        this.product_id = product_id;
        this.store_id = store_id;
    }
}

In C++, sizeof on this structure is 16 bytes, which makes sense. In Java we have to measure indirectly: if I create, e.g., 10 million of these objects, take Runtime.totalMemory() - Runtime.freeMemory() before and after, and divide appropriately, I get approximately 36 bytes per object. A ~2.4x memory difference is pretty nasty; it's going to get ugly when we try to hold hundreds of millions of DataPoints in memory.
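For reference, here is a minimal sketch of that measurement, assuming the DataPoint class above (System.gc() is only a hint to the VM, so the result is approximate; run with a generous heap, e.g. -Xmx2g):

public class FootprintTest {
    static final int N = 10_000_000;

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.gc(); // request a collection so the baseline is as clean as possible
        long before = rt.totalMemory() - rt.freeMemory();

        DataPoint[] points = new DataPoint[N];
        for (int i = 0; i < N; i++) {
            points[i] = new DataPoint(i, i, i, i);
        }

        System.gc();
        long after = rt.totalMemory() - rt.freeMemory();
        System.out.println("approx bytes per DataPoint: " + (double) (after - before) / N);

        // Keep the array reachable so it isn't collected before the second measurement.
        System.out.println("still holding " + points.length + " points");
    }
}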

I read somewhere that in cases like this in Java it's better to store the data as arrays -- essentially a column-based store rather than a row-based store. I think I understand this: the column-based way reduces the number of references, and perhaps the JVM can even pack the ints into 8-byte words intelligently.

What other tricks can I use for reducing the memory footprint of what is essentially a memory block that has one very large dimension (millions/billions of data points) and one very small dimension (the O(1) number of columns/variables)?

It turns out that storing the data as four int arrays uses exactly 16 bytes per entry. The lesson: small objects have nasty proportional overhead in Java.
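A minimal sketch of that struct-of-arrays layout (the class and method names here are illustrative, not from the original code):

// Column-based store: one primitive array per field instead of one object per row.
public class DataPointColumns {
    final int[] cents, time, productId, storeId;
    private int size = 0;

    public DataPointColumns(int capacity) {
        cents = new int[capacity];
        time = new int[capacity];
        productId = new int[capacity];
        storeId = new int[capacity];
    }

    public void add(int c, int t, int p, int s) {
        cents[size] = c;
        time[size] = t;
        productId[size] = p;
        storeId[size] = s;
        size++;
    }

    // Read row i as cents[i], time[i], etc., instead of points[i].cents.
}

The per-column overhead (array header plus one reference) is constant, so it amortizes to nothing over millions of entries, which is why this lands at exactly 16 bytes per data point.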

It isn't straightforward to see how much memory your data structure takes in Java. totalMemory() shows the space allocated to the VM, which is larger than the actual usage. You could try a Java profiler that shows the space consumption of your data structures; they are quite easy to set up and run. One handy free tool is Java's own VisualVM, which, among other things, shows the memory behaviour of your application; you will also learn a bit about how Java's GC works by using it.

VisualVM screenshot showing performance footprint (image from http://visualvm.java.net/features.html ).

You should also consider making the variables final if possible. It allows the JVM to optimize the code a bit better (not sure if it saves space, though).

First of all, an object in Java will always be slightly larger than its C++ counterpart, since every object carries a header with runtime type information that enables things like instanceof, which is not possible in C++. That header also supports the memory management you would otherwise have to do manually, so you could consider the manual memory-management parts of your C++ code as not really part of the code base.

You could look into the Flyweight pattern to reduce memory requirements by reusing DataPoints (make the class immutable). I assume that if you have billions of points, as you say, some will probably have the same values; see the sketch below.
I am sure others here will give some more concrete information on optimizing memory usage.
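A sketch of what that interning could look like, assuming DataPoint is made immutable and overrides equals and hashCode (the factory class here is hypothetical):

import java.util.HashMap;
import java.util.Map;

// Flyweight-style interning: identical points share a single instance.
// Only pays off if many DataPoints really are duplicates; otherwise the
// cache itself just adds overhead.
public class DataPointFactory {
    private final Map<DataPoint, DataPoint> cache = new HashMap<>();

    public DataPoint intern(int cents, int time, int productId, int storeId) {
        DataPoint candidate = new DataPoint(cents, time, productId, storeId);
        DataPoint existing = cache.putIfAbsent(candidate, candidate);
        // If an equal point was already cached, discard the candidate and reuse it.
        return existing != null ? existing : candidate;
    }
}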

Depending on the value ranges, you might be able to use smaller data types. Can you get away with using byte or short for some of the members?
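For example, a narrower variant of the column layout, assuming (and this is an assumption about your data) that cents fits in a short and store_id in a byte:

// Narrower columns; the ranges must actually fit:
// short covers -32768..32767, byte covers -128..127.
public class CompactColumns {
    final short[] cents;    // 2 bytes per entry instead of 4
    final int[] time;       // keep the full 32-bit range where needed
    final int[] productId;
    final byte[] storeId;   // 1 byte per entry instead of 4

    public CompactColumns(int capacity) {
        cents = new short[capacity];
        time = new int[capacity];
        productId = new int[capacity];
        storeId = new byte[capacity];
    }
}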
