
Java in-memory size optimization

I'm writing some "big data" software that needs to hold a lot of data in memory. I wrote a prototype in C++ that works great. However, the actual end users typically code in Java, so they've asked me to also write a Java prototype.

I've done background reading on memory footprint in Java and run some preliminary tests. For example, let's say I have this object:

public class DataPoint {

    int cents, time, product_id, store_id;

    public DataPoint(int cents, int time, int product_id, int store_id) {
        this.cents = cents;
        this.time = time;
        this.product_id = product_id;
        this.store_id = store_id;
    }
}

In C++ the sizeof this structure is 16 bytes, which makes sense. In Java we have to measure indirectly. If I create, e.g., 10 million of these objects, take Runtime.totalMemory() - Runtime.freeMemory() before and after, and divide by the object count, I get approximately 36 bytes per structure. A ~2.25x memory difference is pretty nasty; it's going to get ugly when we try to hold hundreds of millions of DataPoints in memory.
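The before/after measurement described above can be sketched roughly as follows. The class name FootprintEstimate and its helpers are made up for illustration, and System.gc() is only a hint to the JVM, so the result is an estimate rather than an exact size.

```java
// Rough heap-footprint estimate; treat the number as approximate.
class FootprintEstimate {

    // Same four int fields as the DataPoint class in the question.
    static final class DataPoint {
        final int cents, time, productId, storeId;

        DataPoint(int cents, int time, int productId, int storeId) {
            this.cents = cents;
            this.time = time;
            this.productId = productId;
            this.storeId = storeId;
        }
    }

    // Heap currently in use, after asking (not forcing) the JVM to collect garbage.
    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        System.gc();
        return rt.totalMemory() - rt.freeMemory();
    }

    // Allocates n points, keeps them reachable, and returns the heap growth per point.
    static double bytesPerPoint(int n) {
        long before = usedHeap();
        DataPoint[] points = new DataPoint[n];
        for (int i = 0; i < n; i++) {
            points[i] = new DataPoint(i, i, i, i);
        }
        long after = usedHeap();
        // Read from the array after measuring so it is still live at that point.
        if (points[n - 1].cents != n - 1) {
            throw new IllegalStateException("unexpected value");
        }
        return (after - before) / (double) n;
    }

    public static void main(String[] args) {
        System.out.printf("~%.1f bytes per point%n", bytesPerPoint(1_000_000));
    }
}
```

On a typical 64-bit HotSpot JVM with compressed references, a figure near 36 bytes would be consistent with a 12-byte object header plus 16 bytes of fields, padded to 32 bytes, plus a 4-byte reference in the array that holds the object. That breakdown is an assumption about the JVM in use, not something this measurement proves.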

I read somewhere that in cases like this it's better in Java to store the data as arrays: essentially a column-based store rather than a row-based store. I think I understand this: the column-based way reduces the number of references, and perhaps the JVM can even pack the ints into 8-byte words intelligently.

What other tricks can I use to reduce the memory footprint of what is essentially a memory block that has one very large dimension (millions or billions of data points) and one very small dimension (the O(1) number of columns/variables)?

Update: it turns out that storing the data as 4 int arrays used exactly 16 bytes per entry. The lesson: small objects have nasty proportional overhead in Java.
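A minimal sketch of the four-array layout mentioned above, assuming a simple append-only store. The class name DataPointColumns and its methods are hypothetical, not from the original post.

```java
import java.util.Arrays;

// Column-oriented store: one primitive int array per field, no per-record object.
class DataPointColumns {
    private int[] cents, time, productId, storeId;
    private int size;

    DataPointColumns(int initialCapacity) {
        int cap = Math.max(1, initialCapacity);
        cents = new int[cap];
        time = new int[cap];
        productId = new int[cap];
        storeId = new int[cap];
    }

    // Appends one record and returns its row index.
    int add(int c, int t, int p, int s) {
        if (size == cents.length) {
            grow();
        }
        cents[size] = c;
        time[size] = t;
        productId[size] = p;
        storeId[size] = s;
        return size++;
    }

    // Doubles every column; multiplyExact throws instead of silently overflowing.
    private void grow() {
        int newCap = Math.multiplyExact(cents.length, 2);
        cents = Arrays.copyOf(cents, newCap);
        time = Arrays.copyOf(time, newCap);
        productId = Arrays.copyOf(productId, newCap);
        storeId = Arrays.copyOf(storeId, newCap);
    }

    int size() { return size; }
    int cents(int row) { return cents[checkRow(row)]; }
    int time(int row) { return time[checkRow(row)]; }
    int productId(int row) { return productId[checkRow(row)]; }
    int storeId(int row) { return storeId[checkRow(row)]; }

    private int checkRow(int row) {
        if (row < 0 || row >= size) {
            throw new IndexOutOfBoundsException("row " + row);
        }
        return row;
    }
}
```

Each row costs 4 x 4 = 16 bytes of array payload, plus a fixed per-array overhead that does not grow with the row count. Java arrays are indexed by int, so a single store like this tops out at about 2 billion rows; going beyond that would need several chunks.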

It isn't that straightforward to see how much memory your data structure takes in Java. totalMemory() shows the space allocated for the VM, which is larger than the actual usage. You could try a Java profiler that shows the space consumption of your data structures; they are quite easy to set up and run. One handy free tool is Java's own VisualVM, which shows, for example, the memory behaviour of your application. You will also learn a bit about how Java's GC works if you use it.

VisualVM screenshot showing performance footprint (image from http://visualvm.java.net/features.html)

You should also consider making the variables final if possible. It allows the Java VM to optimize the code a bit better (not sure if it saves space, though).

First of all, an object in Java will always be slightly larger than its C++ counterpart, since the object carries runtime type information that enables operations such as instanceof, which a plain C++ struct does not carry. It also facilitates the memory management that you would otherwise have to do by hand, so that part of your C++ code no longer needs to be in your code base.

You could look into the Flyweight pattern to reduce memory requirements, so that you reuse the DataPoints (make the class immutable). I assume that if you have billions of points, as you say, some will probably have the same values.
I am sure others here will give more concrete information on optimizing memory space.
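A small sketch of what the Flyweight idea could look like here, assuming duplicates are detected by comparing all four fields. The names PointPool, Point, and of are invented for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Flyweight-style pool: equal points share one immutable instance.
class PointPool {

    // Immutable value object; equals/hashCode cover all four fields.
    static final class Point {
        final int cents, time, productId, storeId;

        private Point(int cents, int time, int productId, int storeId) {
            this.cents = cents;
            this.time = time;
            this.productId = productId;
            this.storeId = storeId;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Point)) return false;
            Point p = (Point) o;
            return cents == p.cents && time == p.time
                    && productId == p.productId && storeId == p.storeId;
        }

        @Override
        public int hashCode() {
            int h = cents;
            h = 31 * h + time;
            h = 31 * h + productId;
            h = 31 * h + storeId;
            return h;
        }
    }

    private final Map<Point, Point> pool = new HashMap<>();

    // Returns the shared instance for these values, creating it on first use.
    Point of(int cents, int time, int productId, int storeId) {
        Point candidate = new Point(cents, time, productId, storeId);
        Point existing = pool.putIfAbsent(candidate, candidate);
        return existing != null ? existing : candidate;
    }

    // Number of distinct points currently held by the pool.
    int distinctCount() {
        return pool.size();
    }
}
```

The pool itself costs memory (a map entry per distinct point), so this only pays off when a large share of the points really are duplicates. With a timestamp among the fields that may not be the case, and it is worth measuring before committing to the approach.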

Depending on the value ranges, you might be able to use smaller data types. Can you get away with using byte or short for some of the members?
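As an illustration only, here is how narrower column types might look if product IDs fit in a short and store IDs fit in a byte. Those ranges are assumptions made for the example, not facts from the question, and NarrowColumns is a hypothetical name.

```java
// Fixed-capacity columns with narrower types where the value range allows it.
class NarrowColumns {
    private final int[] cents;
    private final int[] time;
    private final short[] productId; // assumption: IDs fit in -32768..32767
    private final byte[] storeId;    // assumption: IDs fit in -128..127

    NarrowColumns(int capacity) {
        cents = new int[capacity];
        time = new int[capacity];
        productId = new short[capacity];
        storeId = new byte[capacity];
    }

    // Stores one row; throws ArithmeticException rather than truncating silently.
    void set(int row, int c, int t, int p, int s) {
        cents[row] = c;
        time[row] = t;
        productId[row] = toShortExact(p);
        storeId[row] = toByteExact(s);
    }

    int cents(int row) { return cents[row]; }
    int time(int row) { return time[row]; }
    int productId(int row) { return productId[row]; }
    int storeId(int row) { return storeId[row]; }

    // Array payload per row: 4 + 4 + 2 + 1 bytes.
    static int bytesPerRow() {
        return 2 * Integer.BYTES + Short.BYTES + Byte.BYTES;
    }

    static short toShortExact(int v) {
        if (v < Short.MIN_VALUE || v > Short.MAX_VALUE) {
            throw new ArithmeticException("short overflow: " + v);
        }
        return (short) v;
    }

    static byte toByteExact(int v) {
        if (v < Byte.MIN_VALUE || v > Byte.MAX_VALUE) {
            throw new ArithmeticException("byte overflow: " + v);
        }
        return (byte) v;
    }
}
```

Narrow types save space reliably in arrays, where elements are packed. Inside a single object, field padding and alignment can absorb much of the saving.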
