
How much memory should a Java HashSet<Long> take?

I wanted to use a HashSet<Long> to store a large list of unique numbers in memory. I calculated the approximate memory it would consume (assuming 64-bit pointers):

A Long would take 16 bytes of space, so initially I multiplied the number of entries by 16 to get the memory. But in reality, the memory was much more than 16 bytes per entry. After that I studied the HashSet implementation. In short, the underlying implementation actually stores an extra dummy object (12 bytes) with each entry of the hash set, plus a pointer (8 bytes) to the next entry, adding an extra 12 + 8 bytes per entry.

So the total memory per entry should be 16 + 12 + 8 = 36 bytes. But when I ran the code and checked the memory, it was still much more than 36 bytes per entry.
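The naive estimate above can be written out as follows; the constants are the question's assumptions about one JVM's layout, not guarantees:

```java
public class NaiveEstimate {
    public static void main(String[] args) {
        // The question's assumed sizes; real JVM layouts differ.
        int boxedLong = 16; // assumed size of a boxed Long
        int dummy     = 12; // assumed dummy value object stored per entry
        int nextRef   = 8;  // assumed pointer to the next entry
        System.out.println(boxedLong + dummy + nextRef + " bytes per entry");
    }
}
```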

My question (in short): How much memory does a HashSet take (for instance, on a 64-bit machine)?

You can measure this size exactly using this test:

    long m1 = Runtime.getRuntime().freeMemory();
    // create object(s) here
    long m2 = Runtime.getRuntime().freeMemory();
    System.out.println(m1 - m2);

to be run with the -XX:-UseTLAB option.
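Put together as a runnable class, the measurement might look like this sketch; the printed number varies by JVM and is only meaningful with the flag above:

```java
import java.util.HashSet;

public class MemTest {
    public static void main(String[] args) {
        long m1 = Runtime.getRuntime().freeMemory();
        HashSet<Long> set = new HashSet<>(); // the object under test
        long m2 = Runtime.getRuntime().freeMemory();
        // Bytes consumed by the allocation; only meaningful with -XX:-UseTLAB,
        // and a GC between the two calls can still distort the result.
        System.out.println(m1 - m2);
        System.out.println(set.size()); // keep the set reachable
    }
}
```

Run it with `java -XX:-UseTLAB MemTest`.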

On my 64-bit HotSpot, an empty HashSet takes 480 bytes.

Why so much? Because HashSet has a complex structure (by the way, an IDE in debug mode helps to see the actual fields). It is based on HashMap (the Adapter pattern), so a HashSet itself contains a reference to a HashMap. A HashMap contains 8 fields, and the actual data live in an array of Nodes. A Node has: int hash; K key; V value; Node next. HashSet uses only the keys and puts a dummy object in the values.

The size of objects is an implementation detail. There is no guarantee that if it's x bytes on one platform, it's also x bytes on another.

Long is boxed, as you know, but 16 bytes is wrong. The primitive long takes 8 bytes, but the size of the box around the long is implementation-dependent. According to this HotSpot-related answer, overhead words and padding mean a boxed 4-byte int can come to 24 bytes!

The byte alignment and padding mentioned in that (HotSpot-specific) answer would also apply to the Entry objects, which pushes the consumption up further.

The memory used is 32 * SIZE + 4 * CAPACITY + (16 * SIZE), where SIZE is the number of elements.
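Plugging numbers into that formula, taken at face value as one answerer's model rather than a JVM guarantee, gives for example:

```java
public class FormulaDemo {
    public static void main(String[] args) {
        long size = 1_000_000;     // number of elements stored
        long capacity = 2_097_152; // smallest power of two with size <= capacity * 0.75 (assumed)
        // The model above: 32*SIZE + 4*CAPACITY + 16*SIZE
        long bytes = 32 * size + 4 * capacity + 16 * size;
        System.out.println(bytes); // 56388608, i.e. roughly 54 MB
    }
}
```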

HashMap's default size is 16 HashMapEntry slots. Every HashMapEntry has four fields (int keyHash, Object next, Object key, Object value), so it introduces overhead just by wrapping the elements, even for empty slots. Additionally, the hash map grows by doubling its capacity, so for 17 elements you'll have 32 slots, 15 of them empty.
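The doubling can be sketched with a simplified model of how HashMap sizes its table at the default 0.75 load factor (the real sizing logic is package-private; this is an illustration, not the actual code):

```java
public class CapacityDemo {
    // Capacity HashMap ends up with for n elements at the default 0.75 load
    // factor: start at 16 and double until n fits under the threshold.
    static int capacityFor(int n) {
        int cap = 16; // default initial capacity
        while (n > cap * 0.75) cap <<= 1;
        return cap;
    }

    public static void main(String[] args) {
        System.out.println(capacityFor(12)); // 16
        System.out.println(capacityFor(17)); // 32
    }
}
```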

An easier way is to check a heap dump with a memory analyzer.

A HashSet is a complicated beast. Off the top of my head, and after reviewing some of the comments, here are some items consuming memory that you have not accounted for:

  1. Java collections (true collections, not plain arrays) can only take object references, not primitives. Therefore, your long primitive gets boxed into a java.lang.Long object and a reference to it is added to the HashSet. Somebody mentioned that a Long object will be 24 bytes. Plus the reference, which is 8 bytes.
  2. The hash table buckets are collections. I don't recall if they are arrays, ArrayList, LinkedList, etc., but because hashing algorithms can produce collisions, the elements of the HashSet must be put into collections, which are organized by hash code. Best case is an ArrayList with just 1 element: your Long object. The default backing array size for ArrayList is 10, so you have 10 object references within the object, so at least 80 bytes now per Long. Since Long is an integer, I suspect the hashing algorithm does a good job spreading things out. I'm not sure what would happen to a long whose value exceeded Integer.MAX_VALUE; by the pigeonhole principle, such values would have to collide somehow.
  3. The actual hash table: a HashSet is basically a HashMap in which the value is not interesting. Under the hood, it creates a HashMap, which has an array of buckets in it to represent the hash table. The array size is based on the capacity, which is not obvious from the number of elements you added.
  4. The hash table will usually, intentionally, have more buckets than needed, in order to make future growth easier. Hopefully it's not a lot more, but don't expect 5 elements to take exactly 5 buckets.
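Putting these items together, a rough per-entry budget might look like this sketch. Every constant below is an assumption about one 64-bit HotSpot build (no compressed oops), not a guarantee:

```java
public class PerEntryEstimate {
    public static void main(String[] args) {
        // All constants are assumptions about one JVM build; others will differ.
        int boxedLong = 24; // Long: header + long field + padding (assumed)
        int node      = 40; // HashMap.Node: header + hash + key + value + next (assumed)
        int slotRef   = 8;  // one reference in the bucket array (assumed)
        // At the default 0.75 load factor the table has at least 4/3 slots per element.
        double perEntry = boxedLong + node + slotRef * (4.0 / 3.0);
        System.out.printf("~%.0f bytes per entry%n", perEntry);
    }
}
```

Under these assumptions the estimate lands near 75 bytes per entry, roughly double the question's 36-byte figure.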

Long story short, hash tables are a memory-intensive data structure. It's the space/time trade-off: assuming a good hash distribution, you get constant-time look-ups at the cost of extra memory usage.
