
How much memory Java HashSet<Long> should take

I wanted to use a HashSet<Long> for storing a large list of unique numbers in memory. I calculated the approximate memory it would consume (assuming 64-bit pointers):

A Long would take 16 bytes of space, so initially I multiplied the number of entries by 16 to get the memory. But in reality, the memory was much more than 16 bytes per entry. After that I studied the HashSet implementation. In short, the underlying implementation actually stores an extra dummy object (12 bytes) with each entry of the HashSet, plus a pointer (8 bytes) to the next entry, consuming an extra 12+8 bytes per entry.

So the total memory per entry should be 16+12+8 = 36 bytes. But when I ran the code and checked the memory, it was still much more than 36 bytes per entry.

My question (in short): how much memory per entry does a HashSet<Long> take (for instance, on a 64-bit machine)?

You can measure this size exactly using this test:

    long m1 = Runtime.getRuntime().freeMemory();
    // create the object(s) to be measured here
    long m2 = Runtime.getRuntime().freeMemory();
    System.out.println(m1 - m2);

to be run with the -XX:-UseTLAB option, so that allocations bypass the thread-local allocation buffers and show up in freeMemory() immediately.
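To measure the per-entry cost of a whole HashSet<Long>, you can extend the same idea. Below is a minimal, self-contained sketch of such a measurement; the class name, element count and the use of totalMemory() - freeMemory() are my own choices, and the result is only a rough estimate:

    import java.util.HashSet;
    import java.util.Set;

    // Rough per-entry memory measurement for HashSet<Long>; run with -XX:-UseTLAB.
    public class HashSetMemory {
        public static void main(String[] args) {
            final int n = 1_000_000;
            Runtime rt = Runtime.getRuntime();

            System.gc();                                  // settle the heap before the first sample
            long before = rt.totalMemory() - rt.freeMemory();

            Set<Long> set = new HashSet<>();
            for (long i = 0; i < n; i++) {
                set.add(i);                               // each long is autoboxed into a Long
            }

            System.gc();                                  // drop garbage created while filling the set
            long after = rt.totalMemory() - rt.freeMemory();

            System.out.println("approx. bytes per entry: " + (after - before) / (double) n);
        }
    }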

On my 64-bit HotSpot JVM, an empty HashSet takes 480 bytes.

Why so much? Because HashSet has a complex structure (by the way, an IDE in debug mode helps to see the actual fields). It is based on HashMap (the Adapter pattern), so a HashSet itself holds a reference to a HashMap. The HashMap contains 8 fields, and the actual data live in an array of Node objects. A Node has: int hash; K key; V value; Node next. HashSet uses only the keys and puts a dummy object into the values.
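A simplified sketch of that layout (paraphrased, not the real OpenJDK code, with most members omitted):

    // Simplified sketch of the structures behind HashSet (paraphrased from OpenJDK).
    class SimplifiedHashSet<E> {
        // Inside HashMap, every entry is wrapped in a Node: four fields plus the object header.
        static class Node<K, V> {
            final int hash;      // cached hash of the key
            final K key;
            V value;
            Node<K, V> next;     // next entry in the same bucket (collision chain)

            Node(int hash, K key, V value, Node<K, V> next) {
                this.hash = hash;
                this.key = key;
                this.value = value;
                this.next = next;
            }
        }

        // HashSet stores every element as a key, paired with one shared dummy value.
        private static final Object PRESENT = new Object();
        private final java.util.HashMap<E, Object> map = new java.util.HashMap<>();

        public boolean add(E e) {
            return map.put(e, PRESENT) == null;   // delegate to the backing HashMap
        }
    }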

The size of objects is an implementation detail. There is no guarantee that if it is x bytes on one platform, it will also be x bytes on another.

Long is boxed, as you know, but 16 bytes is wrong. The primitive long takes 8 bytes, but the size of the box around it is implementation dependent. According to this HotSpot-related answer, overhead words and padding mean a boxed 4-byte int can come to 24 bytes!

The byte alignment and padding mentioned in that (HotSpot-specific) answer would also apply to the Entry objects, which pushes the consumption up further.
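If you want to see the exact header, field and padding layout on your particular JVM, one option (my suggestion, not something from the answers above) is the OpenJDK JOL library, assuming the org.openjdk.jol:jol-core dependency is on the classpath:

    import org.openjdk.jol.info.ClassLayout;

    public class LongLayout {
        public static void main(String[] args) {
            // Prints the header size, field offsets and any alignment padding of a boxed Long.
            System.out.println(ClassLayout.parseInstance(Long.valueOf(42L)).toPrintable());
        }
    }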

The memory used is 32 * SIZE + 4 * CAPACITY + (16 * SIZE), "SIZE" being the number of elements.

HashMap's default size is 16 HashMapEntry slots. Every HashMapEntry has four fields (int keyHash, Object next, Object key, Object value), so it introduces overhead just by wrapping the elements. Additionally, the HashMap doubles its table when it fills up: with the default load factor of 0.75, the 16-slot table grows to 32 slots as soon as the 13th element is inserted, so many slots sit empty.
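Putting the formula and the doubling rule together, a rough worked example (the numbers here are mine and assume compressed oops, i.e. 4-byte references in the table): for 1,000,000 elements the table capacity ends up at 2,097,152, the first power of two above 1,000,000 / 0.75, so the estimate is 32 * 1,000,000 + 4 * 2,097,152 + 16 * 1,000,000 ≈ 56.4 MB, or roughly 56 bytes per element; far more than the 36 bytes estimated in the question.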

An easier way is to check a heap dump with a memory analyzer.
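For example (my own suggestion, not part of the original answer), you can capture a dump of a running JVM with jmap -dump:live,format=b,file=heap.hprof <pid>, or programmatically through the HotSpot diagnostic MXBean as sketched below, and then open the .hprof file in a heap analyzer such as Eclipse MAT:

    import java.lang.management.ManagementFactory;
    import com.sun.management.HotSpotDiagnosticMXBean;

    public class HeapDump {
        public static void main(String[] args) throws Exception {
            // Ask the HotSpot JVM to write a heap dump containing live objects only.
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            bean.dumpHeap("hashset.hprof", true);   // file name is arbitrary
        }
    }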

A HashSet is a complicated beast. Off the top of my head and after reviewing some of the comments, here are some items consuming memory that you have not accounted for:

  1. Java collections (true collections, not plain arrays) can only hold object references, not primitives. Therefore, your long primitive gets boxed into a java.lang.Long object and a reference to it is added to the HashSet. Somebody mentioned that a Long object will be 24 bytes, plus the reference, which is 8 bytes.
  2. The hash table buckets hold the colliding entries. They are not ArrayLists: in HashMap each bucket is a singly-linked chain of Node objects (every entry carries a next reference, and since Java 8 long chains are converted to red-black trees), because the hashing algorithm can produce collisions and colliding elements must share a bucket. Each of those Node wrappers costs memory on top of the Long itself. Since Long is an integer type, its hash code spreads values out well, even for a long whose value exceeds Integer.MAX_VALUE: Long.hashCode() folds the high 32 bits into the low 32 bits. Still, with only 2^32 possible hash codes, collisions become likely long before you store that many values (the birthday paradox) and unavoidable beyond it; see the sketch after this list.
  3. The actual hash table: a HashSet is basically a HashMap in which the value is not interesting. Under the hood it creates a HashMap, which holds an array of buckets representing the hash table. The array size is based on the capacity, which is not simply the number of elements you added.
  4. The hash table will usually, intentionally, have more buckets than needed, in order to make future growth cheaper. Hopefully it is not a lot more, but do not expect that 5 elements take exactly 5 buckets.
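As mentioned in item 2, here is a small sketch of how a long value ends up in a bucket. The logic mirrors what OpenJDK's Long.hashCode() and HashMap do; the class and method names are my own:

    // Simplified: how a long value is mapped to a HashMap bucket index.
    public class BucketIndex {
        static int bucketFor(long value, int tableLength) {
            int h = (int) (value ^ (value >>> 32));   // Long.hashCode(): fold high bits into low bits
            h = h ^ (h >>> 16);                       // HashMap.hash(): spread high bits downward
            return h & (tableLength - 1);             // table length is a power of two, so the mask acts as modulo
        }

        public static void main(String[] args) {
            // Values far above Integer.MAX_VALUE still spread across buckets.
            System.out.println(bucketFor(5_000_000_000L, 16));
            System.out.println(bucketFor(5_000_000_001L, 16));
        }
    }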

Long story short, hash tables are a memory-intensive data structure. It is the classic space/time trade-off: assuming a good hash distribution, you get constant-time look-ups at the cost of extra memory usage.
