What is the optimal hashmap capacity/load factor for 5-10M entries?

I have about 5-10M entries in a HashMap, and I can't change the code structure. I'm running Java with -Xms512m -Xmx1024m. What are the optimal capacity/load-factor values in the HashMap constructor to avoid java.lang.OutOfMemoryError: GC overhead limit exceeded?

private final Map<String, ReportResultView> aggregatedMap = new HashMap<>(????, ????);

Summary: In this scenario the load factor might seem interesting, but it can't be the underlying cause of your OOMEs, since the load factor only controls how much of the backing array is wasted (empty) space, and at the default load factor of 0.75 that waste amounts to only ~2.5% of your heap (and empty slots don't create high object-count GC pressure). More likely, the space used by your stored objects and their associated HashMap.Entry objects has consumed the heap.

Details: The load factor of a HashMap controls the size of the underlying array of references used by the map. A higher load factor lets the array get fuller before the map resizes, meaning fewer empty array elements for a given number of entries. So in general, increasing the load factor results in less memory use, since there are fewer empty array slots. [3]
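
To make that concrete, here is a minimal sketch of how the backing table length relates to entry count and load factor; it assumes HashMap's power-of-two table sizing, which the standard implementation uses:

    // Estimate HashMap's backing table length: HashMap resizes once
    // size exceeds (tableLength * loadFactor), and keeps table lengths
    // at powers of two.
    static int estimatedTableLength(int entries, float loadFactor) {
        int needed = (int) Math.ceil(entries / (double) loadFactor);
        return needed <= 1 ? 1 : Integer.highestOneBit(needed - 1) << 1; // next power of two
    }

    // estimatedTableLength(10_000_000, 0.75f) -> 16,777,216 slots
    // estimatedTableLength(10_000_000, 0.95f) -> 16,777,216 slots (same power of two)
    // estimatedTableLength(10_000_000, 0.25f) -> 67,108,864 slots

Note that because of the power-of-two rounding, nudging the load factor often doesn't change the table size at all.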

That established, it is unlikely you can solve your OOMEs by adjusting the load factor. An empty array element only "wastes" 4 bytes [1], so for a map of 5M-10M entries, a load factor of 0.75 (the default) will waste something like 25 MB of memory [2].

That's only a small fraction of the 1,024 MB of heap memory you are allocating, so adjusting your load factor isn't going to solve your OOMEs (unless you were using something very silly, like an extremely low load factor of 0.05). The default load factor will be fine.

Most likely it is the actual size of the objects, and of the Entry objects stored in the HashMap, that is causing the problem. Each mapping has a HashMap.Entry object that holds the key/value pair and a couple of other fields (e.g., the cached hash code, and a pointer to the next entry when chained). The Entry object itself consumes about 32 bytes; add the backing array's references (a bit more than 4 bytes per mapping, since the array is larger than the entry count) and you're at roughly 40 bytes * 10M entries = ~400 MB of heap for the overhead of the entries alone. Then the actual objects you are storing take space too: if your objects have even a handful of fields, they will be at least as large as the Entry objects, and your heap is pretty much exhausted.
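
As a back-of-the-envelope check, here is that arithmetic spelled out (the 32-byte Entry size and the 16.8M-slot table length are the assumptions from the text and note [2], not measured values):

    // Rough structural overhead of the map itself, excluding keys and values.
    long entries    = 10_000_000L;
    long entryBytes = 32L;          // assumed HashMap.Entry size on HotSpot with compressed oops
    long tableSlots = 16_777_216L;  // power-of-two table length for 10M entries at 0.75
    long refBytes   = 4L;           // one compressed reference per table slot
    long overhead   = entries * entryBytes + tableSlots * refBytes;
    // = 320,000,000 + 67,108,864 bytes, i.e. roughly 390 MB before the
    //   String keys and ReportResultView values are even counted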

The fact that you are getting a GC overhead limit exceeded error, rather than a plain heap allocation failure, generally means you are approaching the heap limit slowly while churning a lot of objects: the GC tends to fail in that way in such a scenario, before actually running out of space.

So most likely you simply need to allocate more heap to your application, find a way to store fewer elements, or reduce the per-element size (e.g., with a different data structure or a more compact object representation).
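
For example, the first option is often the easiest: if your environment allows it, just raise the heap ceiling (the 2 GB figure below is an illustrative guess, not a computed requirement; size it to your measured footprint):

    java -Xms2g -Xmx2g YourApplication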


[1] Usually 4 bytes on HotSpot, even when running the 64-bit JVM - although it may be 8 bytes on some 64-bit platforms if compressed oops are disabled for some reason.

[2] HashMap table lengths are powers of two, so 10M entries at the default 0.75 load factor land in a table of 2^24 = 16,777,216 slots, at least ~6.8M of which are empty; at 4 bytes per slot that is roughly 25-30 MB. Transiently the waste can be higher: just after a doubling resize the table is only ~37.5% full (0.75 / 2), and during the rehash itself both the old and new backing arrays are on the heap simultaneously, adding another factor of ~1.5 in the worst case. Once the map's size stabilizes, though, this doesn't apply.

[3] This is true even with chaining, since chaining doesn't generally increase memory use (i.e., each Entry element already has the "next" pointer embedded, whether or not the element is part of a chain). Java 8 complicates things, since the HashMap implementation was improved such that large chains may be converted into trees, which may increase the footprint.

"to avoid java.lang.OutOfMemoryError: GC overhead limit exceeded?"

When a HashMap resizes, it needs to reallocate its internal table. So you need to give your VM enough memory to juggle that temporary copy, or presize the HashMap to prevent the resizing from ever occurring.
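
A minimal sketch of the presizing option, assuming an upper bound of 10M entries (the field and the ReportResultView type are from the question):

    // Assumed upper bound on entries, from the question.
    private static final int EXPECTED_ENTRIES = 10_000_000;
    private static final float LOAD_FACTOR = 0.75f; // the default is fine

    // Pick a capacity with EXPECTED_ENTRIES <= capacity * LOAD_FACTOR. HashMap
    // rounds it up to the next power of two (16,777,216 here), so the map is
    // never rehashed while it fills.
    private final Map<String, ReportResultView> aggregatedMap =
            new HashMap<>((int) Math.ceil(EXPECTED_ENTRIES / (double) LOAD_FACTOR), LOAD_FACTOR);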

You could also take a look at the hashmap implementation from https://github.com/boundary/high-scale-lib , which should provide less disruptive resizing behavior.
