
HashMap resize method implementation detail

As the title suggests, this is a question about an implementation detail of HashMap#resize - the point where the inner array is doubled in size. It's a bit wordy, but I've really tried to prove that I did my best to understand this...

This happens at a point when the entries in this particular bucket/bin are stored as a linked list - so they have an exact order, and in the context of this question that order is important.

Generally the resize could be called from other places as well, but let's look at this case only.

Suppose you put these strings as keys into a HashMap (on the right are the last five bits of the hash code after HashMap#hash - that's the internal re-hashing). Yes, these are carefully generated, not random.

 DFHXR - 11111
 YSXFJ - 01111 
 TUDDY - 11111 
 AXVUH - 01111 
 RUTWZ - 11111
 DEDUC - 01111
 WFCVW - 11111
 ZETCU - 01111
 GCVUR - 11111 

There's a simple pattern to notice here - the last 4 bits are the same for all of them - which means that when we insert 8 of these keys (there are 9 in total), they will all end up in the same bucket; and on the 9th HashMap#put, resize will be called.
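That bit pattern is easy to verify. The sketch below reproduces the bit-mixing step of JDK 8's HashMap.hash and the bucket-index computation; the class and method names here are my own, only the formulas come from the JDK source:

```java
// Sketch: reproduce HashMap's internal hash spreading (JDK 8's
// HashMap.hash) and the bucket index computation. Class and method
// names are hypothetical; only the bit operations mirror the JDK.
public class BucketDemo {
    // Same mixing step as java.util.HashMap.hash(Object key)
    static int spread(Object key) {
        int h = key.hashCode();
        return h ^ (h >>> 16);
    }

    // For a table of n buckets (n a power of two), the index is (n - 1) & hash
    static int bucketIndex(Object key, int n) {
        return (n - 1) & spread(key);
    }

    public static void main(String[] args) {
        String[] keys = {"DFHXR", "YSXFJ", "TUDDY", "AXVUH", "RUTWZ",
                         "DEDUC", "WFCVW", "ZETCU", "GCVUR"};
        for (String k : keys) {
            // Print the low 5 bits of the spread hash and the index in a 16-bucket table
            System.out.printf("%s  low5=%5s  index16=%d%n", k,
                    Integer.toBinaryString(spread(k) & 0b11111), bucketIndex(k, 16));
        }
    }
}
```

With 16 buckets, `(n - 1)` is `0b1111`, so only the last 4 bits matter - which is why all nine keys collide.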

So if there are currently 8 entries (with the keys above) in the HashMap - it means there are 16 buckets in this map, and the last 4 bits of the key decided where each entry ended up.

We put the ninth key. At this point TREEIFY_THRESHOLD is hit and resize is called (the table is still smaller than MIN_TREEIFY_CAPACITY, which is 64, so the bin is not converted to a tree). The table is doubled to 32 buckets and one more bit of the key decides where each entry will go (so, 5 bits now).

Ultimately this piece of code is reached (when resize happens):

 Node<K,V> loHead = null, loTail = null; // "lo" list: entries that stay at index j
 Node<K,V> hiHead = null, hiTail = null; // "hi" list: entries that move to index j + oldCap
 Node<K,V> next;
 do {
     next = e.next;
     if ((e.hash & oldCap) == 0) {   // the new bit is 0: append to the "lo" list
          if (loTail == null)
               loHead = e;
          else
               loTail.next = e;
          loTail = e;
     }
     else {                          // the new bit is 1: append to the "hi" list
         if (hiTail == null)
             hiHead = e;
         else
             hiTail.next = e;
         hiTail = e;
     }
 } while ((e = next) != null);



 if (loTail != null) {
     loTail.next = null;
     newTab[j] = loHead;            // "lo" list keeps its old index
 }
 if (hiTail != null) {
     hiTail.next = null;
     newTab[j + oldCap] = hiHead;   // "hi" list moves by exactly oldCap
 }

It's actually not that complicated... what it does is split the current bin into the entries that will move to another bin and the entries that will stay in this one for sure.

And it's actually pretty smart how it does that - it's via this piece of code:

 if ((e.hash & oldCap) == 0) 

What this does is check whether the next bit (the 5th in our case) is zero - if it is, the entry stays where it is; if it's not, the entry moves by a power-of-two offset in the new table.
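With the example keys this is easy to see: oldCap is 16, so `hash & oldCap` isolates exactly the 5th bit. A small sketch (the spread helper mirrors HashMap.hash; the class name is mine):

```java
// Sketch of the split test: with oldCap = 16, (hash & oldCap) isolates
// bit 4 of the hash. spread() mirrors JDK 8's HashMap.hash; the class
// name is hypothetical.
public class SplitDemo {
    static int spread(Object key) {
        int h = key.hashCode();
        return h ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int oldCap = 16; // table size before the resize
        // "YSXFJ" hashes to ...01111: bit 4 is 0, so it stays in bucket 15
        System.out.println((spread("YSXFJ") & oldCap) == 0); // prints true
        // "DFHXR" hashes to ...11111: bit 4 is 1, so it moves to bucket 15 + 16 = 31
        System.out.println((spread("DFHXR") & oldCap) == 0); // prints false
    }
}
```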

And now, finally, the question: that piece of code in resize is carefully written so that it preserves the order the entries had in that bin.

So after you put those 9 keys in the HashMap the order is going to be:

DFHXR -> TUDDY -> RUTWZ -> WFCVW -> GCVUR (one bin)

YSXFJ -> AXVUH -> DEDUC -> ZETCU (another bin)

Why would you want to preserve the order of some entries in the HashMap? Order in a Map is really bad, as detailed here or here.

The design consideration has been documented within the same source file, in a code comment around line 211:

 * When bin lists are treeified, split, or untreeified, we keep
 * them in the same relative access/traversal order (ie, field
 * Node.next) to better preserve locality, and to slightly
 * simplify handling of splits and traversals that invoke
 * iterator.remove. When using comparators on insertion, to keep a
 * total ordering (or as close as is required here) across
 * rebalancings, we compare classes and identityHashCodes as
 * tie-breakers.

Since removing mappings via an iterator can't trigger a resize, the reasons to retain the order specifically in resize are "to better preserve locality, and to slightly simplify handling of splits", as well as staying consistent with the overall policy.

Order in a Map is really bad [...]

It's not bad; it's (in academic terminology) whatever. Here's what Stuart Marks wrote at the first link you posted:

[...] preserve flexibility for future implementation changes [...]

Which means (as I understand it) that the implementation currently happens to keep the order, but if a better implementation is found in the future, it will be used whether or not it keeps the order.

There are two common reasons for maintaining order in bins implemented as a linked list:

One is to maintain order by increasing (or decreasing) hash value. That means when searching a bin you can stop as soon as the current item's hash is greater (or less, as applicable) than the hash being searched for.
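A minimal sketch of that first approach - note this is NOT what java.util.HashMap does; the Node and SortedBin names are hypothetical:

```java
// Sketch (not java.util.HashMap's behavior): if each bin kept its
// nodes sorted by hash, a lookup could stop as soon as it passes
// the target hash instead of scanning the whole chain.
public class SortedBin {
    static final class Node {
        final int hash; final String key; Node next;
        Node(int hash, String key, Node next) {
            this.hash = hash; this.key = key; this.next = next;
        }
    }

    // Returns true if a node with the given hash/key is in the chain,
    // stopping early once hashes exceed the target.
    static boolean contains(Node head, int hash, String key) {
        for (Node n = head; n != null && n.hash <= hash; n = n.next)
            if (n.hash == hash && n.key.equals(key))
                return true;
        return false; // ran off the end, or passed where the key would be
    }
}
```

On a miss, the search inspects only the prefix of the chain with hashes up to the target, rather than the full list.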

Another approach involves moving entries to the front (or nearer the front) of the bucket when accessed or just adding them to the front. That suits situations where the probability of an element being accessed is high if it has just been accessed.
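The active (move-on-access) variant of that second approach can be sketched as follows - again not HashMap's behavior, and the names are mine:

```java
// Sketch of the move-to-front heuristic described above (not
// java.util.HashMap's behavior): on a successful lookup, relink the
// found node to the head of its bin so a repeat access finds it first.
public class MtfBin {
    static final class Node {
        final String key; final String value; Node next;
        Node(String key, String value, Node next) {
            this.key = key; this.value = value; this.next = next;
        }
    }

    // Looks up key in the chain and returns the (possibly new) head,
    // promoting the hit to the front of the chain.
    static Node accessAndPromote(Node head, String key) {
        Node prev = null;
        for (Node n = head; n != null; prev = n, n = n.next) {
            if (n.key.equals(key)) {
                if (prev != null) {   // unlink n and relink it at the head
                    prev.next = n.next;
                    n.next = head;
                }
                return n;             // n is now the head
            }
        }
        return head;                  // miss: chain unchanged
    }
}
```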

I've looked at the source for JDK 8 and it appears to be (at least for the most part) doing the passive version of the latter (add to front):

http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/util/HashMap.java

While it's true that you should never rely on iteration order from containers that don't guarantee it, that doesn't mean it can't be exploited for performance if it's structural. Also note that the implementation of a class is in a privileged position to exploit details of its own implementation in a formal way that a user of that class should not.

If you look at the source, understand how it's implemented, and exploit it, you're taking a risk. If the implementer does it, that's a different matter!

Note: I have an implementation of an algorithm that relies heavily on a hash table, called Hashlife. It uses this model: have a hash table whose size is a power of two, because (a) you can get the entry by bit-masking (& mask) rather than by division and (b) rehashing is simplified because you only ever 'unzip' hash bins.

Benchmarking shows the algorithm gaining around 20% by actively moving patterns to the front of their bin when accessed.

The algorithm pretty much exploits repeating structures in cellular automata, which are common so if you've seen a pattern the chances of seeing it again are high.
