
Hashmap loadfactor - based on number of buckets occupied or number of entries in all buckets?

I was trying to understand whether a HashMap rehashes when the number of occupied buckets crosses the threshold, or when the total number of entries across all buckets does. For example, we know that with the default load factor (0.75) and initial capacity (16), once 12 of the 16 buckets are full (one entry in each bucket), the map will be rehashed on the next insertion. But what about the case where only 3 buckets are occupied, with 4 entries each (12 entries total, but only 3 of the 16 buckets in use)?

So I tried to reproduce this by writing the worst possible hash function, one that puts all entries into a single bucket.

Here is my code.

class X {

    public Integer value;

    public X(Integer value) {
        this.value = value;
    }

    // Deliberately the worst possible hash function:
    // every X lands in the same bucket.
    @Override
    public int hashCode() {
        return 1;
    }

    @Override
    public boolean equals(Object obj) {
        return obj instanceof X && this.value.equals(((X) obj).value);
    }

}

Now I started putting values in the HashMap.

HashMap<X, Integer> map = new HashMap<>();
map.put(new X(1), 1);
map.put(new X(2), 2);
map.put(new X(3), 3);
map.put(new X(4), 4);
map.put(new X(5), 5);
map.put(new X(6), 6);
map.put(new X(7), 7);
map.put(new X(8), 8);
map.put(new X(9), 9);
map.put(new X(10), 10);
map.put(new X(11), 11);
map.put(new X(12), 12);
map.put(new X(13), 13);
System.out.println(map.size());

All the nodes went into a single bucket as expected, but I noticed that on the 9th entry, the HashMap rehashed and doubled its capacity. Then on the 10th entry, it doubled its capacity again.

[Screenshots: the map's capacity with 8 entries vs. with 9 entries]

Can anyone explain how this is happening?

Thanks in advance.

Responding to the comments more than the question itself, since your comments are more relevant to what you actually want to know.

The best and most relevant answer to where this rehashing based on bucket size happens is the source code itself. What you observe on the 9th entry is expected, and happens in this portion of the code:

for (int binCount = 0; ; ++binCount) {
    // some irrelevant lines skipped
    if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
        treeifyBin(tab, hash);
    break;
}

where TREEIFY_THRESHOLD = 8 and binCount counts the nodes already present in the bin (not the number of bins).

That treeifyBin method name is a bit misleading, since it might resize rather than treeify a bin. This is the relevant part of the code from that method:

if (tab == null || (n = tab.length) < MIN_TREEIFY_CAPACITY)
    resize();

Notice that it will actually resize (read: double its size) and not build a tree until MIN_TREEIFY_CAPACITY (64) is reached.
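That decision can be sketched as a small model (a simplification for illustration only, not the actual JDK code; the constant values match the OpenJDK source):

```java
public class BinAction {
    static final int TREEIFY_THRESHOLD = 8;
    static final int MIN_TREEIFY_CAPACITY = 64;

    // Models what HashMap does once a bin has grown past the treeify
    // threshold: small tables are resized instead of treeified.
    static String actionAfterInsert(int nodesInBinAfterInsert, int tableCapacity) {
        if (nodesInBinAfterInsert > TREEIFY_THRESHOLD) {
            return tableCapacity < MIN_TREEIFY_CAPACITY ? "resize" : "treeify";
        }
        return "none";
    }

    public static void main(String[] args) {
        // The OP's 9th entry into one bucket, table capacity still 16:
        System.out.println(actionAfterInsert(9, 16)); // prints resize
        // Same collision, but the table already reached capacity 64:
        System.out.println(actionAfterInsert(9, 64)); // prints treeify
    }
}
```

This is why the OP sees the capacity double on the 9th (and again on the 10th) insertion: each colliding insert re-triggers the resize branch until the table reaches 64.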

In a HashMap, entries end up in the same bucket if their hash codes map to the same index. If unique Integer objects are placed in a HashMap, their hash codes will all differ, because Integer.hashCode() simply returns the wrapped int value.
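A quick check of that documented behavior:

```java
public class IntegerHash {
    public static void main(String[] args) {
        // Integer.hashCode() is documented to return the primitive int value
        System.out.println(Integer.valueOf(42).hashCode()); // prints 42
        System.out.println(Integer.valueOf(-7).hashCode()); // prints -7
    }
}
```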

But in your case all the objects have the same hash code, which means, as you designed it, all entries end up in a single bucket. Each bucket is backed by a specific data structure (linked list or tree), and the capacity changes according to that data structure and the state of the map.

I ran JB Nizet's code ( https://gist.github.com/jnizet/34ca08ba0314c8e857ea9a161c175f13/revisions ), mentioned in the comments, with loop limits 64 and 128 (adding 64 and 128 elements):

  1. While adding 64 elements: the capacity doubled (to 128) while adding the 49th element to the HashMap (because the load factor is 0.75 and 64 * 0.75 = 48).
  2. While adding 128 elements: the capacity doubled (to 256) while adding the 97th element (again because 128 * 0.75 = 96).

After the capacity grows to 64, the HashMap behaves normally.

In summary, a bucket uses a linked list up to a certain length (8 elements). After that it switches to a tree data structure (which is when the capacity fluctuates). The reason is that accessing a tree structure (O(log n)) is faster than a linked list (O(n)).
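From the outside, all of this stays transparent. A sketch (the class name BadKey is made up for illustration): keys with a deliberately constant hash code all collide, yet lookups keep working; implementing Comparable lets tree bins order the colliding keys by value rather than falling back to identity-based tie-breaking.

```java
import java.util.HashMap;

// Key with a deliberately constant hashCode; Comparable lets a tree bin
// order colliding entries by key value.
class BadKey implements Comparable<BadKey> {
    final int value;

    BadKey(int value) { this.value = value; }

    @Override public int hashCode() { return 1; }

    @Override public boolean equals(Object o) {
        return o instanceof BadKey && ((BadKey) o).value == value;
    }

    @Override public int compareTo(BadKey other) {
        return Integer.compare(value, other.value);
    }
}

public class TreeBinDemo {
    public static void main(String[] args) {
        HashMap<BadKey, Integer> map = new HashMap<>();
        for (int i = 0; i < 100; i++) {
            map.put(new BadKey(i), i);
        }
        // All 100 entries collide into one bucket, but lookups still
        // work correctly (O(log n) once the bin has been treeified).
        System.out.println(map.get(new BadKey(57))); // prints 57
    }
}
```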

[Image: the internal array of a Java 8 HashMap]

This picture shows the inner array of a Java 8 HashMap with both trees (at bucket 0) and linked lists (at buckets 1, 2 and 3). Bucket 0 is a tree because it has more than 8 nodes (read more).

The documentation on HashMap and discussions of buckets in a HashMap would be helpful in this regard.

Read the source code of HashMap,

/**
 * The smallest table capacity for which bins may be treeified.
 * (Otherwise the table is resized if too many nodes in a bin.)
 * Should be at least 4 * TREEIFY_THRESHOLD to avoid conflicts
 * between resizing and treeification thresholds.
 */
static final int MIN_TREEIFY_CAPACITY = 64;

and you will see:

  1. If the capacity has not reached MIN_TREEIFY_CAPACITY (64) and the nodes in a single bucket reach TREEIFY_THRESHOLD, the map will resize.
  2. If the capacity has reached MIN_TREEIFY_CAPACITY (64) and the nodes in a single bucket reach TREEIFY_THRESHOLD, the map will treeify the nodes in that bucket (aka "bins" in the source code).

Resize and treeify are the two operations that can reorganize the map, and choosing between them based on the scenario is a tradeoff.

The simple mathematical formula behind the load check is n / b, where n is the number of entries in the map and b is the total number of buckets (the capacity), regardless of how many of those buckets are actually occupied; the map resizes when this ratio would exceed the load factor.
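A small sketch of that arithmetic (the helper name is made up for illustration; the real HashMap stores a precomputed threshold field instead of recomputing the ratio):

```java
public class LoadFactorDemo {
    // Resize threshold = capacity * loadFactor; the map resizes when
    // its size grows past this value.
    static int resizeThreshold(int capacity, float loadFactor) {
        return (int) (capacity * loadFactor);
    }

    public static void main(String[] args) {
        // With the defaults (capacity 16, load factor 0.75) the map
        // resizes when the 13th entry is added: 16 * 0.75 = 12
        System.out.println(resizeThreshold(16, 0.75f)); // prints 12
        System.out.println(resizeThreshold(64, 0.75f)); // prints 48
    }
}
```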
