
What is the significance of load factor in HashMap?

HashMap has two important properties: size and load factor. I went through the Java documentation and it says 0.75f is the initial load factor. But I can't find its actual use.

Can someone describe the different scenarios where we need to set the load factor, and some sample ideal values for different cases?

The documentation explains it pretty well:

An instance of HashMap has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.

As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put). The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.

As with all performance optimizations, it is a good idea to avoid optimizing things prematurely (i.e. without hard data on where the bottlenecks are).
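The "no rehash will ever occur" rule above can be applied directly when the number of entries is known up front. A minimal sketch (the helper name `capacityFor` is my own, not a JDK API):

```java
import java.util.HashMap;
import java.util.Map;

public class PresizedMap {
    // Smallest initial capacity such that expectedEntries fits
    // under the given load factor without triggering a rehash.
    static int capacityFor(int expectedEntries, float loadFactor) {
        return (int) Math.ceil(expectedEntries / loadFactor);
    }

    public static void main(String[] args) {
        int expected = 1000;
        // 1000 / 0.75 rounds up to 1334 requested buckets; HashMap then
        // rounds that up to the next power of two (2048) internally.
        Map<String, Integer> map = new HashMap<>(capacityFor(expected, 0.75f));
        for (int i = 0; i < expected; i++) {
            map.put("key" + i, i);   // no rehash occurs during these puts
        }
        System.out.println(map.size());  // 1000
    }
}
```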

The default initial capacity of HashMap is 16 and its load factor is 0.75f (i.e. 75% of the current capacity). The load factor represents at what fill level the HashMap capacity should be doubled.

For example, the product of capacity and load factor is 16 * 0.75 = 12. This means that after storing the 12th key–value pair into the HashMap, its capacity becomes 32.
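That threshold arithmetic can be spelled out directly; the resize is triggered once the entry count passes capacity × loadFactor:

```java
public class Threshold {
    public static void main(String[] args) {
        int capacity = 16;          // default initial capacity
        float loadFactor = 0.75f;   // default load factor
        // Number of entries the table holds before it is doubled.
        int threshold = (int) (capacity * loadFactor);
        System.out.println(threshold);       // 12
        System.out.println(capacity * 2);    // 32: capacity after the resize
    }
}
```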

Actually, from my calculations, the "perfect" load factor is closer to log 2 (~0.693), although any load factor less than this will yield better performance. I think that .75 was probably pulled out of a hat.

Proof:

Chaining can be avoided, and branch prediction exploited, by predicting whether a bucket is empty or not. A bucket is probably empty if the probability of it being empty exceeds .5.

Let s represent the size and n the number of keys added. Using the binomial theorem, the probability of a bucket being empty is:

P(0) = C(n, 0) * (1/s)^0 * (1 - 1/s)^(n - 0)

Thus, a bucket is probably empty if there are fewer than

log(2)/log(s/(s - 1)) keys

As s reaches infinity, if the number of keys added is such that P(0) = .5, then n/s approaches log(2) rapidly:

lim (log(2)/log(s/(s - 1)))/s as s -> infinity = log(2) ~ 0.693...
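This limit can be checked numerically. The sketch below (class and method names are my own) evaluates n/s = log(2) / (s · log(s/(s−1))) for growing s and shows it converging to ln 2:

```java
public class LoadFactorLimit {
    // The fill ratio n/s at which a bucket is empty with
    // probability exactly 1/2, for a table of s buckets.
    static double criticalRatio(double s) {
        return Math.log(2) / Math.log(s / (s - 1)) / s;
    }

    public static void main(String[] args) {
        for (double s : new double[] {10, 100, 1000, 1_000_000}) {
            System.out.printf("s = %9.0f   n/s = %.6f%n", s, criticalRatio(s));
        }
        // The printed ratios climb toward ln 2 ≈ 0.693147 as s grows.
    }
}
```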

What is load factor?

The amount of capacity which is to be exhausted before the HashMap increases its capacity.

Why load factor?

The load factor is by default 0.75 of the initial capacity (16); therefore 25% of the buckets will still be free before the capacity is increased, and this makes many new buckets, with new hashcodes pointing to them, available just after the number of buckets is increased.

Now why should you keep many free buckets, and what is the impact of keeping free buckets on performance?

If you set the load factor to, say, 1.0, then something very interesting might happen.

Say you are adding an object x to your hashmap whose hashCode is 888, and in your hashmap the bucket representing that hashcode is free, so the object x gets added to the bucket. Now say you add another object y whose hashCode is also 888; your object y will get added for sure, but at the end of the bucket (because the buckets are nothing but a linkedList implementation storing key, value & next), and this has a performance impact! Since your object y is no longer present at the head of the bucket, if you perform a lookup the time taken is not going to be O(1); this time it depends on how many items there are in the same bucket. This is called a hash collision, by the way, and it happens even when your load factor is less than 1.
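The collision scenario can be reproduced with a deliberately bad key class. `BadKey` here is hypothetical, built only for illustration; its constant hashCode of 888 forces every instance into the same bucket:

```java
import java.util.HashMap;
import java.util.Map;

public class CollisionDemo {
    // Every instance reports hashCode 888, so all keys collide.
    static final class BadKey {
        final String name;
        BadKey(String name) { this.name = name; }
        @Override public int hashCode() { return 888; }
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).name.equals(name);
        }
    }

    public static void main(String[] args) {
        Map<BadKey, Integer> map = new HashMap<>();
        map.put(new BadKey("x"), 1);
        map.put(new BadKey("y"), 2);   // collides with x; chained in the same bucket
        // Both lookups still succeed, but each one walks the shared bucket,
        // so the cost grows with the number of colliding keys.
        System.out.println(map.get(new BadKey("y")));  // 2
    }
}
```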

Correlation between performance, hash collision & load factor?

Lower load factor = more free buckets = less chance of collision = high performance = high space requirement.

Correct me if I am wrong somewhere.

From the documentation:

The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased.

It really depends on your particular requirements; there's no "rule of thumb" for specifying an initial load factor.

For HashMap, DEFAULT_INITIAL_CAPACITY = 16 and DEFAULT_LOAD_FACTOR = 0.75f. This means that the MAX number of ALL entries in the HashMap = 16 * 0.75 = 12. When the thirteenth element is added, the capacity (array size) of the HashMap will be doubled! A perfect illustration answering this question (the image is taken from here):

https://javabypatel.blogspot.com/2015/10/what-is-load-factor-and-rehashing-in-hashmap.html

If the buckets get too full, then we have to look through a very long linked list. And that's kind of defeating the point.

So here's an example where I have four buckets. I have elephant and badger in my HashSet so far. This is a pretty good situation, right? Each bucket has zero or one elements. Now we put two more elements into our HashSet.

     buckets      elements
     -------      --------
        0         elephant
        1         otter
        2         badger
        3         cat

This isn't too bad either. Every bucket only has one element. So if I wanna know, does this contain panda? I can very quickly look at bucket number 1, and it's not there, and I know it's not in our collection. If I wanna know if it contains cat, I look at bucket number 3, I find cat, and I very quickly know if it's in our collection.

What if I add koala? Well, that's not so bad.

     buckets      elements
     -------      --------
        0         elephant
        1         otter -> koala
        2         badger
        3         cat

Maybe now, instead of only looking at one element in bucket number 1, I need to look at two. But at least I don't have to look at elephant, badger and cat. If I'm again looking for panda, it can only be in bucket number 1, and I don't have to look at anything other than otter and koala.

But now I put alligator in bucket number 1, and you can maybe see where this is going. If bucket number 1 keeps getting bigger and bigger and bigger, then I'm basically having to look through all of those elements to find something that should be in bucket number 1.

     buckets      elements
     -------      --------
        0         elephant
        1         otter -> koala -> alligator
        2         badger
        3         cat

If I start adding strings to other buckets, right, the problem just gets bigger and bigger in every single bucket. How do we stop our buckets from getting too full?

The solution here is that

          "the HashSet can automatically
        resize the number of buckets."

The HashSet realizes that the buckets are getting too full. It's losing the advantage of the all-in-one lookup for elements. And it'll just create more buckets (generally twice as many as before) and then place the elements into the correct bucket.

So here's our basic HashSet implementation with separate chaining. Now I'm going to create a "self-resizing HashSet".

This HashSet is going to realize that the buckets are getting too full and it needs more buckets. loadFactor is another field in our HashSet class. loadFactor represents the average number of elements per bucket, above which we want to resize. loadFactor is a balance between space and time. If the buckets get too full, then we'll resize. That takes time, of course, but it may save us time down the road if the buckets are a little more empty.

Let's see an example. Here's a HashSet; we've added four elements so far: elephant, dog, cat and fish.

     buckets      elements
     -------      --------
        0
        1         elephant
        2         cat -> dog
        3         fish
        4
        5

At this point, I've decided that the loadFactor, the threshold, the average number of elements per bucket that I'm okay with, is 0.75. The number of buckets is buckets.length, which is 6, and at this point our HashSet has four elements, so the current size is 4.

We'll resize our HashSet, that is, we'll add more buckets, when the average number of elements per bucket exceeds the loadFactor. That is, when the current size divided by buckets.length is greater than loadFactor. At this point, the average number of elements per bucket is 4 divided by 6. 4 elements, 6 buckets, that's 0.67. That's less than the threshold I set of 0.75, so we're okay. We don't need to resize.

But now let's say we add woodchuck.

     buckets      elements
     -------      --------
        0
        1         elephant
        2         woodchuck -> cat -> dog
        3         fish
        4
        5

Woodchuck would end up in bucket number 3. At this point, the currentSize is 5. And now the average number of elements per bucket is the currentSize divided by buckets.length. That's 5 elements divided by 6 buckets, which is 0.83. And this exceeds the loadFactor, which was 0.75.

In order to address this problem, in order to make the buckets perhaps a little more empty, so that operations like determining whether a bucket contains an element will be a little less complex, I wanna resize my HashSet.

Resizing the HashSet takes two steps. First I'll double the number of buckets: I had 6 buckets, now I'm going to have 12 buckets. Note here that the loadFactor, which I set to 0.75, stays the same. But the number of buckets changed to 12, while the number of elements stayed the same at 5. 5 divided by 12 is around 0.42; that's well under our loadFactor, so we're okay now.

But we're not done, because some of these elements are now in the wrong bucket. For instance, elephant. Elephant was in bucket number 2 because the number of characters in elephant was 8. We have 6 buckets, and 8 mod 6 is 2. That's why it ended up in number 2. But now that we have 12 buckets, 8 mod 12 is 8, so elephant does not belong in bucket number 2 anymore. Elephant belongs in bucket number 8.

What about woodchuck? Woodchuck was the one that started this whole problem. Woodchuck ended up in bucket number 3, because 9 mod 6 is 3. But now we do 9 mod 12. 9 mod 12 is 9; woodchuck goes to bucket number 9.

And you see the advantage of all this: now bucket number 3 only has two elements, whereas before it had 3.
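The bucket arithmetic in this example can be sketched as a tiny helper. Note the hash-by-string-length scheme is the transcript's teaching device, not HashSet's real hashing:

```java
public class RehashIndex {
    // Toy hash: a string's bucket is its length mod the bucket count.
    static int bucketIndex(String key, int buckets) {
        return key.length() % buckets;
    }

    public static void main(String[] args) {
        System.out.println(bucketIndex("elephant", 6));    // 8 mod 6  = 2
        System.out.println(bucketIndex("elephant", 12));   // 8 mod 12 = 8
        System.out.println(bucketIndex("woodchuck", 6));   // 9 mod 6  = 3
        System.out.println(bucketIndex("woodchuck", 12));  // 9 mod 12 = 9
    }
}
```

This is why every element must be reinserted after a resize: the bucket index depends on the bucket count, which has just changed.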

So here's our code, where we had our HashSet with separate chaining that didn't do any resizing. Now, here's a new implementation where we use resizing. Most of this code is the same: we're still going to determine whether it contains the value already. If it doesn't, then we'll figure out which bucket it should go into and then add it to that bucket, add it to that LinkedList.

But now we increment the currentSize field. currentSize was the field that kept track of the number of elements in our HashSet. We're going to increment it, and then we're going to look at the average load, the average number of elements per bucket. We'll do that division down here. We have to do a little bit of casting here to make sure that we get a double. And then, we'll compare that average load to the field that I set as 0.75 when I created this HashSet, for instance, which was the loadFactor. If the average load is greater than the loadFactor, that means there are too many elements per bucket on average, and I need to reinsert.

So here's our implementation of the method to reinsert all the elements. First, I'll create a local variable called oldBuckets, which refers to the buckets as they currently stand before I start resizing everything. Note I'm not creating a new array of linked lists just yet; I'm just renaming buckets as oldBuckets. Now remember, buckets was a field in our class; I'm going to now create a new array of linked lists, but this will have twice as many elements as it did the first time.

Now I need to actually do the reinserting. I'm going to iterate through all of the old buckets. Each element in oldBuckets is a LinkedList of strings; that is a bucket. I'll go through that bucket and get each element in that bucket. And now I'm gonna reinsert it into the newBuckets. I will get its hashCode. I will figure out which index it is. And now I get the new bucket, the new LinkedList of strings, and I'll add it to that new bucket.
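The walkthrough above can be condensed into a runnable sketch. All names here (buckets, currentSize, loadFactor, reinsertAll) are my reconstruction of the code the transcript describes, not the exact original:

```java
import java.util.LinkedList;

// Minimal separate-chaining HashSet of Strings with the resizing
// behavior described in the transcript.
public class ResizingHashSet {
    private LinkedList<String>[] buckets;
    private int currentSize = 0;
    private final double loadFactor;

    @SuppressWarnings("unchecked")
    public ResizingHashSet(int initialBuckets, double loadFactor) {
        this.buckets = new LinkedList[initialBuckets];
        for (int i = 0; i < initialBuckets; i++) buckets[i] = new LinkedList<>();
        this.loadFactor = loadFactor;
    }

    private int indexFor(String value, int numBuckets) {
        return Math.floorMod(value.hashCode(), numBuckets);
    }

    public boolean contains(String value) {
        return buckets[indexFor(value, buckets.length)].contains(value);
    }

    public void add(String value) {
        if (contains(value)) return;
        buckets[indexFor(value, buckets.length)].add(value);
        currentSize++;
        // Resize when the average number of elements per bucket
        // exceeds the load factor (note the cast to get a double).
        if ((double) currentSize / buckets.length > loadFactor) {
            reinsertAll();
        }
    }

    @SuppressWarnings("unchecked")
    private void reinsertAll() {
        LinkedList<String>[] oldBuckets = buckets;       // keep the old array
        buckets = new LinkedList[oldBuckets.length * 2]; // double the buckets
        for (int i = 0; i < buckets.length; i++) buckets[i] = new LinkedList<>();
        for (LinkedList<String> bucket : oldBuckets) {
            for (String value : bucket) {
                // Recompute each index against the new bucket count.
                buckets[indexFor(value, buckets.length)].add(value);
            }
        }
    }

    public int size() { return currentSize; }
    public int bucketCount() { return buckets.length; }
}
```

With 6 buckets and a loadFactor of 0.75, the fifth add pushes the average load to 5/6 ≈ 0.83 and triggers a resize to 12 buckets, matching the worked example above.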

So to recap, HashSets as we've seen are arrays of Linked Lists, or buckets. A self-resizing HashSet can resize using some ratio or load factor.

I would pick a table size of n * 1.5, or n + (n >> 1); this would give a load factor of 0.66666~ without division, which is slow on most systems, especially on portable systems that have no hardware division.

A complete understanding of load factor and rehashing is available here.


 