C++ unordered_map collision handling, resize and rehash

I have not read the C++ standard, but this is how I believe C++'s unordered_map is supposed to work:

  • Allocate a memory block in the heap.
  • With every put request, hash the object and map it to a space in this memory
  • During this process, handle collisions via chaining or open addressing.

I am quite surprised that I could not find much about how memory is handled by unordered_map. Is there a specific initial amount of memory that unordered_map allocates? What happens if, say, we allocate memory for 50 ints but end up inserting 5000 integers?

That would mean a lot of collisions, so I believe there should be some kind of rehashing and resizing algorithm to decrease the number of collisions once a certain collision threshold is reached. Since they are explicitly provided as member functions of the class, I assume they are used internally as well. Is there such a mechanism?

With every put request, hash the object and map it to a space in this memory

Unfortunately, this isn't exactly true. You are referring to an open addressing (closed hashing) data structure, which is not how unordered_map is specified.

Every unordered_map implementation stores a linked list of externally allocated nodes, with the array of buckets pointing into that list. That means inserting an item always allocates at least once (the new node), if not twice (when the array of buckets is resized, then for the new node).
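
If you want to see those allocations for yourself, a small counting allocator makes them visible (CountingAlloc below is just an illustrative helper made up for this answer; the exact counts are implementation-specific):

    #include <cstddef>
    #include <cstdio>
    #include <memory>
    #include <unordered_map>

    static std::size_t g_allocs = 0;   // number of heap allocations observed

    template <class T>
    struct CountingAlloc {
        using value_type = T;
        CountingAlloc() = default;
        template <class U> CountingAlloc(const CountingAlloc<U>&) {}
        T* allocate(std::size_t n) { ++g_allocs; return std::allocator<T>{}.allocate(n); }
        void deallocate(T* p, std::size_t n) { std::allocator<T>{}.deallocate(p, n); }
    };
    template <class T, class U> bool operator==(const CountingAlloc<T>&, const CountingAlloc<U>&) { return true; }
    template <class T, class U> bool operator!=(const CountingAlloc<T>&, const CountingAlloc<U>&) { return false; }

    int main() {
        std::unordered_map<int, int, std::hash<int>, std::equal_to<int>,
                           CountingAlloc<std::pair<const int, int>>> m;
        for (int i = 0; i < 16; ++i) {
            const std::size_t before = g_allocs;
            m.emplace(i, i);
            // Typically 1 allocation (the new node), occasionally 2
            // (a bigger bucket array plus the new node).
            std::printf("insert %d: %zu allocation(s)\n", i, g_allocs - before);
        }
    }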

No, that is not at all the most efficient way to implement a hash map for most common uses. Unfortunately, a small "oversight" in the specification of unordered_map all but requires this behavior. The required behavior is that pointers and references to elements must stay valid when inserting or deleting other elements (and iterators must stay valid unless a rehash occurs). Because inserting might cause the bucket array to grow (reallocate), it is not generally possible to store elements directly in the bucket array and still meet those stability guarantees.
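
You can observe that stability guarantee directly: a pointer to a stored value stays valid even after enough insertions to force the bucket array to be reallocated several times. A minimal illustration:

    #include <cassert>
    #include <string>
    #include <unordered_map>

    int main() {
        std::unordered_map<int, std::string> m;
        m[42] = "answer";
        std::string* p = &m[42];            // pointer to a stored value

        for (int i = 0; i < 100000; ++i)    // plenty of insertions, forcing several rehashes
            m.emplace(i, "x");

        assert(p == &m[42]);                // still valid: the node never moved
        assert(*p == "answer");
    }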

unordered_map is a better data structure if you are storing expensive-to-copy items as your key or value. Which makes sense, given that its general design was lifted from Boost's pre-move-semantics design.

Chandler Carruth (Google) mentions this problem in his CppCon '14 talk "Efficiency with Algorithms, Performance with Data Structures" .

std::unordered_map maintains a load factor that it uses to manage the number of its internal buckets: by default it keeps the load factor (the average number of elements per bucket) at or below 1.0. This decreases the likelihood of a collision in a bucket. When collisions within a bucket do occur, the colliding elements are chained together in that bucket's list rather than handled by linear probing.
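
For example, you can lower max_load_factor() to keep buckets even more sparsely populated, and the container will rehash as needed to honour it (a small illustration; the default factor is 1.0):

    #include <cassert>
    #include <unordered_map>

    int main() {
        std::unordered_map<int, int> m;
        m.max_load_factor(0.5f);   // aim for at most ~0.5 elements per bucket on average

        for (int i = 0; i < 10000; ++i) {
            m[i] = i;
            // The container grows its bucket array as needed to respect this bound.
            assert(m.load_factor() <= m.max_load_factor());
        }
    }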

Allocate a memory block in the heap.

True - there's a block of memory for an array of "buckets", which in the case of GCC are actually iterators capable of recording a place in a forward-linked list.

With every put request, hash the object and map it to a space in this memory

No... when you insert/emplace further items into the list, an additional dynamic (i.e. heap) allocation is done with space for the node's next link and the value being inserted/emplaced. The linked list is rewired accordingly, so the newly inserted element is linked to and/or from the other elements that hashed to the same bucket, and if other buckets also have elements, that group will be linked to and/or from the nodes for those elements.

At some point, the hash table content might look like this (GCC does things this way, but it's possible to do something simpler):

           +------->  head
          /            |
bucket#  /            #503
[0]----\/              |
[1]    /\      /===> #1003
[2]===/==\====/        |
[3]--/    \     /==>  #22
[4]        \   /       |
[5]         \ /        #7
[6]          \         |
[7]=========/ \-----> #177
[8]                    |
[9]                   #100
                   
  • The buckets on the left are the array from the original allocation: there are 10 elements in the illustrated array, so "bucket_count()" == 10.

  • A key with hash value X - denoted #X, e.g. #177 - hashes to bucket X % bucket_count(); that bucket needs to store an iterator to the singly-linked-list element immediately before the first element hashing to that bucket, so that it can remove an element from the bucket by rewiring either head, or another bucket's next pointer, to skip over the erased element.

  • While elements in a bucket need to be contiguous in the forward-linked list, the ordering of buckets within that list is an unimportant consequence of the order of insertion of elements in the container, and isn't stipulated in the Standard.
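
You can poke at this structure through the container's bucket interface; for example (the keys are arbitrary, and which bucket each one lands in is implementation-specific):

    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <unordered_map>

    int main() {
        std::unordered_map<std::string, int> m{
            {"apple", 1}, {"pear", 2}, {"plum", 3}, {"fig", 4}};

        std::cout << "bucket_count() == " << m.bucket_count() << '\n';
        for (std::size_t b = 0; b < m.bucket_count(); ++b) {
            if (m.bucket_size(b) == 0) continue;
            std::cout << "bucket " << b << ':';
            // begin(b)/end(b) walk the chained elements that hashed to bucket b.
            for (auto it = m.begin(b); it != m.end(b); ++it)
                std::cout << ' ' << it->first;
            std::cout << '\n';
        }
        std::cout << "\"fig\" hashes to bucket " << m.bucket("fig") << '\n';
    }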

During this process, handle collisions via chaining or open addressing.

The Standard library containers that are backed by hash tables always use separate chaining.

I am quite surprised that I could not find much about how memory is handled by unordered_map. Is there a specific initial amount of memory that unordered_map allocates?

No, the C++ Standard doesn't dictate what the initial memory allocation should be; it's up to the C++ implementation to choose. You can see how many buckets a newly created table has by printing out .bucket_count(), and in all likelihood if you multiply that by your pointer size you'll get the size of the heap allocation that the unordered container made: myUnorderedContainer.bucket_count() * sizeof(int*). That said, there's no prohibition on your Standard Library implementation varying the initial bucket_count() in arbitrary and bizarre ways (e.g. with optimisation level, or depending on the Key type), but I can't imagine why any would.
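
For example, a quick check (the numbers you get are entirely up to your Standard Library):

    #include <iostream>
    #include <unordered_map>

    int main() {
        std::unordered_map<int, int> m;    // newly created, nothing inserted yet
        std::cout << "initial bucket_count(): " << m.bucket_count() << '\n';
        // Rough, implementation-dependent estimate of the bucket array's heap footprint:
        std::cout << "approx. bucket array size: "
                  << m.bucket_count() * sizeof(int*) << " bytes\n";
    }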

What happens if, say, we allocate memory for 50 ints but end up inserting 5000 integers? That would mean a lot of collisions, so I believe there should be some kind of rehashing and resizing algorithm to decrease the number of collisions once a certain collision threshold is reached.

Rehashing/resizing isn't triggered by a certain number of collisions, but by a certain proneness to collisions, as measured by the load factor, which is .size() / .bucket_count().

When an insertion would push .load_factor() above .max_load_factor() (which you can change, but which the C++ Standard requires to default to 1.0), the hash table is resized. That effectively means it allocates more buckets - normally somewhere close to, but not necessarily exactly, twice as many - then it points the new buckets at the linked list nodes, then finally deletes the heap allocation holding the old buckets.
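
You can watch this happen by printing bucket_count() whenever it changes while inserting (the exact growth pattern is implementation-specific):

    #include <cstddef>
    #include <iostream>
    #include <unordered_map>

    int main() {
        std::unordered_map<int, int> m;
        std::size_t buckets = m.bucket_count();
        for (int i = 0; i < 1000; ++i) {
            m[i] = i;
            if (m.bucket_count() != buckets) {
                // A rehash happened: the insertion would have pushed load_factor()
                // above max_load_factor() (1.0 by default).
                std::cout << "after " << m.size() << " elements: "
                          << buckets << " -> " << m.bucket_count() << " buckets\n";
                buckets = m.bucket_count();
            }
        }
    }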

Since they are explicitly provided as member functions of the class, I assume they are used internally as well. Is there such a mechanism?

There is no C++ Standard requirement about how the resizing is implemented. That said, if I were implementing resize() I'd consider creating a function-local container whilst specifying the newly desired bucket_count, then iterating over the elements in the *this object, calling extract() to detach them and merge() to add them to the function-local container, then eventually invoking swap on *this and the function-local container.
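
A rough C++17 sketch of that idea (not how any real implementation does it; resize_by_splicing is just a name I've made up here, and merge() splices all the nodes across in one go, which amounts to the same thing as extract()-ing them one at a time):

    #include <cstddef>
    #include <unordered_map>

    template <class Map>
    void resize_by_splicing(Map& m, std::size_t new_bucket_count) {
        Map tmp(new_bucket_count);                // function-local container with the desired bucket count
        tmp.max_load_factor(m.max_load_factor());
        tmp.merge(m);                             // splice every node out of m; elements are neither copied nor moved
        m.swap(tmp);                              // m now owns the re-bucketed nodes; tmp (old buckets) is destroyed
    }

    int main() {
        std::unordered_map<int, int> m;
        for (int i = 0; i < 100; ++i) m[i] = i;
        resize_by_splicing(m, 1024);              // m.bucket_count() is now at least 1024
    }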
