Predict resize/rehash of std::unordered_set or std::unordered_map

Is it possible to reliably predict when an insert to std::unordered_set or std::unordered_map will resize the underlying storage and rehash the items?

My program maintains an unordered_set of items that constantly grows, but some items may become 'expired', and I can remove those from the set to save space. A good time to do this is just before inserting an item, in case the insert would cause the set to resize and rehash. (The set will need to scan all of its elements anyway, and I may even prevent it from resizing.)

So far, however, I have not found a way to predict a resize that works across implementations of the standard library. The code below exposes a difference between Microsoft's implementation and libstdc++.

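#include <cstddef>
#include <cstdio>
#include <unordered_set>

int main() {
    std::unordered_set<int> set;
    for (int i = 0; i < 1000; ++i) {
        // Compare the bucket count before and after each insert to detect a rehash.
        std::size_t bucketsBefore = set.bucket_count();
        set.emplace(i);
        std::size_t bucketsAfter = set.bucket_count();
        bool resized = bucketsAfter > bucketsBefore;
        if (resized)
            std::printf("Size from %zu to %zu, buckets from %zu to %zu.\n", set.size() - 1, set.size(), bucketsBefore, bucketsAfter);
    }
}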

When compiled with MSVC on Windows, this prints

Size from 8 to 9, buckets from 8 to 64.
Size from 64 to 65, buckets from 64 to 512.
Size from 512 to 513, buckets from 512 to 1024.

When compiled with g++ on Linux, this prints

Size from 0 to 1, buckets from 1 to 3.
Size from 2 to 3, buckets from 3 to 7.
Size from 6 to 7, buckets from 7 to 17.
Size from 16 to 17, buckets from 17 to 37.
Size from 36 to 37, buckets from 37 to 79.
Size from 78 to 79, buckets from 79 to 167.
Size from 166 to 167, buckets from 167 to 337.
Size from 336 to 337, buckets from 337 to 709.
Size from 708 to 709, buckets from 709 to 1493.

In terms of load factor, this means that Microsoft's implementation resizes the set when the insert would make the load factor exceed 1, whereas libstdc++ resizes as soon as the insert would make the load factor reach 1.

Now I'm wondering what a good way around this is. I see two options.

  1. Remove expired items after a resize. This is the more robust option, but it can never prevent a resize. That's what I do now.
  2. Remove expired items when libstdc++ would perform the resize. This is not a bad idea, but if a third implementation exists that resizes even earlier, e.g., when the load factor reaches 1-epsilon, then on that implementation I would never remove expired items. Given that Microsoft and libstdc++ already treat the load factor differently, I see no reason why such a third implementation could not appear. Or is there one?
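Option 2 can be sketched as a conservative predicate that fires when one more element would bring the load factor to, or past, max_load_factor(). This covers both behaviours in the output above, but it is only a sketch: the names mightRehashOnInsert and countUnpredictedRehashes are hypothetical, and nothing here rules out an implementation that rehashes earlier still.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>

// Hypothetical helper (not part of the standard library): returns true when
// inserting one more element would bring the load factor to, or past,
// max_load_factor(). Using >= instead of > is deliberately conservative, so
// it fires for both the "reaches 1" and the "exceeds 1" behaviour.
template <class Set>
bool mightRehashOnInsert(const Set& s) {
    const double limit = static_cast<double>(s.max_load_factor()) *
                         static_cast<double>(s.bucket_count());
    return static_cast<double>(s.size() + 1) >= limit;
}

// Inserts 0..n-1 into a fresh set and counts the bucket-count changes that
// the helper did not flag in advance.
inline std::size_t countUnpredictedRehashes(int n) {
    assert(n >= 0);
    std::unordered_set<int> set;
    std::size_t missed = 0;
    for (int i = 0; i < n; ++i) {
        const bool predicted = mightRehashOnInsert(set);
        const std::size_t before = set.bucket_count();
        set.emplace(i);
        if (set.bucket_count() != before && !predicted)
            ++missed;
    }
    return missed;
}
```

A result of zero from countUnpredictedRehashes(1000) on a given standard library only shows that the predicate was conservative enough there; the predicate may also fire on inserts that do not end up rehashing.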

You could consider using boost::intrusive::unordered_set and rehashing according to the load factor and the number of expired items.
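The idea of letting the load factor decide when to deal with expired items can be illustrated with plain std::unordered_set. The sketch below is not Boost.Intrusive code and does not take over rehashing the way an intrusive container allows; it only purges expired items before an insert that looks likely to trigger a rehash. Item, ItemHash, purgeExpired and insertWithPurge are hypothetical names, and the expired flag stands in for whatever expiry test the real program uses.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_set>

// Hypothetical item type; identity is the id, `expired` is the expiry state.
struct Item {
    int id;
    bool expired;
    bool operator==(const Item& other) const { return id == other.id; }
};

struct ItemHash {
    std::size_t operator()(const Item& item) const {
        return std::hash<int>()(item.id);
    }
};

using ItemSet = std::unordered_set<Item, ItemHash>;

// Erases every expired item and returns how many were removed.
inline std::size_t purgeExpired(ItemSet& set) {
    const std::size_t before = set.size();
    std::size_t removed = 0;
    for (auto it = set.begin(); it != set.end();) {
        if (it->expired) {
            it = set.erase(it);  // erase returns the iterator to the next element
            ++removed;
        } else {
            ++it;
        }
    }
    assert(set.size() + removed == before);
    return removed;
}

// Purges expired items first whenever one more element would bring the load
// factor to, or past, max_load_factor(); then inserts the new item.
inline void insertWithPurge(ItemSet& set, const Item& item) {
    const double limit = static_cast<double>(set.max_load_factor()) *
                         static_cast<double>(set.bucket_count());
    if (static_cast<double>(set.size() + 1) >= limit)
        purgeExpired(set);
    set.insert(item);
}
```

Because the purge runs before the insert, it can shrink the set far enough that the insert no longer needs more buckets. Whether the library then skips the rehash is still up to the implementation.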
