
Number of buckets of std::unordered_map grows unexpectedly

I'd like to use std::unordered_map as a software cache with a limited capacity. Namely, I set the number of buckets in the constructor (it doesn't matter that it might actually become larger) and insert new data (if not already there) in the following way:

  1. If the bucket where the data belongs is not empty, I replace its node with the inserted data (via the C++17 extraction-insertion pattern).
  2. Otherwise, I simply insert the data.

The minimal example that simulates this approach is as follows:

#include <iostream>
#include <unordered_map>

std::unordered_map<int, int> m(2);

void insert(int a) {
   auto idx = m.bucket(a);
   if (m.bucket_size(idx) > 0) {
      const auto& key = m.begin(idx)->first;
      auto nh = m.extract(key);
      nh.key() = a;
      nh.mapped() = a;
      m.insert(std::move(nh));
   }
   else
      m.insert({a, a});
}

int main() {
   for (int i = 0; i < 1000; i++) {
      auto bc1 = m.bucket_count();
      insert(i);
      auto bc2 = m.bucket_count();
      if (bc1 != bc2) std::cerr << bc2 << std::endl;
   }
}

The problem is that with GCC 8.1 (which is what is available to me in the production environment), the bucket count is not fixed but grows instead; the output reads:

7
17
37
79 
167
337
709
1493

Live demo: https://wandbox.org/permlink/c8nnEU52NsWarmuD

Updated info: the bucket count is always increased in the else branch: https://wandbox.org/permlink/p2JaHNP5008LGIpL.
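(To see for yourself where the rehash is triggered, one option is to instrument the insert function and log the bucket count around each branch; the insert_traced helper below is just an illustrative diagnostic sketch, not part of the original demo.)

void insert_traced(int a) {
   auto idx = m.bucket(a);
   auto before = m.bucket_count();
   if (m.bucket_size(idx) > 0) {
      const auto& key = m.begin(idx)->first;
      auto nh = m.extract(key);
      nh.key() = a;
      nh.mapped() = a;
      m.insert(std::move(nh));
      if (m.bucket_count() != before)
         std::cerr << "rehash in extract/insert branch\n";
   }
   else {
      m.insert({a, a});
      if (m.bucket_count() != before)
         std::cerr << "rehash in else branch\n";
   }
}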

However, when I use GCC 9.1 or Clang 8.0, the bucket count remains fixed (no output is printed in the error stream).

My question is whether this is a bug in the older version of libstdc++, or whether my approach isn't correct and I cannot use std::unordered_map this way.


Moreover, I found out that the problem disappears when I set the max_load_factor to some very high number, such as

m.max_load_factor(1e20f);

But I don't want to rely on such a "fragile" solution in production code.
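(If one does end up relying on this workaround, a hedged sketch of at least making it fail loudly is shown below: pin max_load_factor to a huge value at construction and assert in debug builds that the bucket count never changes afterwards. The make_fixed_bucket_map and assert_bucket_count names are made up for illustration.)

#include <cassert>
#include <cstddef>
#include <limits>
#include <unordered_map>

std::unordered_map<int, int> make_fixed_bucket_map(std::size_t buckets) {
   std::unordered_map<int, int> m(buckets);
   m.max_load_factor(std::numeric_limits<float>::max());  // effectively disable load-factor-driven rehashing
   return m;
}

// Call after insertions in debug builds to catch an unexpected rehash early.
void assert_bucket_count(const std::unordered_map<int, int>& m, std::size_t expected) {
   assert(m.bucket_count() == expected);
}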

Unfortunately, the problem you're having appears to be a bug in older implementations of std::unordered_map. The problem disappears in g++-9, but if you're limited to g++-8, I recommend rolling your own hash-cache.

Rolling our own hash-cache

Thankfully, the kind of cache you want to write is actually simpler than a full hash table, mainly because it's fine if values occasionally get dropped from the table. To see how difficult it'd be, I wrote my own version.

So what's it look like?

Let's say you have an expensive function you want to cache. The Fibonacci function, when written using the naive recursive implementation, is notorious for requiring time that is exponential in the input, because it calls itself twice at each step.

// Uncached version

long long fib(int n) {
    if(n <= 1)
        return n;
    else
        return fib(n - 1) + fib(n - 2); 
}

Let's transform it into the cached version, using the Cache class which I'll show you in a moment. We actually only need to add one line of code to the function:

// Cached version; much faster

long long fib(int n) {
    static auto fib = Cache(::fib, 1024); // fib now refers to the cache, instead of the enclosing function
    if(n <= 1)
        return n;
    else
        return fib(n - 1) + fib(n - 2);   // Invokes cache
}

The first argument is the function you want to cache (in this case, fib itself), and the second argument is the capacity. For n == 40, the uncached version takes 487,000 microseconds to run. And the cached version? Just 16 microseconds to initialize the cache, fill it, and return the value! You can see it run here. After that initial access, retrieving a stored value from the cache takes around 6 nanoseconds.

(If Compiler Explorer shows the assembly instead of the output, click on the tab next to it.)
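(The exact figures above depend on the machine and compiler; a minimal timing harness along these lines, assuming the cached fib and the Cache class below are in scope, is roughly how such numbers can be reproduced.)

#include <chrono>
#include <iostream>

int main() {
    using clock = std::chrono::steady_clock;

    auto t0 = clock::now();
    auto result = fib(40);   // first call: builds the cache and fills it recursively
    auto t1 = clock::now();

    std::cout << "fib(40) = " << result << '\n'
              << "first call took "
              << std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count()
              << " us\n";
}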

How would we write this Cache class?

Here's a compact implementation of it. The Cache class stores the following:

  • An array of bools, which keeps track of which buckets have values
  • An array of keys
  • An array of values
  • A bitmask & hash function
  • A function to calculate values that aren't in the table

In order to calculate a value, we:

  • Check if the key is stored in the table
  • If the key is not in the table, calculate and store the value
  • Return the stored value

Here's the code:

#include <algorithm>
#include <cstddef>
#include <functional>
#include <memory>

template<class Key, class Value, class Func>
class Cache {
    // Round the requested capacity up to a power of two and return it
    // minus one, so it can be used as a bitmask over the hash value.
    static size_t calc_mask(size_t min_cap) {
        size_t actual_cap = 1;
        while(actual_cap <= min_cap) {
            actual_cap *= 2;
        }
        return actual_cap - 1; 
    }
    size_t mask = 0;
    std::unique_ptr<bool[]> isEmpty; 
    std::unique_ptr<Key[]> keys;
    std::unique_ptr<Value[]> values;
    std::hash<Key> hash;
    Func func; 
   public:
    Cache(Cache const& c) 
      : mask(c.mask)
      , isEmpty(new bool[mask + 1])
      , keys(new Key[mask + 1])
      , values(new Value[mask + 1])
      , hash(c.hash)
      , func(c.func)
    {
        std::copy_n(c.isEmpty.get(), capacity(), isEmpty.get());
        std::copy_n(c.keys.get(), capacity(), keys.get());
        std::copy_n(c.values.get(), capacity(), values.get());
    }
    Cache(Cache&&) = default; 
    Cache(Func func, size_t cap)
      : mask(calc_mask(cap))
      , isEmpty(new bool[mask + 1])
      , keys(new Key[mask + 1])
      , values(new Value[mask + 1])
      , hash()
      , func(func) {
        std::fill_n(isEmpty.get(), capacity(), true); 
    }
    Cache(Func func, size_t cap, std::hash<Key> const& hash)
      : mask(calc_mask(cap))
      , isEmpty(new bool[mask + 1])
      , keys(new Key[mask + 1])
      , values(new Value[mask + 1])
      , hash(hash)
      , func(func) {
        std::fill_n(isEmpty.get(), capacity(), true); 
    }
    

    // Look up key; on a miss (or when a different key occupies the slot),
    // recompute the value with func and overwrite the slot.
    Value operator()(Key const& key) const {
        size_t index = hash(key) & mask;
        auto& value = values[index]; 
        auto& old_key = keys[index]; 
        if(isEmpty[index] || old_key != key) {
            old_key = key; 
            value = func(key); 
            isEmpty[index] = false; 
        }
        return value;
    }
    size_t capacity() const {
        return mask + 1;
    }
};
template<class Key, class Value>
Cache(Value(*)(Key), size_t) -> Cache<Key, Value, Value(*)(Key)>; 
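
As a quick usage sketch (not part of the original answer), the class can be wrapped around any free function; the slow_square function below is made up purely for illustration and assumes the Cache class above is in scope.

#include <iostream>

long long slow_square(int n) {
    std::cout << "computing " << n << "...\n";  // visible side effect on a cache miss
    return static_cast<long long>(n) * n;
}

int main() {
    auto cache = Cache(slow_square, 16);  // capacity request; rounded up internally to a power of two
    std::cout << cache(7) << '\n';        // miss: prints "computing 7..." and then 49
    std::cout << cache(7) << '\n';        // hit: prints only 49
}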
