Excessive calls to the hash function by unordered_map

Question

The following code results in unexplained calls to the hash function:

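#include <iostream>
#include <string>
#include <tuple>
#include <unordered_map>

using namespace std;

namespace foo {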
    using Position = tuple <int, int, int>;
    
    std::ostream& operator<<(std::ostream& out, const Position& pos) noexcept{
        return out << get<0>(pos) << ", " << get<1>(pos) << ", " << get<2>(pos);
    }

    struct hashFunc{
        std::size_t operator()(const Position& pos) const noexcept{
            int res = get<0>(pos) * 17 ^ get<1>(pos) * 11 ^ get<2>(pos);
            cout << "@@@ hash function called for key: " << pos 
                 << ", hash: " << res << endl;
            return res;
        }
    };

    template<typename T>
    void print_buckets(T&& map) {
        auto num_buckets = map.bucket_count();
        cout << "------------------------------" << endl;
        cout << "NUM BUCKETS: " << num_buckets << endl;
        for(size_t i=0; i<num_buckets; ++i) {
            auto bucket_size = map.bucket_size(i);
            if(bucket_size) {
                cout << "BUCKET " << i << " size: " << bucket_size << endl;        
            }
        }
        cout << "------------------------------" << endl;
    }
}

main:

using namespace foo;

int main() {
    // note: bucket_count specified
    unordered_map <Position, std::string, hashFunc> test(10); 
    
    auto x = tuple{1,0,0};
    auto z = tuple{0,1,0};
    auto w = tuple{0,0,1};
            
    cout << "==================================" << endl;
    cout << "about to insert: " << x << endl;
    test[x] =  "hello";
    print_buckets(test);
    cout << "after insert of: " << x << endl;
    
    cout << "==================================" << endl;
    cout << "about to insert: " << z << endl;
    test[z] = "hey";
    print_buckets(test);
    cout << "after insert of: " << z << endl;
    
    cout << "==================================" << endl;
    cout << "about to insert: " << w << endl;
    test.insert({w, "hello"});
    print_buckets(test);
    cout << "after insert of: " << w << endl;    
    cout << "==================================" << endl;
}

Output:

==================================
about to insert: 1, 0, 0
@@@ hash function called for key: 1, 0, 0, hash: 17
------------------------------
NUM BUCKETS: 11
BUCKET 6 size: 1
------------------------------
after insert of: 1, 0, 0
==================================
about to insert: 0, 1, 0
@@@ hash function called for key: 0, 1, 0, hash: 11
@@@ hash function called for key: 1, 0, 0, hash: 17   <= why?
------------------------------
NUM BUCKETS: 11
@@@ hash function called for key: 1, 0, 0, hash: 17   <= why?
BUCKET 0 size: 1
BUCKET 6 size: 1
------------------------------
after insert of: 0, 1, 0
==================================
about to insert: 0, 0, 1
@@@ hash function called for key: 0, 0, 1, hash: 1
@@@ hash function called for key: 0, 1, 0, hash: 11   <= why?
------------------------------
NUM BUCKETS: 11
@@@ hash function called for key: 1, 0, 0, hash: 17   <= why?
BUCKET 0 size: 1
@@@ hash function called for key: 0, 1, 0, hash: 11   <= why?
BUCKET 1 size: 1
BUCKET 6 size: 1
------------------------------
after insert of: 0, 0, 1
==================================

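Code (same behavior for gcc and clang)

Notes:

1. Tried the same without the bucket_count argument to the constructor; calls to the hash function become even more excessive, due to rehashing. But in the scenario above there seems to be no rehash and there are no collisions.

2. Related, but specifically on MSVC: Inserting to std::unordered_map calls hash function twice in MSVC++'s STL, bad design or special reason?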

Answer 1

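First, a couple of observations:

The unordered map is both a hash table and a singly-linked list.
Note that begin returns an iterator which models LegacyForwardIterator.
Inserting an entry into the map requires updating both the hash table and the linked list.

Second, a couple of notes on these containers' implementation decisions:

For singly-linked lists, it's common to have a sentinel node which doesn't contain any data (for something like Node<T>, it will still have a T, just default-initialized). We only want it for its next pointer, because it helps keep list operations regular (ie, we don't have to write insert-at-the-head and insert-after-node as different special cases).
For hash tables (assuming linked-list buckets, since that is required by the standard) we can either use Node table[N] (so each bucket has its own sentinel preallocated) or Node* table[N].
In this case, since we're actually using Node<T> and don't know the size of T, it seems reasonable to store a pointer for each bucket.

For a hash table which is also a singly-linked list, it makes sense to use the per-bucket list as (part of) the all-elements list. Otherwise we'd need to store two pointers per node, next_in_bucket and next_in_list.

This means that the "sentinel" (one-before-the-beginning) node a bucket points to is actually the last node of the previous bucket, except for the bucket at the front of the list, where it really is the overall list sentinel.

The comments in the code say

 /* ...
  * The non-empty buckets contain the node before the first node in the
  * bucket. This design makes it possible to implement something like a
  * std::forward_list::insert_after on container insertion and
  * std::forward_list::erase_after on container erase
  * calls. _M_before_begin is equivalent to
  * std::forward_list::before_begin. Empty buckets contain
  * nullptr. Note that one of the non-empty buckets contains
  * &_M_before_begin which is not a dereferenceable node so the
  * node pointer in a bucket shall never be dereferenced, only its
  * next node can be.
  * ...
  */

(the sentinel is _M_before_begin in this code)

So, when we add an element to an already-populated bucket, the steps are roughly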

void insert_to_non_empty_bucket(Node *n, Key k) {
  Node *sentinel = table[k];
  n->next = sentinel->next;
  sentinel->next = n;
}

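Note again that we don't know or care whether the sentinel here is the last element of the previous bucket or the overall list sentinel. The code is the same either way (which was one of the reasons for using a sentinel in the first place).

However, when we add the first element to an empty bucket (and it is not the only non-empty bucket), we have one extra step: we need to update the sentinel pointer for the next bucket so that it points to our new node. Otherwise we'd have two buckets both pointing to the list sentinel.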

void insert_to_empty_bucket(Node *n, Key k) {
  Node *sentinel = &list_sentinel; // ie, &_M_before_begin
  n->next = sentinel->next;
  sentinel->next = n;

  // update the *next* bucket in the table
  table[n->next->key] = n;
}

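Finally: in this implementation, Node doesn't cache the key, so there is no n->next->key. There is actually a trait controlling this, but it is clearly false in this case, which means that the final line has to recompute the hash in order to update the next bucket.

NB. Just to clarify, when I say previous bucket or next bucket, I'm only talking about position in the list, where buckets appear in the reverse order to when they became non-empty. It has nothing to do with position in the table, and does not imply any intrinsic ordering.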

Answer 2

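As others pointed out, an unordered map, which is just a form of hash table, is implemented in libstdc++ basically as a single ("global") linked list. In addition, there is an array of buckets that point into this list. What is important is that the pointer stored in bucket[i] does not point to the first node that belongs to this bucket (according to the hash function mapping), but to its predecessor in the global list. The reason is obvious: when you add an item into a singly-linked list, you need to update its predecessor. Here, when you need to insert an element into some bucket, you need to update the predecessor of the first node of this bucket.

However, the very first node of the global linked list does not have any predecessor. To make things uniform, there is a sentinel node that plays this role. In libstdc++, it is the member variable _M_before_begin.

Let's assume that we have a hash table with keys A and B that belong to bucket[0] and a key C that belongs to bucket[1]. It may, for instance, look as follows: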

global linked list          buckets[]
------------------          ---------

_M_before_begin  <--------  bucket[0]
       |
       v
node_with_key_A 
       |
       v
node_with_key_B  <--------  bucket[1]
       |
       v
node_with_key_C
       |
       x

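Now, when a new key, say D, is added into an empty bucket, say bucket[2], libstdc++ inserts it at the beginning of the global linked list.

Therefore, the situation after this insertion is as follows: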

global linked list          buckets[]
------------------          ---------

_M_before_begin  <--------  bucket[2]
       |
       v
node_with_key_D  <--------  bucket[0]
       |
       v
node_with_key_A 
       |
       v
node_with_key_B  <--------  bucket[1]
       |
       v
node_with_key_C
       |
       x

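Note that bucket[0], which corresponds to node_with_key_A (previously pointed to by _M_before_begin), needs to be updated. And since, as others have also pointed out, libstdc++ does not cache hash values by default, the only way to find the bucket index for node_with_key_A is to call the hash function.

Note that basically I just said the same as others, but wanted to add some illustrations that may help.

Another consequence of this approach is that the hash function might be called during lookup: https://godbolt.org/z/K6qhW. The reason is that the first element of a bucket is known, but not the last. Therefore, during the linked-list traversal, the hash function has to be evaluated for node keys to find out whether a node still belongs to the bucket being searched.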

Answer 3

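I can't explain why it is done that way, but this doesn't fit in a comment, so I'm leaving it here in the answer section. There are two relevant parts in the stdlib (10.1.0) when an element is inserted: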

__hash_code __code = __h->_M_hash_code(__k);

This computes the hash value of __k, the key of the element being inserted.

And later on, this part of the code:

    {
      // The bucket is empty, the new node is inserted at the
      // beginning of the singly-linked list and the bucket will
      // contain _M_before_begin pointer.
      __node->_M_nxt = _M_before_begin._M_nxt;
      _M_before_begin._M_nxt = __node;
      if (__node->_M_nxt)
        // We must update former begin bucket that is pointing to
        // _M_before_begin.
        _M_buckets[_M_bucket_index(__node->_M_next())] = __node;
      _M_buckets[__bkt] = &_M_before_begin;
    }

Here _M_bucket_index computes the hash for __node->_M_next(), where __node refers to the node created for __k.

Maybe that helps you or someone else to explain it further.

Answer 4

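Maybe it is down to the implementation of std::unordered_map. The hash value is not stored in every node, so the first element of the next bucket gets re-hashed when a new element is inserted or when a bucket size is calculated.

You can try using <tr1/unordered_map> to avoid this problem. Example: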

#include <tr1/unordered_map>
using std::tr1::unordered_map;

NOTE: I don't know whether tr1/unordered_map or unordered_map is better.
