What is the default hash function used in C++ std::unordered_map?

Question

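I am using

unordered_map<string, int>

and

unordered_map<int, int>

What hash function is used in each case, and what is the chance of collision in each case? I will be inserting unique strings and unique ints as keys, respectively.

I'm interested in the hash function algorithms used for string and int keys, and in their collision statistics.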

Answer 1

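The function object std::hash<> is used.

Standard specializations exist for all built-in types and for some other standard library types, such as std::string and std::thread::id. See the link for the full list.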

For other types to be used as keys in a std::unordered_map, you have to specialize std::hash<> or create your own function object.
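
As a minimal sketch of those two options: the `Employee` type, the `EmployeeIdHash` functor, and the combining formula below are all hypothetical, chosen only for illustration. The combiner is a simple, commonly seen pattern and not a tuned mixing function.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical key type, used only for illustration.
struct Employee {
    std::string name;
    int id;
    bool operator==(const Employee& other) const {
        return name == other.name && id == other.id;
    }
};

// Option 1: specialize std::hash<> for the user-defined type.
namespace std {
template <>
struct hash<Employee> {
    size_t operator()(const Employee& e) const noexcept {
        const size_t h1 = hash<string>{}(e.name);
        const size_t h2 = hash<int>{}(e.id);
        // Illustrative combiner only; any reasonable mixing step would do.
        return h1 ^ (h2 + 0x9e3779b9 + (h1 << 6) + (h1 >> 2));
    }
};
}  // namespace std

// Option 2: a standalone function object, passed as the third template argument.
struct EmployeeIdHash {
    std::size_t operator()(const Employee& e) const noexcept {
        return std::hash<int>{}(e.id);
    }
};

// Usage: both maps accept Employee keys.
inline void employee_map_example() {
    const Employee ann{"Ann", 1};
    std::unordered_map<Employee, int> by_default;             // uses std::hash<Employee>
    std::unordered_map<Employee, int, EmployeeIdHash> by_id;  // uses EmployeeIdHash
    by_default[ann] = 10;
    by_id[ann] = 20;
    assert(by_default.at(ann) == 10);
    assert(by_id.at(ann) == 20);
}
```

In both cases the key type also needs an equality comparison (here `operator==`), because the container uses it to tell apart keys that land in the same bucket.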

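The chance of collision is entirely implementation-dependent, but considering that integers are limited to a defined range while strings are theoretically unbounded in length, I'd say collisions are much more likely with strings.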
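
Since the numbers depend on the implementation, one way to get collision statistics is to measure them on your own toolchain. The sketch below uses the standard bucket interface (`bucket_count()` and `bucket_size()`); the `BucketStats` type and the helper names are hypothetical. Note that it counts keys sharing a bucket, which is not the same as two keys producing an identical hash value.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical summary of how keys are spread over buckets.
struct BucketStats {
    std::size_t elements = 0;        // total keys found across all buckets
    std::size_t shared_buckets = 0;  // buckets holding more than one key
    std::size_t longest_chain = 0;   // size of the fullest bucket
};

template <typename Map>
BucketStats bucket_stats(const Map& m) {
    BucketStats s;
    for (std::size_t b = 0; b < m.bucket_count(); ++b) {
        const std::size_t n = m.bucket_size(b);
        s.elements += n;
        if (n > 1) ++s.shared_buckets;
        if (n > s.longest_chain) s.longest_chain = n;
    }
    assert(s.elements == m.size());  // every key lives in exactly one bucket
    return s;
}

// Builds a map with n unique int keys.
inline std::unordered_map<int, int> make_int_map(int n) {
    std::unordered_map<int, int> m;
    for (int i = 0; i < n; ++i) m[i] = i;
    return m;
}

// Builds a map with n unique string keys ("key0", "key1", ...).
inline std::unordered_map<std::string, int> make_string_map(int n) {
    std::unordered_map<std::string, int> m;
    for (int i = 0; i < n; ++i) m["key" + std::to_string(i)] = i;
    return m;
}
```

Calling `bucket_stats(make_int_map(n))` and `bucket_stats(make_string_map(n))` for your expected key counts gives a rough picture of how each key type behaves with your standard library.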

As for the implementation in GCC, the specializations for built-in types just return the bit pattern. Here is how they are defined in bits/functional_hash.h:

  /// Partial specializations for pointer types.
  template<typename _Tp>
    struct hash<_Tp*> : public __hash_base<size_t, _Tp*>
    {
      size_t
      operator()(_Tp* __p) const noexcept
      { return reinterpret_cast<size_t>(__p); }
    };

  // Explicit specializations for integer types.
#define _Cxx_hashtable_define_trivial_hash(_Tp)     \
  template<>                        \
    struct hash<_Tp> : public __hash_base<size_t, _Tp>  \
    {                                                   \
      size_t                                            \
      operator()(_Tp __val) const noexcept              \
      { return static_cast<size_t>(__val); }            \
    };

  /// Explicit specialization for bool.
  _Cxx_hashtable_define_trivial_hash(bool)

  /// Explicit specialization for char.
  _Cxx_hashtable_define_trivial_hash(char)

  /// ...
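
To see whether the standard library you are building against behaves like the definitions above, a small check along these lines can help. The function name is hypothetical, and the result is implementation-specific: the standard does not require `std::hash<int>` to be the identity.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Returns true if std::hash<int> maps every value in [lo, hi] to
// static_cast<std::size_t>(value) on the current implementation.
inline bool int_hash_is_identity(int lo, int hi) {
    assert(lo <= hi);
    const std::hash<int> h;
    // Loop in long long so that hi == INT_MAX cannot overflow the counter.
    for (long long v = lo; v <= hi; ++v) {
        const int key = static_cast<int>(v);
        if (h(key) != static_cast<std::size_t>(key)) return false;
    }
    return true;
}
```

With an identity hash, unique int keys never share a hash value; whether they share a bucket then depends only on the bucket count.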

The specialization for std::string is defined as:

#ifndef _GLIBCXX_COMPATIBILITY_CXX0X
  /// std::hash specialization for string.
  template<>
    struct hash<string>
    : public __hash_base<size_t, string>
    {
      size_t
      operator()(const string& __s) const noexcept
      { return std::_Hash_impl::hash(__s.data(), __s.length()); }
    };

Some further searching leads us to:

struct _Hash_impl
{
  static size_t
  hash(const void* __ptr, size_t __clength,
       size_t __seed = static_cast<size_t>(0xc70f6907UL))
  { return _Hash_bytes(__ptr, __clength, __seed); }
  ...
};
...
// Hash function implementation for the nontrivial specialization.
// All of them are based on a primitive that hashes a pointer to a
// byte array. The actual hash algorithm is not guaranteed to stay
// the same from release to release -- it may be updated or tuned to
// improve hash quality or speed.
size_t
_Hash_bytes(const void* __ptr, size_t __len, size_t __seed);

_Hash_bytes is an external function from libstdc++. A bit more searching led me to this file, which states:

// This file defines Hash_bytes, a primitive used for defining hash
// functions. Based on public domain MurmurHashUnaligned2, by Austin
// Appleby.  http://murmurhash.googlepages.com/

Therefore, the default hashing algorithm GCC uses for strings is MurmurHashUnaligned2.

Answer 2

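GCC C++11 uses "MurmurHashUnaligned2", by Austin Appleby

Although the hashing algorithms are compiler-dependent, I'll present the one for GCC C++11. @Avidan Borisov astutely discovered that the GCC hashing algorithm used for strings is "MurmurHashUnaligned2", by Austin Appleby. I did some searching and found a mirrored copy of GCC on GitHub. Therefore:

The GCC C++11 hashing functions used for unordered_map (a hash table template) and unordered_set (a hash set template) appear to be as follows.

Thanks to Avidan Borisov for his background research on which hash functions GCC C++11 uses, stating that GCC uses an implementation of "MurmurHashUnaligned2", by Austin Appleby (see http://murmurhash.googlepages.com/ and https://github.com/aappleby/smhasher).
In the file "gcc/libstdc++-v3/libsupc++/hash_bytes.cc", here (https://github.com/gcc-mirror/gcc/blob/master/libstdc++-v3/libsupc++/hash_bytes.cc), I found the implementations. For example, here is the one for the "32-bit size_t" return value (pulled 11 Aug 2017):

Code: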

// Implementation of Murmur hash for 32-bit size_t.
size_t _Hash_bytes(const void* ptr, size_t len, size_t seed)
{
  const size_t m = 0x5bd1e995;
  size_t hash = seed ^ len;
  const char* buf = static_cast<const char*>(ptr);

  // Mix 4 bytes at a time into the hash.
  while (len >= 4)
  {
    size_t k = unaligned_load(buf);
    k *= m;
    k ^= k >> 24;
    k *= m;
    hash *= m;
    hash ^= k;
    buf += 4;
    len -= 4;
  }

  // Handle the last few bytes of the input array.
  switch (len)
  {
    case 3:
      hash ^= static_cast<unsigned char>(buf[2]) << 16;
      [[gnu::fallthrough]];
    case 2:
      hash ^= static_cast<unsigned char>(buf[1]) << 8;
      [[gnu::fallthrough]];
    case 1:
      hash ^= static_cast<unsigned char>(buf[0]);
      hash *= m;
  };

  // Do a few final mixes of the hash.
  hash ^= hash >> 13;
  hash *= m;
  hash ^= hash >> 15;
  return hash;
}
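
The listing above relies on an `unaligned_load` helper defined elsewhere in that file. For experimenting outside of GCC, here is a self-contained sketch of the same steps. The names `load_u32`, `murmur2_32`, and `murmur2_32_str` are hypothetical, fixed-width `std::uint32_t` stands in for a 32-bit `size_t`, and the default seed is the value shown in `_Hash_impl` earlier. It is not guaranteed to reproduce what `std::hash<std::string>` returns on any particular platform, in particular on 64-bit targets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Reads 4 bytes in native byte order without assuming alignment
// (a stand-in for the unaligned_load helper used in the listing).
inline std::uint32_t load_u32(const char* p) {
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// 32-bit Murmur-style hash following the listing, with fixed-width types.
inline std::uint32_t murmur2_32(const void* ptr, std::size_t len, std::uint32_t seed) {
    assert(ptr != nullptr || len == 0);
    const std::uint32_t m = 0x5bd1e995;
    std::uint32_t hash = seed ^ static_cast<std::uint32_t>(len);
    const char* buf = static_cast<const char*>(ptr);

    // Mix 4 bytes at a time into the hash.
    while (len >= 4) {
        std::uint32_t k = load_u32(buf);
        k *= m;
        k ^= k >> 24;
        k *= m;
        hash *= m;
        hash ^= k;
        buf += 4;
        len -= 4;
    }

    // Handle the last few bytes of the input.
    switch (len) {
        case 3:
            hash ^= static_cast<std::uint32_t>(static_cast<unsigned char>(buf[2])) << 16;
            // fall through
        case 2:
            hash ^= static_cast<std::uint32_t>(static_cast<unsigned char>(buf[1])) << 8;
            // fall through
        case 1:
            hash ^= static_cast<std::uint32_t>(static_cast<unsigned char>(buf[0]));
            hash *= m;
            break;
        default:
            break;
    }

    // Final mixing steps.
    hash ^= hash >> 13;
    hash *= m;
    hash ^= hash >> 15;
    return hash;
}

// Convenience wrapper for std::string; the default seed is the one from _Hash_impl.
inline std::uint32_t murmur2_32_str(const std::string& s,
                                    std::uint32_t seed = 0xc70f6907u) {
    return murmur2_32(s.data(), s.length(), seed);
}
```

Using `memcpy` for the load keeps the read well-defined for any alignment; the result depends on the machine's byte order, so hashes are not portable across endianness.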

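The latest revision of Austin Appleby's hashing functions is "MurmurHash3", which is released into the public domain!

Austin states in his readme:

The SMHasher suite also includes MurmurHash3, which is the latest version in the series of MurmurHash functions. The new version is faster, more robust, and its variants can produce 32- and 128-bit hash values efficiently on both x86 and x64 platforms.

For MurmurHash3's source code, see here:

MurmurHash3.h
MurmurHash3.cpp

And the best part? It's public domain software. That's right! The tops of the files state: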

 // MurmurHash3 was written by Austin Appleby, and is placed in the public
 // domain. The author hereby disclaims copyright to this source code.

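So, if you'd like to use MurmurHash3 in your open source software, personal projects, or proprietary software, including for implementing your own hash tables in C, go for it!

If you'd like build instructions for building and testing his MurmurHash3 code, I've written some here: https://github.com/ElectricRCAircraftGuy/smhasher/blob/add_build_instructions/build/README.md. Hopefully the PR I've opened gets accepted, and then they will end up in his main repo. Until then, refer to the build instructions in my fork.

For additional hashing functions, including `djb2`, and the 2 versions of the K&R hashing functions...

...(one apparently terrible, one pretty good), see my other answer here: hash function for string.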

2 answers

Answer 1
118, accepted, 2013-10-16 19:18:38

Answer 2
8, 2017-08-11 18:48:29
