What is the default hash function used in C++ std::unordered_map?

I am using

unordered_map<string, int>

and

unordered_map<int, int>

What hash function is used in each case, and what is the chance of collision in each case? I will be inserting unique strings and unique ints as keys, respectively.

I am interested in knowing the hash function algorithm used for string and int keys, and their collision statistics.

The function object std::hash<> is used.

Standard specializations exist for all built-in types, and for some other standard library types such as std::string and std::thread::id. See the link for the full list.
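To make the point concrete, here is a minimal sketch that checks, at compile time, which hasher type the two maps from the question use by default. The helper name `map_uses_std_hash` is hypothetical and only there for illustration.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <type_traits>
#include <unordered_map>

// The default Hash template argument of std::unordered_map<Key, T> is std::hash<Key>.
static_assert(std::is_same<std::unordered_map<std::string, int>::hasher,
                           std::hash<std::string>>::value,
              "string-keyed map uses std::hash<std::string> by default");
static_assert(std::is_same<std::unordered_map<int, int>::hasher,
                           std::hash<int>>::value,
              "int-keyed map uses std::hash<int> by default");

// Hypothetical helper: returns true when the map's own hasher agrees with a
// freshly constructed std::hash<std::string> for the given key.
inline bool map_uses_std_hash(const std::string& key) {
    std::unordered_map<std::string, int> counts;
    return counts.hash_function()(key) == std::hash<std::string>{}(key);
}
```

Calling `map_uses_std_hash("hello")` should return true on any conforming implementation, since the hash value depends only on the key for the duration of the program.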

For other types to be used in a std::unordered_map, you will have to specialize std::hash<> or create your own function object.
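Both options can be sketched as follows. The `Employee` type, the `EmployeeHash` functor and the combine step are assumptions made up for this example; the combine step in particular is a simple illustrative mix, not a tuned one.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical key type, used only for illustration.
struct Employee {
    std::string name;
    int id;
    bool operator==(const Employee& other) const {
        return name == other.name && id == other.id;
    }
};

// Option 1: a standalone function object, passed as the third template argument.
struct EmployeeHash {
    std::size_t operator()(const Employee& e) const noexcept {
        const std::size_t h1 = std::hash<std::string>{}(e.name);
        const std::size_t h2 = std::hash<int>{}(e.id);
        // Simple illustrative combine step (assumption, not a tuned mixer).
        return h1 ^ (h2 + 0x9e3779b9u + (h1 << 6) + (h1 >> 2));
    }
};

// Option 2: specialize std::hash so the map's default template argument works.
namespace std {
template <>
struct hash<Employee> {
    size_t operator()(const Employee& e) const noexcept {
        return EmployeeHash{}(e);
    }
};
}  // namespace std

// Stores one value under the same key in both kinds of map and returns their sum.
inline int lookup_demo() {
    std::unordered_map<Employee, int, EmployeeHash> by_functor;
    std::unordered_map<Employee, int> by_specialization;
    by_functor[Employee{"Ada", 1}] = 10;
    by_specialization[Employee{"Ada", 1}] = 20;
    return by_functor.at(Employee{"Ada", 1}) + by_specialization.at(Employee{"Ada", 1});
}
```

Note that the key type also needs `operator==` (or a custom equality predicate), because the map compares keys that land in the same bucket.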

The chance of collision is completely implementation-dependent, but considering that integers are confined to a fixed range while strings can in principle be arbitrarily long, I'd say there is a much better chance of collision with strings.
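Since the numbers depend on the implementation, one way to get a feel for them is to measure. The sketch below counts how many keys in a set of unique keys produce a hash value that was already seen. All names here (`count_hash_collisions`, `ConstantHash`, `make_string_keys`) are hypothetical helpers, and the count covers hash-value collisions only: the container additionally maps hash values to buckets, so keys with distinct hashes can still share a bucket.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

// Deliberately bad hasher, for contrast only: every key gets the same value.
struct ConstantHash {
    template <typename T>
    std::size_t operator()(const T&) const noexcept { return 0; }
};

// Counts keys whose hash value was already produced by an earlier key.
// Assumes the keys themselves are unique.
template <typename Key, typename Hasher = std::hash<Key>>
std::size_t count_hash_collisions(const std::vector<Key>& keys,
                                  Hasher hasher = Hasher{}) {
    std::unordered_set<std::size_t> seen;
    std::size_t collisions = 0;
    for (const Key& key : keys) {
        if (!seen.insert(hasher(key)).second) {
            ++collisions;
        }
    }
    return collisions;
}

// Builds n distinct string keys of the form "key0", "key1", ...
inline std::vector<std::string> make_string_keys(std::size_t n) {
    std::vector<std::string> keys;
    keys.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        keys.push_back("key" + std::to_string(i));
    }
    return keys;
}
```

Running `count_hash_collisions(make_string_keys(n))` with the default hasher and comparing it against the `ConstantHash` worst case gives a rough picture for a particular standard library; the result for the default hasher will vary between implementations.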

As for the implementation in GCC, the specializations for built-in types just return the bit pattern. Here's how they are defined in bits/functional_hash.h:

  /// Partial specializations for pointer types.
  template<typename _Tp>
    struct hash<_Tp*> : public __hash_base<size_t, _Tp*>
    {
      size_t
      operator()(_Tp* __p) const noexcept
      { return reinterpret_cast<size_t>(__p); }
    };

  // Explicit specializations for integer types.
#define _Cxx_hashtable_define_trivial_hash(_Tp)     \
  template<>                        \
    struct hash<_Tp> : public __hash_base<size_t, _Tp>  \
    {                                                   \
      size_t                                            \
      operator()(_Tp __val) const noexcept              \
      { return static_cast<size_t>(__val); }            \
    };

  /// Explicit specialization for bool.
  _Cxx_hashtable_define_trivial_hash(bool)

  /// Explicit specialization for char.
  _Cxx_hashtable_define_trivial_hash(char)

  /// ...
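If the excerpt above is what your standard library ships, hashing an integer should simply yield that integer converted to size_t. A small sketch to check this on a given toolchain follows; `hash_is_plain_cast` is a hypothetical helper name, and the result is only expected to be true for libstdc++-style trivial hashes.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical helper: true when std::hash<T> returns the value converted to
// size_t, which is what the libstdc++ excerpt does for integer types.
// Other standard libraries are free to do something else.
template <typename T>
bool hash_is_plain_cast(T value) {
    return std::hash<T>{}(value) == static_cast<std::size_t>(value);
}
```

With such an implementation, distinct int keys always get distinct hash values, because an int fits in a size_t on the usual platforms.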

The specialization for std::string is defined as:

#ifndef _GLIBCXX_COMPATIBILITY_CXX0X
  /// std::hash specialization for string.
  template<>
    struct hash<string>
    : public __hash_base<size_t, string>
    {
      size_t
      operator()(const string& __s) const noexcept
      { return std::_Hash_impl::hash(__s.data(), __s.length()); }
    };

Some further searching leads us to:

struct _Hash_impl
{
  static size_t
  hash(const void* __ptr, size_t __clength,
       size_t __seed = static_cast<size_t>(0xc70f6907UL))
  { return _Hash_bytes(__ptr, __clength, __seed); }
  ...
};
...
// Hash function implementation for the nontrivial specialization.
// All of them are based on a primitive that hashes a pointer to a
// byte array. The actual hash algorithm is not guaranteed to stay
// the same from release to release -- it may be updated or tuned to
// improve hash quality or speed.
size_t
_Hash_bytes(const void* __ptr, size_t __len, size_t __seed);

_Hash_bytes is an external function from libstdc++. A bit more searching led me to this file, which states:

// This file defines Hash_bytes, a primitive used for defining hash
// functions. Based on public domain MurmurHashUnaligned2, by Austin
// Appleby.  http://murmurhash.googlepages.com/

So the default hashing algorithm GCC uses for strings is MurmurHashUnaligned2.

GCC C++11 uses "MurmurHashUnaligned2", by Austin Appleby

Though the hashing algorithms are compiler-dependent, I'll present the one for GCC C++11. @Avidan Borisov astutely discovered that the GCC hashing algorithm used for strings is "MurmurHashUnaligned2", by Austin Appleby. I did some searching and found a mirrored copy of GCC on GitHub. Therefore:

The GCC C++11 hashing functions used for unordered_map (a hash table template) and unordered_set (a hash set template) appear to be as follows.

Code:

// Implementation of Murmur hash for 32-bit size_t.
size_t _Hash_bytes(const void* ptr, size_t len, size_t seed)
{
  const size_t m = 0x5bd1e995;
  size_t hash = seed ^ len;
  const char* buf = static_cast<const char*>(ptr);

  // Mix 4 bytes at a time into the hash.
  while (len >= 4)
  {
    size_t k = unaligned_load(buf);
    k *= m;
    k ^= k >> 24;
    k *= m;
    hash *= m;
    hash ^= k;
    buf += 4;
    len -= 4;
  }

  // Handle the last few bytes of the input array.
  switch (len)
  {
    case 3:
      hash ^= static_cast<unsigned char>(buf[2]) << 16;
      [[gnu::fallthrough]];
    case 2:
      hash ^= static_cast<unsigned char>(buf[1]) << 8;
      [[gnu::fallthrough]];
    case 1:
      hash ^= static_cast<unsigned char>(buf[0]);
      hash *= m;
  };

  // Do a few final mixes of the hash.
  hash ^= hash >> 13;
  hash *= m;
  hash ^= hash >> 15;
  return hash;
}
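The listing above relies on an `unaligned_load` helper that is not shown. For experimenting outside the library, here is a self-contained sketch of the same 32-bit routine. It uses fixed-width `std::uint32_t` so it behaves the same on 64-bit hosts, and `load_u32`, `murmur2_32` and `murmur2_32_string` are hypothetical names, with a memcpy-based load standing in for the library's internal helper. It is not guaranteed to match what your libstdc++ returns, in particular where size_t is 64 bits and a different variant is used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Reads 4 bytes from a possibly unaligned address, in host byte order.
// Stand-in for the library's internal unaligned_load helper (assumption).
inline std::uint32_t load_u32(const char* p) {
    std::uint32_t v;
    std::memcpy(&v, p, sizeof(v));
    return v;
}

// Sketch of the 32-bit Murmur variant from the listing, with fixed-width types.
inline std::uint32_t murmur2_32(const void* ptr, std::size_t len, std::uint32_t seed) {
    const std::uint32_t m = 0x5bd1e995u;
    std::uint32_t hash = seed ^ static_cast<std::uint32_t>(len);
    const char* buf = static_cast<const char*>(ptr);

    // Mix 4 bytes at a time into the hash.
    while (len >= 4) {
        std::uint32_t k = load_u32(buf);
        k *= m;
        k ^= k >> 24;
        k *= m;
        hash *= m;
        hash ^= k;
        buf += 4;
        len -= 4;
    }

    // Handle the last 1 to 3 bytes of the input.
    switch (len) {
        case 3:
            hash ^= static_cast<std::uint32_t>(static_cast<unsigned char>(buf[2])) << 16;
            // fall through
        case 2:
            hash ^= static_cast<std::uint32_t>(static_cast<unsigned char>(buf[1])) << 8;
            // fall through
        case 1:
            hash ^= static_cast<std::uint32_t>(static_cast<unsigned char>(buf[0]));
            hash *= m;
            break;
        default:
            break;
    }

    // Final mixing steps.
    hash ^= hash >> 13;
    hash *= m;
    hash ^= hash >> 15;
    return hash;
}

// Convenience wrapper; the default seed is the one shown in _Hash_impl::hash.
inline std::uint32_t murmur2_32_string(const std::string& s,
                                       std::uint32_t seed = 0xc70f6907u) {
    return murmur2_32(s.data(), s.size(), seed);
}
```

For example, `murmur2_32_string("hello world")` hashes the 11 bytes of the string with the default seed. Because whole words are loaded in host byte order, results can differ between little-endian and big-endian machines.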

The latest version of Austin Appleby's hashing functions is "MurmurHash3", which is released into the public domain!

Austin states in his readme:

The SMHasher suite also includes MurmurHash3, which is the latest version in the series of MurmurHash functions - the new version is faster, more robust, and its variants can produce 32- and 128-bit hash values efficiently on both x86 and x64 platforms.

For MurmurHash3's source code, see here:

  1. MurmurHash3.h
  2. MurmurHash3.cpp

And the great thing is? It's public domain software. That's right! The tops of the files state:

 // MurmurHash3 was written by Austin Appleby, and is placed in the public
 // domain. The author hereby disclaims copyright to this source code.

So, if you'd like to use MurmurHash3 in your open source software, personal projects, or proprietary software, including for implementing your own hash tables in C, go for it!

If you'd like instructions to build and test his MurmurHash3 code, I've written some here: https://github.com/ElectricRCAircraftGuy/smhasher/blob/add_build_instructions/build/README.md. Hopefully the PR I've opened gets accepted, and then they will end up in his main repo. Until then, refer to the build instructions in my fork.

For additional hashing functions, including djb2 and the 2 versions of the K&R hashing functions (one apparently terrible, one pretty good), see my other answer here: hash function for string.
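As a rough illustration of how such an alternative hash could be plugged into std::unordered_map, here is a sketch using djb2 as it is commonly described (start at 5381, then multiply by 33 and add each byte). The `Djb2Hash` functor and `djb2_map_demo` function are hypothetical names for this example, and this is not a recommendation over the library default.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// djb2 as commonly described: start at 5381, then hash = hash * 33 + byte.
struct Djb2Hash {
    std::size_t operator()(const std::string& s) const noexcept {
        std::size_t hash = 5381;
        for (unsigned char c : s) {
            hash = hash * 33 + c;
        }
        return hash;
    }
};

// Uses the custom hasher as the third template argument of the map.
inline int djb2_map_demo() {
    std::unordered_map<std::string, int, Djb2Hash> ages;
    ages["alice"] = 30;
    ages["bob"] = 25;
    return ages.at("alice") + ages.at("bob");
}
```

For the single-character string "a" (byte value 97) this gives 5381 * 33 + 97 = 177670.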
