C++: some questions about boost::unordered_map & boost::hash

Question

I've only recently started delving into Boost and its containers, and I have read a few articles on the web and on Stack Overflow saying that boost::unordered_map is the fastest-performing container for big collections. So, I have this class State, which must be unique in the container (no duplicates), and there will be millions if not billions of states in the container. Therefore I have been trying to optimize it for small size and as few computations as possible. I was using a boost::ptr_vector before, but as I read on Stack Overflow, a vector is only good as long as there are not that many objects in it. In my case, the State describes sensorimotor information from a robot, so there can be an enormous number of states, and fast lookup is therefore of utmost priority. Following the Boost documentation for unordered_map, I realized that there are two things I could do to speed things up: use a hash_function, and use an equality operator to compare States based on their hash_function. So, I implemented a private hash() function which takes in State information and, using boost::hash_combine, creates a std::size_t hash value. The operator== basically compares the states' hash values. So:

Is std::size_t enough to cover billions of possible hash_function combinations? In order to avoid duplicate states I intend to use their hash_values.
When creating a state_map, should I use the State* or the hash value as the key? I.e.: boost::unordered_map<State*,std::size_t> state_map; or boost::unordered_map<std::size_t,State*> state_map;
Are lookups with a boost::unordered_map::iterator = state_map.find() faster than going through a boost::ptr_vector and comparing each iterator's key value?
Finally, any tips or tricks on how to optimize such an unordered map for speed and fast lookups would be greatly appreciated.

EDIT: I have seen quite a few answers, one saying not to use Boost but C++0x, another not to use an unordered_set, but to be honest, I still want to see how boost::unordered_set is used with a hash function. I have followed Boost's documentation and implemented it, but I still cannot figure out how to use Boost's hash function with the unordered set.

Answer 1

This is a bit muddled.

What you describe are not "things that you can do to speed things up"; rather, they are mandatory requirements for your type to be eligible as the element type of an unordered map, and also of an unordered set (which you might rather want).
You need to provide an equality operator that compares objects, not hash values. The whole point of the equality is to distinguish elements with the same hash.
size_t is an unsigned integral type, 32 bits on x86 and 64 bits on x64. Since you want "billions of elements", which means many gigabytes of data, I assume you have a solid x64 machine anyway.
What's crucial is that your hash function is good, i.e. has few collisions.
You want a set, not a map. Put the objects directly in the set: std::unordered_set<State>. Use a map if you are mapping to something, i.e. states to something else. Oh, and use C++0x, not Boost, if you can.
Using hash_combine is good.

Baby example:

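#include <cstddef>
#include <functional>
#include <unordered_set>

struct State
{
  // Must compare the actual contents of the two states, not their hashes.
  bool operator==(const State &) const;
  /* Stuff */
};

namespace std
{
  // Specializing std::hash makes State usable with the default hasher.
  template <> struct hash<State>
  {
    std::size_t operator()(const State & s) const
    {
      /* your hash algorithm here; it must return a std::size_t */
    }
  };
}

std::size_t Foo(const State & s) { /* some code returning a std::size_t */ }

int main()
{
  std::unordered_set<State> states; // uses std::hash<State>; no extra data needed

  // Another hash function: the template argument is the hasher's *type*,
  // and the function itself is passed to the constructor (after a bucket count).
  std::unordered_set<State, std::size_t (*)(const State &)> other_states(16, Foo);
}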
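To make the skeleton above concrete, here is a minimal sketch with the placeholders filled in. The three integer fields of State and the helper add_state are assumptions made up for illustration, and hash_combine is a hand-written stand-in modelled on the classic boost::hash_combine mixing step so that the example needs only the standard library. With Boost itself, the usual route, as far as I know, is to provide a free function hash_value(const State &) next to State, which boost::hash<State> (the default hasher of boost::unordered_set) then picks up.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_set>

// Hypothetical State: the three fields are assumptions for illustration.
struct State
{
  std::int32_t left_sensor;
  std::int32_t right_sensor;
  std::int32_t motor;

  // Equality compares the data itself, never the hash values.
  bool operator==(const State & other) const
  {
    return left_sensor == other.left_sensor
        && right_sensor == other.right_sensor
        && motor == other.motor;
  }
};

// Hand-written stand-in for boost::hash_combine (not the Boost function itself).
inline void hash_combine(std::size_t & seed, std::size_t value)
{
  seed ^= value + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

namespace std
{
  template <> struct hash<State>
  {
    std::size_t operator()(const State & s) const
    {
      // Fold the hash of every field that takes part in operator==.
      std::size_t seed = 0;
      hash_combine(seed, std::hash<std::int32_t>()(s.left_sensor));
      hash_combine(seed, std::hash<std::int32_t>()(s.right_sensor));
      hash_combine(seed, std::hash<std::int32_t>()(s.motor));
      return seed;
    }
  };
}

// Returns true if the state was new, false if an equal state was already stored.
inline bool add_state(std::unordered_set<State> & states, const State & s)
{
  return states.insert(s).second;
}
```

Calling add_state twice with the same field values stores the state once: the second call returns false because insert() finds an equal element.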

Answer 2

An unordered_map is a hash table. You don't store the hash yourself; hashing is done internally as the storage and lookup method.

Given your requirements, an unordered_set might be more appropriate, since your object is the only item to store.

You are a little confused, though: the equality operator and hash function are not really performance items, but are required for nontrivial objects so that the container works correctly. A good hash function will distribute your nodes evenly across the buckets, and the equality operator will be used to remove any ambiguity about matches based on the hash function.

std::size_t is fine for the hash function. Remember that no hash is perfect; there will be collisions, and the colliding items are stored in a linked list at that bucket position.
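A small sketch of why the equality operator matters once hashes collide. The Reading type and the deliberately useless ConstantHash are assumptions for illustration only: every element hashes to the same value, yet the set still tells distinct values apart because it falls back on operator==.

```cpp
#include <cstddef>
#include <unordered_set>

// Hypothetical element type with a real, content-based equality.
struct Reading
{
  int value;
  bool operator==(const Reading & other) const { return value == other.value; }
};

// Deliberately terrible hash: every object collides with every other.
struct ConstantHash
{
  std::size_t operator()(const Reading &) const { return 0; }
};

typedef std::unordered_set<Reading, ConstantHash> CollidingSet;

// Inserts three distinct readings, one of them twice, and returns how many
// elements the set kept.
inline std::size_t count_distinct_despite_collisions()
{
  CollidingSet set;
  set.insert(Reading{1});
  set.insert(Reading{2});
  set.insert(Reading{3});
  set.insert(Reading{2}); // equal to a stored element, so it is rejected
  return set.size();
}
```

The set ends up with three elements. Had operator== compared hash values instead, all four inserts would have looked like the same element and only one would have been kept.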

Thus, .find() will be O(1) in the optimal case and very close to O(1) in the average case (and O(N) in the worst case, but a decent hash function will avoid that).
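For comparison with scanning a vector, here is a rough sketch of the two lookup styles side by side, using plain long keys as a stand-in for real states (the function names are made up for illustration). It also shows reserve(), which sizes the bucket array once up front; that is usually worthwhile when the final element count is roughly known.

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Linear scan: compares the key against elements one by one, O(N).
inline bool contains_linear(const std::vector<long> & v, long key)
{
  return std::find(v.begin(), v.end(), key) != v.end();
}

// Hash lookup: hashes the key, then inspects a single bucket, O(1) on average.
inline bool contains_hashed(const std::unordered_set<long> & s, long key)
{
  return s.find(key) != s.end();
}

// Builds a set holding the keys 0..n-1. reserve(n) allocates enough buckets
// for n elements so the table is not rehashed repeatedly while it grows.
inline std::unordered_set<long> make_set(std::size_t n)
{
  std::unordered_set<long> s;
  s.reserve(n);
  for (std::size_t i = 0; i < n; ++i)
    s.insert(static_cast<long>(i));
  return s;
}
```

Both functions give the same answers; the difference is only in how the cost grows with the number of stored elements.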

You don't mention your platform or architecture; at billions of entries you still might have to worry about out-of-memory situations, depending on those and on the size of your State object.

Answer 3

Forget about hash; there is nothing (at least in your question) that suggests you have a meaningful key.

Let's take a step back and rephrase your actual performance goal:

you want to quickly validate that no duplicates ever exist for any of your State objects

Comment if I need to add others.

From the aforementioned goal, and from your comment, I would suggest you actually use an ordered set (std::set) rather than an unordered_map. Yes, the ordered lookup is a binary search, O(log(n)), while the unordered lookup is O(1).

However, the difference is that with this approach you need the ordered set ONLY to check that a similar state doesn't already exist when you are about to create a new one, that is, at State creation time.

In all the other lookups, you don't actually need to look into the ordered set, because you already have the key, State*, and the key can access the value through the magic dereference operator: *key

So with this approach, you are using the ordered set only as an index to verify States at creation time. In all the other cases, you access your State with the dereference operator of your pointer-value key.

If all the above wasn't enough to convince you, here is the final nail in the coffin of the idea of using a hash to quickly determine equality: a hash function has a small probability of collision, but as the number of states grows, that probability approaches certainty. So depending on your fault tolerance, you are going to have to deal with state collisions (and from your question and the number of States you are expecting to handle, it seems you will deal with a lot of them).
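To put rough numbers on that, here is a back-of-the-envelope sketch using the standard birthday approximation. It assumes an idealised hash whose values are uniformly random, which a real hash function only approximates, and the function names are made up for illustration.

```cpp
#include <cmath>

// Expected number of colliding pairs among n uniformly random hash values of
// the given bit width: there are n(n-1)/2 pairs, and each pair collides with
// probability 2^-bits.
inline double expected_colliding_pairs(double n, int bits)
{
  return n * (n - 1.0) / 2.0 / std::pow(2.0, bits);
}

// Birthday approximation for the probability of at least one collision.
inline double collision_probability(double n, int bits)
{
  return 1.0 - std::exp(-expected_colliding_pairs(n, bits));
}
```

Under that assumption, a billion states with a 32-bit hash give on the order of a hundred million colliding pairs, so a collision is effectively certain. With a 64-bit hash the expected number of colliding pairs is about 0.027, which is small but not zero, and a weaker real-world hash will do worse. Either way, equality decided by hash alone can silently merge two different states.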

For this to work, you obviously need the compare predicate to test all the internal properties of your state (gyroscope, thrusters, accelerometers, proton rays, etc.).
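A minimal sketch of this approach, assuming std::set as the ordered set. The RobotState fields, the RobotStateLess predicate and the intern_state helper are hypothetical names chosen for illustration. The predicate orders states over all of their fields via std::tie, and the set is consulted only when a state is created; afterwards the returned pointer is the handle.

```cpp
#include <set>
#include <tuple>

// Hypothetical robot state; the fields are assumptions for illustration.
struct RobotState
{
  int gyroscope;
  int thruster;
  int accelerometer;
};

// Strict weak ordering over *all* fields, so two states count as duplicates
// only if every field matches.
struct RobotStateLess
{
  bool operator()(const RobotState & a, const RobotState & b) const
  {
    return std::tie(a.gyroscope, a.thruster, a.accelerometer)
         < std::tie(b.gyroscope, b.thruster, b.accelerometer);
  }
};

typedef std::set<RobotState, RobotStateLess> StateIndex;

// Creation-time check: returns a pointer to the stored state, which is either
// the newly inserted one or the equivalent state that was already present.
inline const RobotState * intern_state(StateIndex & index, const RobotState & s)
{
  return &*index.insert(s).first;
}
```

Pointers to elements of a std::set stay valid until that element is erased, so the returned pointer can be kept and dereferenced later without another lookup.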


3 answers

Answer 1
4 Accepted 2011-07-14 00:30:52

Answer 2
2 2011-07-14 00:28:56

Answer 3
2 2011-07-14 01:03:05
