
Fast hash function for `std::vector`

I implemented this solution for getting a hash value from vector<T>:

#include <cstddef>
#include <functional>
#include <vector>

// boost-style hash_combine; must be declared before the specialization uses it
template <class T>
inline void hash_combine(std::size_t& seed, T const& v)
{
    seed ^= std::hash<T>()(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

namespace std
{
    template<typename T>
    struct hash<vector<T>>
    {
        typedef vector<T> argument_type;
        typedef std::size_t result_type;
        result_type operator()(argument_type const& in) const
        {
            size_t size = in.size();
            size_t seed = 0;
            for (size_t i = 0; i < size; i++)
                // Combine the hash of the current element with the hashes of the previous ones
                hash_combine(seed, in[i]);
            return seed;
        }
    };
}

But this solution doesn't scale at all: with a vector<double> of 10 million elements, it takes more than 2.5 s (according to VS).

Does a fast hash function exist for this scenario?

Notice that creating a hash value from the vector's address is not a feasible solution, since the related unordered_map will be used across different runs, and two vector<double>s with the same content but different addresses would be mapped differently (undesired behavior for this application).

NOTE: As per the comments, you get a 25-50x speed-up by compiling with optimizations. Do that first. Then, if it's still too slow, see below.


I don't think there's much you can do. You have to touch all the elements, and that combination function is about as fast as it gets.

One option may be to parallelize the hash function. If you have 8 cores, you can run 8 threads that each hash 1/8th of the vector, then combine the 8 resulting values at the end. The synchronization overhead may be worth it for very large vectors.
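A minimal sketch of that parallel idea (the function names and chunking scheme are my own, not from the answer): each task hashes one contiguous slice with the same boost-style combine step, and the per-slice seeds are folded together in slice order at the end.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

// Same boost-style combine step as in the question.
inline void hash_combine(std::size_t& seed, std::size_t h)
{
    seed ^= h + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Hash one contiguous slice [first, last) of the vector.
inline std::size_t hash_slice(const double* first, const double* last)
{
    std::size_t seed = 0;
    std::hash<double> h;
    for (; first != last; ++first)
        hash_combine(seed, h(*first));
    return seed;
}

// Hypothetical parallel hasher: split into n_chunks slices, hash each
// asynchronously, then fold the slice seeds together in a fixed order
// so the result is deterministic for a given n_chunks.
inline std::size_t parallel_hash(const std::vector<double>& v, std::size_t n_chunks)
{
    if (v.empty() || n_chunks == 0) return 0;
    std::vector<std::future<std::size_t>> parts;
    const std::size_t chunk = (v.size() + n_chunks - 1) / n_chunks;
    for (std::size_t begin = 0; begin < v.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, v.size());
        parts.push_back(std::async(std::launch::async,
            hash_slice, v.data() + begin, v.data() + end));
    }
    std::size_t seed = 0;
    for (auto& f : parts)   // combine in slice order
        hash_combine(seed, f.get());
    return seed;
}
```

Note that the result depends on n_chunks, so fix the chunk count (rather than deriving it from the machine's core count) if hashes must be comparable across runs and machines.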

The approach that MSVC's old hashmap used was to sample less often.

This means that isolated changes won't show up in your hash, but the thing you are trying to avoid is reading and processing the entire 80 MB of data in order to hash your vector. Not reading some elements is pretty much unavoidable.

The second thing you should do is not specialize std::hash for all vectors: this may make your program ill-formed (as suggested by a defect resolution whose status I do not recall), and at the least it is a bad plan (as the std is sure to permit itself to add hash combining and hashing of vectors).

When I write a custom hash, I usually use ADL (Koenig lookup) to make it easy to extend.

#include <array>
#include <cstddef>
#include <functional>
#include <vector>

namespace my_utils {
  namespace hash_impl {
    namespace details {
      namespace adl {
        template<class T>
        std::size_t hash(T const& t) {
          return std::hash<T>{}(t);
        }
      }
      template<class T>
      std::size_t hasher(T const& t) {
        using adl::hash;
        return hash(t);
      }
    }
    struct hash_tag {};
    template<class T>
    std::size_t hash(hash_tag, T const& t) {
      return details::hasher(t);
    }
    template<class T>
    std::size_t hash_combine(hash_tag, std::size_t seed, T const& t) {
      seed ^= hash(hash_tag{}, t) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
      return seed;
    }
    template<class Container>
    std::size_t fast_hash_random_container(hash_tag, Container const& c) {
      std::size_t size = c.size();
      std::size_t stride = 1 + size/10;
      std::size_t r = hash(hash_tag{}, size);
      for (std::size_t i = 0; i < size; i += stride) {
        r = hash_combine(hash_tag{}, r, c.data()[i]);
      }
      return r;
    }
    // std specializations go here:
    template<class T, class A>
    std::size_t hash(hash_tag, std::vector<T,A> const& v) {
      return fast_hash_random_container(hash_tag{}, v);
    }
    template<class T, std::size_t N>
    std::size_t hash(hash_tag, std::array<T,N> const& a) {
      return fast_hash_random_container(hash_tag{}, a);
    }
    // etc
  }
  struct my_hasher {
    template<class T>
    std::size_t operator()(T const& t) const {
      return hash_impl::hash(hash_impl::hash_tag{}, t);
    }
  };
}

Now my_hasher is a universal hasher. It uses either hashes declared in my_utils::hash_impl (for std types), or free functions called hash that will hash a given type, to hash things. Failing that, it tries to use std::hash<T>. If that fails, you get a compile-time error.

Writing a free hash function in the namespace of the type you want to hash tends to be less annoying than having to go off and open std and specialize std::hash, in my experience.
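As a small self-contained sketch of that pattern (the names hash_value, adl_fallback, and myapp::point are invented for illustration), an unqualified call behind a using-declaration lets ADL find a free hash function in the argument's own namespace, with std::hash as the fallback:

```cpp
#include <cstddef>
#include <functional>
#include <string>

namespace adl_fallback {
    // Fallback: delegate to std::hash.
    template<class T>
    std::size_t hash(T const& t) { return std::hash<T>{}(t); }
}

// Universal entry point: the unqualified call lets ADL find a hash()
// in the argument's namespace before falling back to std::hash.
template<class T>
std::size_t hash_value(T const& t)
{
    using adl_fallback::hash;
    return hash(t);
}

namespace myapp {
    struct point { int x; int y; };

    // Free hash function in the type's own namespace, found via ADL —
    // no need to open namespace std.
    inline std::size_t hash(point const& p)
    {
        std::size_t seed = std::hash<int>{}(p.x);
        seed ^= std::hash<int>{}(p.y) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        return seed;
    }
}
```

Here hash_value(myapp::point{1, 2}) resolves to myapp::hash via ADL, while hash_value(42) or hash_value(std::string("hi")) falls through to the std::hash fallback.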

It understands vectors and arrays, recursively. Doing tuples and pairs requires a bit more work.

It samples said vectors and arrays at about 10 points.

(Note: hash_tag is both a bit of a joke and a way to force ADL and avoid having to forward-declare the hash specializations in the hash_impl namespace, because that requirement sucks.)

The price of sampling is that you could get more collisions.


Another approach, if you have a huge amount of data, is to hash it once and keep track of when it is modified. To do this, use a copy-on-write monad interface for your type that keeps track of whether the hash is up to date. Now a vector gets hashed once; if you modify it, the hash is discarded.
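A hedged sketch of that caching idea (the wrapper name and interface are invented, and this shows only the invalidation half, not full copy-on-write): all writes go through the wrapper, which discards the stored hash; reads reuse it until the next write.

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <vector>

// Same boost-style combine step as in the question.
inline void hash_combine(std::size_t& seed, std::size_t h)
{
    seed ^= h + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Hypothetical wrapper: mediates all writes so it knows when the
// cached hash becomes stale.
class hashed_vector
{
    std::vector<double> data_;
    mutable std::optional<std::size_t> cached_;
public:
    explicit hashed_vector(std::vector<double> v) : data_(std::move(v)) {}

    double operator[](std::size_t i) const { return data_[i]; }
    std::size_t size() const { return data_.size(); }

    void set(std::size_t i, double v)   // write path: invalidate the cache
    {
        data_[i] = v;
        cached_.reset();
    }

    std::size_t hash() const            // read path: compute once, then reuse
    {
        if (!cached_) {
            std::size_t seed = 0;
            std::hash<double> h;
            for (double d : data_)
                hash_combine(seed, h(d));
            cached_ = seed;
        }
        return *cached_;
    }
};
```

With this, repeated lookups in an unordered_map keyed on the hash pay the full O(n) scan only after a modification, not on every query.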

One can go further and have a random-access hash (where it is easy to predict what happens to the hash when you edit a given value), and mediate all access to the vector. That is tricky.

You could also multi-thread the hashing, but I would guess that your code is probably memory-bandwidth bound, and multi-threading won't help much there. Worth trying.

You could use a fancier structure than a flat vector (something tree-like), where changes to the values bubble up in a hash-like way to a root hash value. This would add a lg(n) overhead to all element access. Again, you'd have to wrap the raw data in controls that keep the hashing up to date (or keep track of which ranges are dirty and need to be updated).
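A rough sketch of the bubble-up idea, using a flat binary tree of combined hashes (the hash_tree name and layout are mine, not from the answer): leaves hold per-element hashes, internal nodes combine their two children, and updating one element recomputes only the O(lg n) path to the root.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Asymmetric combine of two child hashes into a parent hash.
inline std::size_t combine(std::size_t a, std::size_t b)
{
    return a ^ (b + 0x9e3779b9 + (a << 6) + (a >> 2));
}

// Segment tree of hashes in 1-based heap layout: leaves live at
// [n, 2n), node i's children are 2i and 2i+1, the root is node 1.
// Assumes a non-empty vector.
class hash_tree
{
    std::size_t n_;
    std::vector<std::size_t> node_;
public:
    explicit hash_tree(const std::vector<double>& v)
        : n_(v.size()), node_(2 * v.size())
    {
        std::hash<double> h;
        for (std::size_t i = 0; i < n_; ++i)
            node_[n_ + i] = h(v[i]);
        for (std::size_t i = n_ - 1; i >= 1; --i)
            node_[i] = combine(node_[2 * i], node_[2 * i + 1]);
    }

    // O(lg n): rehash the leaf, then bubble the change up to the root.
    void update(std::size_t i, double value)
    {
        std::size_t p = n_ + i;
        node_[p] = std::hash<double>{}(value);
        for (p /= 2; p >= 1; p /= 2)
            node_[p] = combine(node_[2 * p], node_[2 * p + 1]);
    }

    std::size_t root() const { return node_[1]; }
};
```

The root hash after an incremental update matches what a full rebuild over the modified data would produce, which is the property that makes the lg(n) write cost pay for itself.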

Finally, because you are working with 10 million elements at a time, consider moving to a strong large-scale storage solution, like a database or what have you. Using 80-megabyte keys in a map seems strange to me.
