Ways to hash a numeric vector?

Are there any known hash algorithms that take a vector of ints as input and output a single int, working similarly to an inner product?

In other words, I am thinking about a hash algorithm that might look like this in C++:

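// For simplicity, I'm not worrying about overflow, and assuming |v| < 7.
#include <cstddef>
#include <vector>

int HashVector(const std::vector<int>& v) {
  const int N = kSomethingBig;  // Placeholder for some large modulus.
  const int w[] = {234, 739, 934, 23, 828, 194};  // Carefully chosen constants.
  int result = 0;
  // Accumulate the weighted sum (an inner product of w and v), reduced mod N.
  for (std::size_t i = 0; i < v.size(); ++i) result = (result + w[i] * v[i]) % N;
  return result;
}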

I'm interested in this because I'm writing up a paper on an algorithm that would benefit from any previous work on similar hashes. In particular, it would be great if anything is known about the collision properties of a hash algorithm like this.

The algorithm I'm interested in would hash integer vectors, but something for float vectors would also be cool.

Clarification

The hash is intended for use in a hash table for fast key/value lookups. There is no security concern here.

The desired answer is something like a set of constants that provably work particularly well for a hash like this, analogous to a multiplier and modulus that work better than others in a pseudorandom number generator.

For example, some choices of constants for a linear congruential pseudorandom generator are known to give optimal cycle lengths and have easy-to-compute moduli. Maybe someone has done research showing that a certain set of multiplicative constants, along with a modulus, in a vector hash can reduce the chance of collisions among nearby integer vectors.

I did some (unpublished, practical) experiments testing a variety of string hash algorithms. (It turns out that Java's default hash function for Strings sucks.)

The easy experiment is to hash the English dictionary and compare how many collisions you get with algorithm A vs. algorithm B.

You can construct a similar experiment: randomly generate $BIG_NUMBER possible vectors of length 7 or less. Hash them with algorithm A, hash them with algorithm B, then compare the number and severity of collisions.
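A rough sketch of that experiment in Python, in case it helps. The modulus 1_000_003, the value range 0..100, and the "plain sum" baseline are all my own assumptions for illustration; the weights are the ones from the question, and I cap the length at 6 to match them.

```python
import random
from collections import Counter

WEIGHTS = (234, 739, 934, 23, 828, 194)  # constants from the question
MODULUS = 1_000_003                      # assumed stand-in for kSomethingBig


def weighted_sum_hash(v):
    """Algorithm A: the inner-product style hash from the question."""
    return sum(w * x for w, x in zip(WEIGHTS, v)) % MODULUS


def plain_sum_hash(v):
    """Algorithm B: a deliberately weak baseline that ignores element order."""
    return sum(v) % MODULUS


def count_collisions(hash_fn, vectors):
    """Count vectors that land in a bucket some earlier vector already occupies."""
    buckets = Counter(hash_fn(v) for v in vectors)
    return sum(n - 1 for n in buckets.values())


def random_vectors(count, max_len=6, lo=0, hi=100, seed=0):
    """Generate `count` distinct random integer vectors (as tuples)."""
    rng = random.Random(seed)
    vectors = set()
    while len(vectors) < count:
        n = rng.randint(1, max_len)
        vectors.add(tuple(rng.randint(lo, hi) for _ in range(n)))
    return sorted(vectors)


if __name__ == "__main__":
    vectors = random_vectors(5_000)
    for name, fn in [("weighted", weighted_sum_hash), ("plain", plain_sum_hash)]:
        print(name, count_collisions(fn, vectors))
```

To judge severity as well as count, you could look at the largest bucket size in `buckets` instead of only the total.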

Once you're able to do that, you can use simulated annealing or similar techniques to find "magic numbers" that perform well for you. In my work, for given vocabularies of interest and a tightly limited hash size, we were able to make a generic algorithm work well for several human languages by varying the "magic numbers".
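For what it's worth, here is a minimal simulated-annealing sketch of the "magic number" search, applied to the weighted-sum hash from the question rather than to the string hashes I actually tuned. The function names, the linear cooling schedule, and the default step count and temperature are assumptions picked for illustration, not tuned values.

```python
import math
import random
from collections import Counter


def collisions(weights, vectors, modulus):
    """Cost function: number of colliding vectors for a given set of weights."""
    buckets = Counter(
        sum(w * x for w, x in zip(weights, v)) % modulus for v in vectors
    )
    return sum(n - 1 for n in buckets.values())


def anneal_weights(vectors, modulus, length, steps=2000, start_temp=5.0, seed=0):
    """Search for weights with few collisions on `vectors` (hypothetical helper)."""
    rng = random.Random(seed)
    current = [rng.randrange(1, modulus) for _ in range(length)]
    current_cost = collisions(current, vectors, modulus)
    best, best_cost = list(current), current_cost
    for step in range(steps):
        # Linear cooling; the tiny offset avoids dividing by zero at the end.
        temp = start_temp * (1 - step / steps) + 1e-9
        # Neighbour move: re-draw one randomly chosen weight.
        candidate = list(current)
        candidate[rng.randrange(length)] = rng.randrange(1, modulus)
        cost = collisions(candidate, vectors, modulus)
        # Always accept improvements; accept a worse candidate with
        # probability exp(-(cost - current_cost) / temp).
        if cost <= current_cost or rng.random() < math.exp((current_cost - cost) / temp):
            current, current_cost = candidate, cost
            if cost < best_cost:
                best, best_cost = list(candidate), cost
    return best, best_cost
```

The result is only as good as the sample: weights tuned on one set of vectors are not guaranteed to do well on data that looks different.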

Depending on the size of the constants, I'd have to say the degree of chaos in the input vector will have an impact on the result. However, a quick qualitative analysis of your post suggests that you have a good start:

  • Your inputs are multiplied, increasing the degree of separation between similar input values per iteration (for instance, 65 + 66 is much smaller than 65 * 66), which is good.
  • It's deterministic, unless your vector should be considered a set and not a sequence. For clarity, should v = { 23, 30, 37 } be different from v = { 30, 23, 37 }?
  • The uniformity of the distribution will vary with the range and chaos of the input values in v. However, that's true of a generalized integer hashing algorithm as well.
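To make the set-versus-sequence point concrete, here is a small Python sketch. The modulus is an assumed stand-in for kSomethingBig, and sorting before hashing is just one possible way to get order-independence, not something from your post.

```python
WEIGHTS = (234, 739, 934, 23, 828, 194)  # constants from the question
MODULUS = 1_000_003                      # assumed stand-in for kSomethingBig


def hash_sequence(v):
    """Position-dependent hash: each element is paired with its own weight."""
    return sum(w * x for w, x in zip(WEIGHTS, v)) % MODULUS


def hash_unordered(v):
    """Order-independent variant: canonicalize by sorting before hashing."""
    return hash_sequence(sorted(v))


print(hash_sequence([23, 30, 37]))   # 62110
print(hash_sequence([30, 23, 37]))   # 58575
print(hash_unordered([30, 23, 37]))  # 62110
```

As written, the weighted sum treats the two orderings as different keys, so if they are meant to be the same key you would need something like the sorted variant.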

Out of curiosity, why not just use an existing hashing algorithm for integers and perform some interesting math on the results?

Python used to hash tuples in this manner (source):

class tuple:
    def __hash__(self):
        value = 0x345678
        for item in self:
            value = c_mul(1000003, value) ^ hash(item)
        value = value ^ len(self)
        if value == -1:
            value = -2
        return value

In your case, item would always be an integer, which uses this algorithm:

class int:
    def __hash__(self):
        value = self
        # -1 is reserved as an error code, so it is remapped to -2.
        if value == -1:
            value = -2
        return value

This has nothing to do with an inner product, though... so maybe it's not much help.
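If you want to play with that scheme, here is a runnable emulation in plain Python. The snippets above are pseudocode, so some of this is guesswork on my part: I'm assuming c_mul means a C-style signed multiplication that wraps on overflow, I'm assuming a 64-bit word, and the names old_int_hash and old_tuple_hash are made up for this sketch.

```python
def c_mul(a, b, bits=64):
    """Multiply like a C signed integer of the given width, wrapping on overflow."""
    mask = (1 << bits) - 1
    product = (a * b) & mask
    # Reinterpret the top bit as the sign bit (two's complement).
    return product - (1 << bits) if product >> (bits - 1) else product


def old_int_hash(value):
    """Integer hash from the pseudocode above: the value itself, except -1."""
    return -2 if value == -1 else value


def old_tuple_hash(items):
    """Tuple hash from the pseudocode above, restricted to integer items."""
    value = 0x345678
    for item in items:
        value = c_mul(1000003, value) ^ old_int_hash(item)
    value ^= len(items)
    return -2 if value == -1 else value


print(old_tuple_hash((23, 30, 37)))
print(old_tuple_hash((30, 23, 37)))
```

Because each step multiplies the running value before mixing in the next item, the result depends on element order, much like your position-dependent weights.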

While I might be totally misunderstanding you, maybe it's a good idea to treat the vector as a byte stream and run some known hash on it, e.g. SHA1 or MD5.

Just to clarify, those hashes are known to have good hash properties, and I believe there's no reason to reinvent the wheel and implement a new hash. Another possibility is to use a known CRC algorithm.
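A quick sketch of the byte-stream idea using Python's standard library (hashlib, struct, zlib). The serialization format (a length prefix followed by little-endian signed 32-bit ints), the choice to keep the first 8 digest bytes, and the function names are my own assumptions, so adjust them to your data.

```python
import hashlib
import struct
import zlib


def pack_vector(v):
    """Serialize ints as a length prefix plus little-endian signed 32-bit values."""
    return struct.pack(f"<I{len(v)}i", len(v), *v)


def sha1_bucket(v, buckets):
    """Hash the packed bytes with SHA-1 and reduce to a table index."""
    digest = hashlib.sha1(pack_vector(v)).digest()
    return int.from_bytes(digest[:8], "little") % buckets


def crc32_bucket(v, buckets):
    """Same idea with CRC-32, which is much cheaper to compute."""
    return zlib.crc32(pack_vector(v)) % buckets


print(sha1_bucket([23, 30, 37], 1024))
print(crc32_bucket([23, 30, 37], 1024))
```

For a float vector the same approach should work with a float format code such as `d` in place of `i`, although values that compare equal but have different bit patterns (0.0 and -0.0, for example) would then hash differently.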
