How to improve hashing for short strings to avoid collisions?

I am having a problem with hash collisions when using short strings in .NET 4.
EDIT: I am using the built-in string hashing function in .NET.

I'm implementing a cache using objects that store the direction of a conversion, like this:

public class MyClass
{
    private string _from;
    private string _to;

   // More code here....

    public MyClass(string from, string to)
    {
        this._from = from;
        this._to = to;
    }

    public override int GetHashCode()
    {
        return string.Concat(this._from, this._to).GetHashCode();
    }

    public bool Equals(MyClass other)
    {
        return this.To == other.To && this.From == other.From;
    }

    public override bool Equals(object obj)
    {
        if (obj == null) return false;
        if (this.GetType() != obj.GetType()) return false;
        return Equals(obj as MyClass);
    }
}

This is direction dependent, and the from and to values are represented by short strings like "AAB" and "ABA".

I am getting sparse hash collisions with these small strings. I have tried something simple like adding a salt (it did not work).

The problem is that too many of my small strings, like "AABABA", collide with the hash of their reverse, "ABAAAB". (Note that these are not real examples; I have no idea if AAB and ABA actually cause collisions!)

I have also gone heavy duty and implemented MD5 (which works, but is MUCH slower).

I have also implemented the suggestion from Jon Skeet here:
Should I use a concatenation of my string fields as a hash code? This works, but I don't know how dependable it is with my various 3-character strings.

How can I improve and stabilize the hashing of small strings without adding too much overhead like MD5?

EDIT: In response to a few of the answers posted... the cache is implemented using concurrent dictionaries keyed on MyClass as stubbed out above. If I replace the GetHashCode in the code above with something simple like @JonSkeet's code from the link I posted:

int hash = 17;
hash = hash * 23 + this._from.GetHashCode();
hash = hash * 23 + this._to.GetHashCode();        
return hash;

Everything functions as expected. It's also worth noting that in this particular use case the cache is not used in a multi-threaded environment, so there is no race condition.

EDIT: I should also note that this misbehavior is platform dependent. It works as intended on my fully updated Win7 x64 machine but does not behave properly on a non-updated Win7 x64 machine. I don't know the extent of what updates are missing, but I know it doesn't have Win7 SP1... so I would assume there may also be a framework SP or update that it's missing as well.

EDIT: As suggested, my issue was not caused by a problem with the hashing function. I had an elusive race condition, which is why it worked on some computers but not others, and also why a "slower" hashing method made things work properly. The answer I selected was the most useful in understanding why my problem was not hash collisions in the dictionary.

Are you sure that collisions are what is causing the problems? When you say

I finally discovered what was causing this bug

do you mean some slowness in your code, or something else? If not, I'm curious what kind of problem it is, because any hash function (except "perfect" hash functions on limited domains) will cause collisions.

I put together a quick piece of code to check for collisions among 3-letter words, and it doesn't report a single collision for them. You see what I mean? It looks like the built-in hash algorithm is not so bad.

Dictionary<int, bool> set = new Dictionary<int, bool>();
char[] buffer = new char[3];
int count = 0;
for (int c1 = (int)'A'; c1 <= (int)'z'; c1++)
{
    buffer[0] = (char)c1;
    for (int c2 = (int)'A'; c2 <= (int)'z'; c2++)
    {
        buffer[1] = (char)c2;
        for (int c3 = (int)'A'; c3 <= (int)'z'; c3++)
        {
            buffer[2] = (char)c3;
            string str = new string(buffer);
            count++;
            int hash = str.GetHashCode();
            if (set.ContainsKey(hash))
            {
                Console.WriteLine("Collision for {0}", str);
            }
            set[hash] = false;
        }
    }
}

Console.WriteLine("Generated {0} of {1} hashes", set.Count, count);

While you could pick almost any of the well-known hash functions (as David mentioned) or even choose a "perfect" hash, since it looks like your domain is limited (like a minimal perfect hash)... it would be great to understand whether the source of the problems is really collisions.

Update

What I want to say is that the .NET built-in hash function for strings is not so bad. It doesn't produce so many collisions that you would need to write your own algorithm in regular scenarios. And this doesn't depend on the length of the strings. If you have a lot of 6-symbol strings, that doesn't imply that your chances of seeing a collision are higher than with 1000-symbol strings. This is one of the basic properties of hash functions.
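
As a rough illustration of why string length does not enter into it, assume an idealized hash that spreads n distinct keys uniformly and independently over 2^32 values. This is an assumption for the sake of the estimate; the built-in .NET function is not guaranteed to behave this way. The expected number of colliding pairs is then:

```latex
\mathbb{E}[\text{colliding pairs}] \;=\; \binom{n}{2}\cdot\frac{1}{2^{32}} \;=\; \frac{n(n-1)}{2^{33}}
```

Only the number of keys n appears in the formula. For n = 10^4 keys it gives roughly 0.012 expected colliding pairs, whether the keys are 3 or 1000 characters long.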

And again, another question is: what kind of problems do you experience because of collisions? All built-in hashtables and dictionaries support collision resolution. So I would say all you can see is just... probably some slowness. Is this your problem?
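
To make the point about collision resolution concrete, here is a minimal sketch using a deliberately terrible key type. AlwaysCollidingKey and CollisionDemo are hypothetical names invented for this illustration; the only assumption is that Equals is implemented correctly.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical key type: every instance returns the same hash code,
// so every key collides with every other key.
public sealed class AlwaysCollidingKey : IEquatable<AlwaysCollidingKey>
{
    private readonly string _value;

    public AlwaysCollidingKey(string value)
    {
        _value = value;
    }

    // Worst case on purpose: all keys land in the same bucket.
    public override int GetHashCode()
    {
        return 42;
    }

    public bool Equals(AlwaysCollidingKey other)
    {
        return other != null && _value == other._value;
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as AlwaysCollidingKey);
    }
}

public static class CollisionDemo
{
    // Stores two colliding keys, then looks one of them up again.
    public static int LookupAfterCollision()
    {
        var cache = new Dictionary<AlwaysCollidingKey, int>();
        cache[new AlwaysCollidingKey("AAB")] = 1;
        cache[new AlwaysCollidingKey("ABA")] = 2;
        return cache[new AlwaysCollidingKey("ABA")];
    }
}
```

Calling CollisionDemo.LookupAfterCollision() still returns the value stored under "ABA": the dictionary falls back to Equals to tell colliding keys apart, so collisions cost time rather than correctness.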

As for your code

return string.Concat(this._from, this._to).GetHashCode(); 

This can cause problems, because on every hash code calculation you create a new string. Maybe this is what causes your issues?

int hash = 17; 
hash = hash * 23 + this._from.GetHashCode(); 
hash = hash * 23 + this._to.GetHashCode();         
return hash; 

This would be a much better approach - just because you don't create new objects on the heap. Actually, that is one of the main points of this approach: get a good hash code for an object with a complex "key" without creating new objects. So if you don't have a single-value key, this should work for you. BTW, this is not a new hash function; it is just a way to combine existing hash values without compromising the main properties of hash functions.
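
Put together as a complete key type, the combining approach might look like the sketch below. ConversionKey and the parameter names fromCode and toCode are hypothetical stand-ins for the question's MyClass; the constants 17 and 23 come from the snippet above. The unchecked block is added on the assumption that the project might be compiled with overflow checking enabled.

```csharp
using System;

// Hypothetical stand-in for the question's MyClass.
public sealed class ConversionKey : IEquatable<ConversionKey>
{
    private readonly string _from;
    private readonly string _to;

    public ConversionKey(string fromCode, string toCode)
    {
        if (fromCode == null) throw new ArgumentNullException("fromCode");
        if (toCode == null) throw new ArgumentNullException("toCode");
        _from = fromCode;
        _to = toCode;
    }

    public override int GetHashCode()
    {
        // Combines the two existing string hashes without allocating
        // a new string; overflow is expected and simply wraps around.
        unchecked
        {
            int hash = 17;
            hash = hash * 23 + _from.GetHashCode();
            hash = hash * 23 + _to.GetHashCode();
            return hash;
        }
    }

    public bool Equals(ConversionKey other)
    {
        return other != null && _from == other._from && _to == other._to;
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as ConversionKey);
    }
}
```

Two keys built from the same pair of strings are equal and share a hash code, while swapping the two strings gives a key that is not equal, which is what a direction-dependent cache needs.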

Any common hash function should be suitable for this purpose. If you're getting collisions on short strings like that, I'd say you're using an unusually bad hash function. You can use Jenkins' or Knuth's hash with no issues.

Here's a very simple hash function that should be adequate. (Implemented in C, but it should port easily to any similar language.)

unsigned int hash(const char *it)
{
 unsigned hval=0;
 while(*it!=0)
 {
  hval+=*it++;
  hval+=(hval<<10);
  hval^=(hval>>6);
  hval+=(hval<<3);
  hval^=(hval>>11);
  hval+=(hval<<15);
 }
 return hval;
}

Note that if you want to trim the bits of the output of this function, you must use the least significant bits. You can also use mod to reduce the output range. The last character of the string tends to only affect the low-order bits. If you need a more even distribution, change return hval; to return hval * 2654435761U;.
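
Since the question is about .NET, a possible C# port of the C function above is sketched here. ShortStringHash, Compute, and ComputeScrambled are hypothetical names. One difference to be aware of: a C# char is a 16-bit UTF-16 code unit rather than a C char, so the results should match the C version only for ASCII input.

```csharp
using System;

public static class ShortStringHash
{
    // Line-by-line port of the C function; uint arithmetic wraps
    // around on overflow inside the unchecked block.
    public static uint Compute(string text)
    {
        if (text == null) throw new ArgumentNullException("text");
        unchecked
        {
            uint hval = 0;
            foreach (char c in text)
            {
                hval += c;
                hval += hval << 10;
                hval ^= hval >> 6;
                hval += hval << 3;
                hval ^= hval >> 11;
                hval += hval << 15;
            }
            return hval;
        }
    }

    // Variant with the final multiplication suggested above
    // for a more even distribution.
    public static uint ComputeScrambled(string text)
    {
        unchecked
        {
            return Compute(text) * 2654435761U;
        }
    }
}
```

To use this from GetHashCode, the result would need an unchecked cast to int, for example unchecked((int)ShortStringHash.Compute(key)).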

Update:

public override int GetHashCode()
{
    return string.Concat(this._from, this._to).GetHashCode();
}

This is broken. It treats from="foot",to="ar" the same as from="foo",to="tar". Since your Equals function doesn't consider those equal, your hash function should not either. Possible fixes include:

1) Form the string from,"XXX",to and hash that. (This assumes the string "XXX" almost never appears in your input strings.)

2) Combine the hash of 'from' with the hash of 'to'. You'll have to use a clever combining function. For example, XOR or sum will cause from="foo",to="bar" to hash the same as from="bar",to="foo". Unfortunately, choosing the right combining function is not easy without knowing the internals of the hashing function. You can try:

int hc1=from.GetHashCode();
int hc2=to.GetHashCode();
return (hc1<<7)^(hc2>>25)^(hc1>>21)^(hc2<<11);
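
The two pitfalls described above can be checked directly. The sketch below uses hypothetical names (CombineDemo and its methods) and compares the strings and combined values themselves rather than relying on any particular GetHashCode output.

```csharp
using System;

public static class CombineDemo
{
    // Plain concatenation erases the boundary between the two fields,
    // so both pairs produce the identical string "footar".
    public static bool ConcatIsAmbiguous()
    {
        return string.Concat("foot", "ar") == string.Concat("foo", "tar");
    }

    // With a separator the two strings differ, assuming "XXX" does not
    // occur in the inputs themselves.
    public static bool SeparatorDisambiguates()
    {
        return string.Concat("foot", "XXX", "ar") != string.Concat("foo", "XXX", "tar");
    }

    // XOR is symmetric, so swapping the operands changes nothing.
    public static bool XorIgnoresOrder(int hc1, int hc2)
    {
        return (hc1 ^ hc2) == (hc2 ^ hc1);
    }

    // The shift-and-XOR combination from the snippet above.
    public static int ShiftCombine(int hc1, int hc2)
    {
        unchecked
        {
            return (hc1 << 7) ^ (hc2 >> 25) ^ (hc1 >> 21) ^ (hc2 << 11);
        }
    }
}
```

Because identical strings always hash identically, ConcatIsAmbiguous() returning true means the concatenation-based GetHashCode cannot separate those two pairs. ShiftCombine treats its two arguments differently, so swapping them generally gives a different result.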
