
Why do we need to check for a pattern match every time the hash value is the same in the Rabin-Karp algorithm?

I don't see why we need to check for a substring match every time the hash of the pattern and the hash of the current text window come out equal. Isn't the hash value returned unique for a string?

The hash function used in the Rabin-Karp algorithm is a "rolling hash", such as the Rabin fingerprint. It is chosen because a hash can be computed cheaply from the previous hash, not because of its collision resistance.

In the Rabin-Karp algorithm, we need to compute the hash of a sliding substring. Say, for example, that we're searching for a 24-character string in this text:

"this is the text we are comparing"

We would need to compute the hash for these substrings:

"this is the text we are "
"his is the text we are c"
"is is the text we are co"
"s is the text we are com"
" is the text we are comp"
"is the text we are compa"
"s the text we are compar"
" the text we are compari"
"the text we are comparin"
"he text we are comparing"

So we choose a "rolling hash" function where, after the hash of the first substring is computed, we can compute the hash of the second substring from the first hash, the character that is removed from the substring, and the character that is added to it:

"this is the text we are "  ->  hash1
"his is the text we are c"  ->  hash1 -t +c  ->  hash2
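
The "hash1 -t +c" notation above is schematic. As a minimal sketch of what one rolling update could look like, here is a polynomial rolling hash in Python. The constants `BASE` and `MOD` and the helper names `full_hash` and `roll` are assumptions chosen for illustration; they are not taken from the answer, and an actual Rabin fingerprint is defined differently.

```python
BASE = 256            # assumed: one "digit" per character code below 256
MOD = 1_000_000_007   # assumed: a large prime modulus

def full_hash(s: str) -> int:
    """Hash a whole string from scratch: O(len(s))."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h

def roll(prev_hash: int, out_ch: str, in_ch: str, high: int) -> int:
    """Slide the window one position: O(1).

    high must equal BASE ** (window_length - 1) % MOD.
    """
    # Remove the contribution of the character leaving the window...
    h = (prev_hash - ord(out_ch) * high) % MOD
    # ...then shift everything left and append the incoming character.
    return (h * BASE + ord(in_ch)) % MOD

text = "this is the text we are comparing"
m = 24                                        # window length from the example
high = pow(BASE, m - 1, MOD)

hash1 = full_hash(text[:m])                   # "this is the text we are "
hash2 = roll(hash1, text[0], text[m], high)   # drop 't', add 'c'

# The rolled value agrees with hashing the second window from scratch.
assert hash2 == full_hash(text[1:m + 1])
```

Each update costs a constant number of arithmetic operations, regardless of the window length.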

Such a "rolling hash" function isn't necessarily one for which finding two strings with the same hash is only a remote possibility, as it would be for a cryptographic hash function. So the fact that the hashes are equal doesn't guarantee that the substring is the same as the search string; therefore we need to do a full string comparison to be sure.
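
To make the need for the comparison concrete, here is a small sketch that uses a deliberately weak rolling hash: the sum of the character codes in the window. This hash is an assumption picked because it collides easily; real implementations normally use something stronger. The function name `rabin_karp_additive` and its `verify` flag are hypothetical, for illustration only.

```python
def rabin_karp_additive(text: str, pattern: str, verify: bool = True) -> list[int]:
    """Return the start indices where the window hash equals the pattern hash.

    With verify=True, a hash hit is reported only if the substring
    really equals the pattern.
    """
    m = len(pattern)
    if m == 0 or m > len(text):
        return []
    target = sum(map(ord, pattern))
    window = sum(map(ord, text[:m]))
    hits = []
    for i in range(len(text) - m + 1):
        if i > 0:
            # Rolling update: add the incoming character, drop the outgoing one.
            window += ord(text[i + m - 1]) - ord(text[i - 1])
        if window == target and (not verify or text[i:i + m] == pattern):
            hits.append(i)
    return hits

# "ab" and "ba" contain the same characters, so their sums are equal.
unchecked = rabin_karp_additive("abba", "ba", verify=False)
checked = rabin_karp_additive("abba", "ba", verify=True)
assert unchecked == [0, 2]   # index 0 is a false positive ("ab")
assert checked == [2]        # the full comparison filters it out
```

The full comparison only runs when the hashes already agree, so with a reasonable hash it is rarely executed on a non-match.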

Note that any hash function that produces a hash shorter than its input will necessarily have collisions. And using a hash that is much shorter than the input string is the point of the Rabin-Karp algorithm: comparing hashes is much more efficient than comparing long strings.
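
The counting argument can be checked by brute force on a toy scale. The sketch below assumes a polynomial hash with a very small modulus (`MOD = 101`, chosen only to force collisions quickly). There are 26 * 26 = 676 two-letter lowercase strings but only 101 possible hash values, so two different strings must share a value. The names `small_hash` and `first_collision` are illustrative.

```python
from itertools import product
from string import ascii_lowercase

BASE, MOD = 256, 101   # assumed toy parameters: only 101 possible hash values

def small_hash(s: str) -> int:
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h

def first_collision(length: int = 2):
    """Return the first pair of distinct strings found with equal hashes."""
    seen = {}
    for chars in product(ascii_lowercase, repeat=length):
        s = "".join(chars)
        h = small_hash(s)
        if h in seen:
            return seen[h], s
        seen[h] = s
    return None   # unreachable once 26 ** length exceeds MOD

a, b = first_collision()
assert a != b and small_hash(a) == small_hash(b)
```

The same reasoning applies at realistic sizes: with one byte per character there are 256^24 = 2^192 possible 24-character strings, far more than the 2^64 values of a 64-bit hash.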
