Search for a substring in another string using hashing
I wrote code to find a substring in another string using hashing, but it's giving me a wrong result.
A description of how the code works:
1. Store the first n powers of p = 31 in the array pows.
2. Store the hash value of each prefix s[0..i] in the array h.
3. Use the h array to compute the hash of every substring of length 9 and store it in a set.
4. Compute the hash of t and store it.
5. Compare the hash of t with the hashes in the set.
The hash h2[n2-1] should exist in the set, but it does not. Could you help me find the bug in the code?
Note: When I use the modular inverse instead of multiplying by pows[i-8], the code runs well.
#include <bits/stdc++.h>
#define m 1000000007
#define N (int)2e6 + 3
using namespace std;
long long pows[N], h[N], h2[N];
set<int> ss;
int main() {
    string s = "www.cplusplus.com/forum";
    // powers array
    pows[0] = 1;
    int n = s.length(), p = 31;
    for (int i = 1; i < n; i++) {
        pows[i] = pows[i - 1] * p;
        pows[i] %= m;
    }
    // hash from 0 to i array
    h[0] = s[0] - 'a' + 1;
    for (int i = 1; i < n; i++) {
        h[i] = h[i - 1] + (s[i] - 'a' + 1) * pows[i];
        h[i] %= m;
    }
    // storing each hash with 9 characters in a set
    ss.insert(h[8]);
    for (int i = 9; i < n; i++) {
        int tp = h[i] - h[i - 9] * pows[i - 8];
        tp %= m;
        tp += m;
        tp %= m;
        ss.insert(tp);
    }
    // print hashes with 9 characters
    set<int>::iterator itr = ss.begin();
    while (itr != ss.end()) {
        cout << *(itr++) << " ";
    }
    cout << endl;
    // t is the string that i want to check if it is exist in s
    string t = "cplusplus";
    int n2 = t.length();
    h2[0] = t[0] - 'a' + 1;
    for (int i = 1; i < n2; i++) {
        h2[i] = h2[i - 1] + (t[i] - 'a' + 1) * pows[i];
        h2[i] %= m;
    }
    // print t hash
    cout << h2[n2 - 1] << endl;
    return 0;
}
I can see two problems with your code:
1. You are storing a hash value (computed as long long) in an int variable. This could cause integer overflow, and the hash you computed would probably be incorrect.
2. For a string s = {s[0], s[1], ..., s[n-1]}, the way you're computing the hash is h = ∑ s[i] * p^i. In this case, given the prefix hashes stored in h, the difference h[r] - h[l - 1] equals p^l times the hash of the substring s[l..r] (inclusive), so that hash should be (h[r] - h[l - 1]) / p^l, instead of what you wrote. This is also why using the modular inverse (which is required to perform division under a modulus) is correct.
I think a more common way to compute hashes is the other way around, i.e. h = ∑ s[i] * p^(n-i-1). This allows you to compute the substring hash as h[r] - h[l - 1] * p^(r-l+1), which does not require computing modular inverses.
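To make the second scheme concrete, here is a minimal sketch of it, not a drop-in replacement for your program. It builds the prefix hashes as h[i] = h[i-1] * p + s[i], keeps every intermediate value in a 64-bit type, and takes the substring hash as h[r] - h[l-1] * p^(r-l+1). The names prefixHashes, substringHash and findWithHash are made up for this example. Two things differ from your code by assumption: characters are mapped to their unsigned value plus one (so '.' and '/' do not produce negative terms), and a hash match is confirmed with a direct string comparison to rule out collisions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using i64 = std::int64_t;

const i64 MOD = 1000000007;  // modulus from the question
const i64 BASE = 31;         // p = 31 from the question

// Hypothetical helper: prefix hashes with the highest power on the first
// character, h[i] = h[i-1] * BASE + value(s[i]) (mod MOD).
// Assumption: value(c) = (unsigned char)c + 1, so every term is positive;
// the question used s[i] - 'a' + 1 instead.
std::vector<i64> prefixHashes(const std::string& s) {
    std::vector<i64> h(s.size());
    i64 running = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        i64 value = static_cast<unsigned char>(s[i]) + 1;
        running = (running * BASE + value) % MOD;
        h[i] = running;
    }
    return h;
}

// Hash of s[l..r] (inclusive): h[r] - h[l-1] * BASE^(r-l+1) (mod MOD).
// Everything stays in i64, so nothing is truncated to int.
// pows[k] must hold BASE^k mod MOD for k up to r-l+1.
i64 substringHash(const std::vector<i64>& h, const std::vector<i64>& pows,
                  std::size_t l, std::size_t r) {
    assert(l <= r && r < h.size());
    if (l == 0) return h[r];
    i64 value = (h[r] - h[l - 1] * pows[r - l + 1]) % MOD;
    return (value + MOD) % MOD;  // % can give a negative result in C++
}

// Returns the first index where t occurs in s, or -1 if it does not occur.
long findWithHash(const std::string& s, const std::string& t) {
    if (t.empty() || t.size() > s.size()) return -1;
    const std::size_t m = t.size();

    std::vector<i64> pows(m + 1, 1);  // pows[k] = BASE^k mod MOD
    for (std::size_t k = 1; k <= m; ++k) pows[k] = pows[k - 1] * BASE % MOD;

    const std::vector<i64> hs = prefixHashes(s);
    const i64 target = prefixHashes(t).back();

    for (std::size_t l = 0; l + m <= s.size(); ++l) {
        // Compare the strings on a hash match to rule out collisions.
        if (substringHash(hs, pows, l, l + m - 1) == target &&
            s.compare(l, m, t) == 0) {
            return static_cast<long>(l);
        }
    }
    return -1;
}
```

With the strings from your question, findWithHash("www.cplusplus.com/forum", "cplusplus") returns 4, the index of the first 'c'.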