
Search for a substring in another string using hashing

I wrote code to find a substring in another string using hashing, but it's giving me a wrong result.

A description of how the code works:

  1. Store the first n powers of p = 31 in the array pows.
  2. Store the hash of each prefix s[0..i] in the array h.
  3. Calculate the hash of each substring of length 9 using the h array and store it in a set.
  4. Hash the string t and store its hash.
  5. Compare the hash of t with the hashes in the set.

The hash h2[n2-1] should exist in the set, but it does not. Could you help me find the bug in the code?

Note: When I use the modular inverse instead of multiplying by pows[i-8], the code works correctly.


#include <bits/stdc++.h>

#define m 1000000007
#define N (int)2e6 + 3

using namespace std;

long long pows[N], h[N], h2[N];

set<int> ss;

int main() {

    string s = "www.cplusplus.com/forum";

    // powers array
    pows[0] = 1;
    int n = s.length(), p = 31;
    for (int i = 1; i < n; i++) {
        pows[i] = pows[i - 1] * p;
        pows[i] %= m;
    }

    // hash from 0 to i array
    h[0] = s[0] - 'a' + 1;
    for (int i = 1; i < n; i++) {
        h[i] = h[i - 1] + (s[i] - 'a' + 1) * pows[i];
        h[i] %= m;
    }

    // storing each hash with 9 characters in a set
    ss.insert(h[8]);
    for (int i = 9; i < n; i++) {
        int tp = h[i] - h[i - 9] * pows[i - 8];
        tp %= m;
        tp += m;
        tp %= m;
        ss.insert(tp);
    }

    // print hashes with 9 characters
    set<int>::iterator itr = ss.begin();
    while (itr != ss.end()) {
        cout << *(itr++) << " ";
    }
    cout << endl;

    // t is the string to search for in s
    string t = "cplusplus";
    int n2 = t.length();
    h2[0] = t[0] - 'a' + 1;
    for (int i = 1; i < n2; i++) {
        h2[i] = h2[i - 1] + (t[i] - 'a' + 1) * pows[i];
        h2[i] %= m;
    }
    // print t hash
    cout << h2[n2 - 1] << endl;

    return 0;
}

I can see two problems with your code:

  1. When you're computing hashes for substrings of length 9, you're storing the intermediate result (of type long long) in an int variable. The product h[i - 9] * pows[i - 8] can be as large as about 10^18, so it gets truncated to int before the % m is applied, and the hash you compute will probably be incorrect.
  2. Given a string s = {s[0], s[1], ..., s[n-1]}, the way you're computing the hash is: h = ∑ s[i] * p^i. In this case, given the prefix hashes stored in h, the hash for a substring s[l..r] (inclusive) should be (h[r] - h[l - 1]) / p^l, instead of what you wrote, because every term of h[r] - h[l - 1] carries an extra factor of p^l compared with a hash that starts at index 0. This is also why using the modular inverse (which is required to perform division under a modulus) is correct.
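Both points can be sketched together as follows. This is only an illustration, not the asker's original fix: the names PrefixHash, mod_pow and val are made up for this example, and val maps each character to its unsigned byte value plus 1 (an assumption that differs from the question's s[i] - 'a' + 1, chosen so that characters such as '.' and '/' never produce negative values). All arithmetic stays in 64-bit integers, and the window sum is divided by p^l, the power attached to the first character of the window, using Fermat's little theorem, which applies because the modulus 1000000007 is prime.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

constexpr std::int64_t MOD = 1000000007;  // prime modulus, as in the question
constexpr std::int64_t P = 31;            // base, as in the question

// Hypothetical character mapping (assumption): byte value + 1, always positive.
std::int64_t val(char c) { return static_cast<unsigned char>(c) + 1; }

// Fast modular exponentiation: base^exp mod MOD.
std::int64_t mod_pow(std::int64_t base, std::int64_t exp) {
    std::int64_t result = 1;
    base %= MOD;
    while (exp > 0) {
        if (exp & 1) result = result * base % MOD;
        base = base * base % MOD;
        exp >>= 1;
    }
    return result;
}

struct PrefixHash {
    std::vector<std::int64_t> pows;  // pows[i] = P^i mod MOD
    std::vector<std::int64_t> h;     // h[i] = sum over j <= i of val(s[j]) * P^j, mod MOD

    explicit PrefixHash(const std::string& s) : pows(s.size()), h(s.size()) {
        std::int64_t pw = 1, acc = 0;
        for (std::size_t i = 0; i < s.size(); ++i) {
            pows[i] = pw;
            acc = (acc + val(s[i]) * pw) % MOD;
            h[i] = acc;
            pw = pw * P % MOD;
        }
    }

    // Hash of s[l..r] (inclusive), normalised as if the substring started at index 0.
    std::int64_t substring(std::size_t l, std::size_t r) const {
        assert(l <= r && r < h.size());
        std::int64_t diff = h[r] - (l > 0 ? h[l - 1] : 0);
        diff = (diff % MOD + MOD) % MOD;  // bring the difference into [0, MOD)
        // Divide by P^l: multiply by the modular inverse pows[l]^(MOD - 2).
        return diff * mod_pow(pows[l], MOD - 2) % MOD;
    }
};
```

With this sketch, the window "cplusplus" in "www.cplusplus.com/forum" occupies indices 4 through 12, so PrefixHash(s).substring(4, 12) should equal the full hash of t, that is PrefixHash(t).h[8]. Computing one inverse per window costs a logarithmic factor; precomputing the inverse powers would avoid that.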

I think a more common way to compute hashes is the other way around, i.e. h = ∑ s[i] * p^(n-i-1). This allows you to compute the substring hash as h[r] - h[l - 1] * p^(r-l+1), which does not require computing modular inverses.
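A minimal sketch of that alternative scheme is shown here, under a few assumptions that are not in the original post: the function name find_by_hash and the helper val are hypothetical, characters are mapped to their unsigned byte value plus 1, and the prefix array H is shifted by one (H[i] is the hash of the first i characters, H[0] = 0) so that the l = 0 case needs no special handling. With that shift, the formula h[r] - h[l - 1] * p^(r-l+1) becomes H[l + k] - H[l] * p^k for a window of length k starting at l. A direct string comparison on a hash match is added to rule out collisions.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

constexpr std::int64_t MOD = 1000000007;  // modulus, as in the question
constexpr std::int64_t P = 31;            // base, as in the question

// Hypothetical character mapping (assumption): byte value + 1, always positive.
std::int64_t val(char c) { return static_cast<unsigned char>(c) + 1; }

// Returns the index of the first occurrence of t in s, or std::string::npos.
std::size_t find_by_hash(const std::string& s, const std::string& t) {
    const std::size_t n = s.size(), k = t.size();
    if (k == 0) return 0;
    if (k > n) return std::string::npos;

    // H[i] = hash of the first i characters of s; the highest power of P
    // is attached to the first character, so H[i + 1] = H[i] * P + val(s[i]).
    std::vector<std::int64_t> H(n + 1, 0);
    for (std::size_t i = 0; i < n; ++i) {
        H[i + 1] = (H[i] * P + val(s[i])) % MOD;
    }

    // Hash of t, and P^k mod MOD for removing the prefix contribution.
    std::int64_t target = 0, pk = 1;
    for (std::size_t i = 0; i < k; ++i) {
        target = (target * P + val(t[i])) % MOD;
        pk = pk * P % MOD;
    }

    for (std::size_t l = 0; l + k <= n; ++l) {
        // Window hash without any modular inverse; + MOD keeps it non-negative.
        std::int64_t window = (H[l + k] - H[l] * pk % MOD + MOD) % MOD;
        // Equal hashes only suggest a match, so confirm with a real comparison.
        if (window == target && s.compare(l, k, t) == 0) return l;
    }
    return std::string::npos;
}
```

For the strings in the question, find_by_hash("www.cplusplus.com/forum", "cplusplus") returns 4. Every intermediate product stays below roughly 10^18, so it fits in a 64-bit integer, which also avoids the truncation problem from the first point.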
