Searching for a substring in another string using hashing

Question

I wrote code to find a substring in another string using hashing, but it gives me a wrong result.

A description of how the code works:

Store the first n powers of p=31 in the array pows.
Store the hash of each prefix s[0..i] in the array h.
Calculate the hash of each substring of length 9 using the h array and store it in a set.
Hash the string t and store its hash.
Compare the hash of t with the hashes in the set.

The hash h2[n2-1] should exist in the set, but it does not. Could you help me find the bug in the code?

Note: When I use the modular inverse instead of multiplying by pows[i-8], the code works correctly.


#include <bits/stdc++.h>

#define m 1000000007
#define N (int)2e6 + 3

using namespace std;

long long pows[N], h[N], h2[N];

set<int> ss;

int main() {

    string s = "www.cplusplus.com/forum";

    // powers array
    pows[0] = 1;
    int n = s.length(), p = 31;
    for (int i = 1; i < n; i++) {
        pows[i] = pows[i - 1] * p;
        pows[i] %= m;
    }

    // hash from 0 to i array
    h[0] = s[0] - 'a' + 1;
    for (int i = 1; i < n; i++) {
        h[i] = h[i - 1] + (s[i] - 'a' + 1) * pows[i];
        h[i] %= m;
    }

    // storing each hash with 9 characters in a set
    ss.insert(h[8]);
    for (int i = 9; i < n; i++) {
        int tp = h[i] - h[i - 9] * pows[i - 8];
        tp %= m;
        tp += m;
        tp %= m;
        ss.insert(tp);
    }

    // print hashes with 9 characters
    set<int>::iterator itr = ss.begin();
    while (itr != ss.end()) {
        cout << *(itr++) << " ";
    }
    cout << endl;

    // t is the string to search for in s
    string t = "cplusplus";
    int n2 = t.length();
    h2[0] = t[0] - 'a' + 1;
    for (int i = 1; i < n2; i++) {
        h2[i] = h2[i - 1] + (t[i] - 'a' + 1) * pows[i];
        h2[i] %= m;
    }
    // print t hash
    cout << h2[n2 - 1] << endl;

    return 0;
}

Answer 1

I can see two problems with your code:

When computing the hashes of the length-9 substrings, you store an intermediate result of type long long in an int variable. The product h[i - 9] * pows[i - 8] can be close to m^2, which is about 10^18, so it does not fit in an int and the stored hash will probably be incorrect.
Given a string s = {s[0], s[1], ..., s[n-1]}, you compute the hash as h = ∑ s[i] * p^i. With the prefix hashes stored in h, the hash of the substring s[l..r] (inclusive) should therefore be (h[r] - h[l - 1]) / p^l, not what you wrote: the difference h[r] - h[l - 1] equals ∑ s[i] * p^i for i from l to r, which carries an extra factor of p^l compared with the hash of the same characters starting at position 0. This is also why using the modular inverse (which is how division is performed under a modulus) gives the correct result.
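As a minimal sketch of the second point, the following keeps the question's hash definition (lowest power first) and divides each window by p^l through a precomputed table of inverse powers. The names power_mod, contains_by_hash and inv are hypothetical, and using the raw character code instead of s[i] - 'a' + 1 is an assumption made here so that characters such as '.' and '/' do not produce negative terms.

```cpp
#include <cassert>
#include <string>
#include <vector>

const long long MOD = 1000000007LL;  // prime modulus, as in the question
const long long P = 31;

// Fast modular exponentiation: base^exp mod MOD.
long long power_mod(long long base, long long exp) {
    long long result = 1;
    base %= MOD;
    while (exp > 0) {
        if (exp & 1) result = result * base % MOD;
        base = base * base % MOD;
        exp >>= 1;
    }
    return result;
}

// Returns true if some window of s with length t.size() hashes to the
// same value as t, with hash(x) = sum of x[i] * P^i (mod MOD).
bool contains_by_hash(const std::string& s, const std::string& t) {
    const int n = static_cast<int>(s.size());
    const int k = static_cast<int>(t.size());
    if (k == 0) return true;
    if (k > n) return false;

    // pows[i] = P^i and inv[i] = P^(-i); the inverse of P comes from
    // Fermat's little theorem, which is valid because MOD is prime.
    std::vector<long long> pows(n), inv(n), h(n);
    const long long inv_p = power_mod(P, MOD - 2);
    pows[0] = inv[0] = 1;
    for (int i = 1; i < n; i++) {
        pows[i] = pows[i - 1] * P % MOD;
        inv[i] = inv[i - 1] * inv_p % MOD;
    }

    // h[i] = hash of the prefix s[0..i]. Everything stays in long long.
    h[0] = static_cast<unsigned char>(s[0]);
    for (int i = 1; i < n; i++)
        h[i] = (h[i - 1] + static_cast<unsigned char>(s[i]) * pows[i]) % MOD;

    long long ht = 0;
    for (int i = 0; i < k; i++)
        ht = (ht + static_cast<unsigned char>(t[i]) * pows[i]) % MOD;

    for (int r = k - 1; r < n; r++) {
        const int l = r - k + 1;
        assert(l >= 0);
        const long long diff = (h[r] - (l > 0 ? h[l - 1] : 0) + MOD) % MOD;
        // Dividing by P^l shifts the window so that it starts at power 0.
        if (diff * inv[l] % MOD == ht) return true;
    }
    return false;
}
```

With these definitions, contains_by_hash("www.cplusplus.com/forum", "cplusplus") returns true. Equal hashes do not prove equal strings, so a direct comparison of the matching window would be needed to rule out collisions.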

I think the more common way to compute the hash is the other way around, i.e. h = ∑ s[i] * p^(n-i-1). The prefix hashes then satisfy h[i] = h[i - 1] * p + s[i], and the substring hash is h[r] - h[l - 1] * p^(r-l+1), which does not require computing modular inverses.
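The highest-power-first variant can be sketched as follows. This is an illustration under stated assumptions rather than the asker's code: the names window_hash and find_by_hash are hypothetical, the prefix array pre is indexed by length (pre[i] covers s[0..i-1]) so that no l - 1 special case is needed, raw character codes are used as the character values, and a direct string comparison is added to rule out hash collisions.

```cpp
#include <cassert>
#include <string>
#include <vector>

const long long MOD = 1000000007LL;
const long long BASE = 31;

// Hash of the window s[l .. l+len-1], given length-indexed prefix hashes:
// pre[l + len] - pre[l] * BASE^len (mod MOD). No modular inverse is needed.
long long window_hash(const std::vector<long long>& pre,
                      const std::vector<long long>& pows, int l, int len) {
    assert(l >= 0 && len >= 0 && l + len < static_cast<int>(pre.size()));
    return ((pre[l + len] - pre[l] * pows[len]) % MOD + MOD) % MOD;
}

// Returns the index of the first occurrence of t in s, or -1 if there is none.
int find_by_hash(const std::string& s, const std::string& t) {
    const int n = static_cast<int>(s.size());
    const int k = static_cast<int>(t.size());
    if (k == 0) return 0;
    if (k > n) return -1;

    // pows[i] = BASE^i, pre[i] = hash of the first i characters of s.
    std::vector<long long> pows(n + 1), pre(n + 1);
    pows[0] = 1;
    pre[0] = 0;
    for (int i = 0; i < n; i++) {
        pows[i + 1] = pows[i] * BASE % MOD;
        pre[i + 1] = (pre[i] * BASE + static_cast<unsigned char>(s[i])) % MOD;
    }

    // Hash of the pattern, computed with the same recurrence.
    long long ht = 0;
    for (char c : t)
        ht = (ht * BASE + static_cast<unsigned char>(c)) % MOD;

    for (int l = 0; l + k <= n; l++) {
        // Compare hashes first, then confirm with a real comparison.
        if (window_hash(pre, pows, l, k) == ht && s.compare(l, k, t) == 0)
            return l;
    }
    return -1;
}
```

For example, find_by_hash("www.cplusplus.com/forum", "cplusplus") returns 4, the index just after "www.".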

Solution 1
0 Accepted 2020-03-26 17:29:58
