简体   繁体   English

Rabin Karp字符串匹配算法

[英]Rabin Karp string matching algorithm

I've seen this Rabin Karp string matching algorithm in the forums on the website and I'm interested in trying to implement it but I was wondering If anyone could tell me why the variables ulong Q and ulong D are 100007 and 256 respectively :S? 我在网站的论坛上看到了这个Rabin Karp字符串匹配算法,我有兴趣尝试实现它,但我想知道如果有人能告诉我为什么变量ulong Q和ulong D分别是100007和256:S ? What significance do these values carry with them? 这些价值观带有什么意义?

static void Main(string[] args)
{
    string A = "String that contains a pattern.";
    string B = "pattern";
    ulong siga = 0;
    ulong sigb = 0;
    ulong Q = 100007;
    ulong D = 256;
    for (int i = 0; i < B.Length; i++)
    {
        siga = (siga * D + (ulong)A[i]) % Q;
        sigb = (sigb * D + (ulong)B[i]) % Q;
    }
    if (siga == sigb)
    {
        Console.WriteLine(string.Format(">>{0}<<{1}", A.Substring(0, B.Length), A.Substring(B.Length)));
        return;
    }
    ulong pow = 1;
    for (int k = 1; k <= B.Length - 1; k++)
        pow = (pow * D) % Q;

    for (int j = 1; j <= A.Length - B.Length; j++)
    {
        siga = (siga + Q - pow * (ulong)A[j - 1] % Q) % Q;
        siga = (siga * D + (ulong)A[j + B.Length - 1]) % Q;
        if (siga == sigb)
        {
            if (A.Substring(j, B.Length) == B)
            {
                Console.WriteLine(string.Format("{0}>>{1}<<{2}", A.Substring(0, j),
                                                                    A.Substring(j, B.Length),
                                                                    A.Substring(j + B.Length)));
                return;
            }
        }
    }
    Console.WriteLine("Not copied!");
}

About the magic numbers Paul's answer is pretty clear. 关于神奇的数字保罗的回答非常明确。

As far as the code is concerned, Rabin Karp's principal idea is to perform an hash comparison between a sliding portion of the string and the pattern. 就代码而言,Rabin Karp的主要思想是在字符串的滑动部分和模式之间执行哈希比较。

The hash cannot be computed each time on the whole substrings, otherwise the computation complexity would be quadratic O(n^2) instead of linear O(n) . 每次在整个子串上都不能计算散列,否则计算复杂度将是二次O(n^2)而不是线性O(n)

Therefore, a rolling hash function is applied, such as at each iteration only one character is needed to update the hash value of the substring. 因此,应用滚动散列函数,例如在每次迭代时,仅需要一个字符来更新子串的散列值。

So, let's comment your code: 那么,让我们评论你的代码:

for (int i = 0; i < B.Length; i++)
{
    siga = (siga * D + (ulong)A[i]) % Q;
    sigb = (sigb * D + (ulong)B[i]) % Q;
}
if (siga == sigb)
{
    Console.WriteLine(string.Format(">>{0}<<{1}", A.Substring(0, B.Length), A.Substring(B.Length)));
    return;
}

^ This piece computes the hash of pattern B ( sigb ), and the hashcode of the initial substring of A of the same length of B . ^这个片段计算模式Bsigb )的散列,以及相同长度BA的初始子串的散列码。 Actually it's not completely correct because hash can collide¹ and so, it is necessary to modify the if statement : if (siga == sigb && A.Substring(0, B.Length) == B) . 实际上它并不完全正确,因为哈希可以碰撞¹所以,有必要修改if语句: if (siga == sigb && A.Substring(0, B.Length) == B)

ulong pow = 1;
for (int k = 1; k <= B.Length - 1; k++)
    pow = (pow * D) % Q;

^ Here's computed pow that is necessary to perform the rolling hash. ^这是计算执行滚动哈希所必需的pow

for (int j = 1; j <= A.Length - B.Length; j++)
{
    siga = (siga + Q - pow * (ulong)A[j - 1] % Q) % Q;
    siga = (siga * D + (ulong)A[j + B.Length - 1]) % Q;
    if (siga == sigb)
    {
        if (A.Substring(j, B.Length) == B)
        {
            Console.WriteLine(string.Format("{0}>>{1}<<{2}", A.Substring(0, j),
                                                                A.Substring(j, B.Length),
                                                                A.Substring(j + B.Length)));
            return;
        }
    }
}

^ Finally, the remaining string (ie from the second character to end), is scanned updating the hash value of the A substring and compared with the hash of B (computed at the beginning). ^最后,扫描剩余的字符串(即从第二个字符到结尾),更新A子串的哈希值,并与B的哈希值(在开头计算)进行比较。

If the two hashes are equal, the substring and the pattern are compared¹ and if they're actually equal a message is returned. 如果两个哈希值相等,则比较子字符串和模式¹,如果它们实际上相等则返回一个消息。


¹ Hash values can collide ; ¹ 哈希值可能会发生碰撞 ; hence, if two strings have different hash values they're definitely different, but if the two hashes are equal they can be equal or not. 因此,如果两个字符串具有不同的散列值,它们肯定是不同的,但如果两个散列相等则它们可以相等或不相同。

The algorithm uses hashing for fast string comparison. 该算法使用散列进行快速字符串比较。 Q and D are magic numbers that the coder probably arrived at with a little bit of trial and error and give a good distribution of hash values for this particular algorithm. Q和D是编码器可能通过一些试验和错误得到的神奇数字,并且为这个特定算法提供了良好的散列值分布

You can see these types of magic numbers used for hashing many places. 您可以看到用于散列许多地方的这些类型的幻数。 The example below is the decompiled definition of the GetHashCode function of a .NET 2.0 string type: 下面的示例是.NET 2.0字符串类型的GetHashCode函数的反编译定义:

[ReliabilityContract(Consistency.WillNotCorruptState, Cer.MayFail)]
public override unsafe int GetHashCode()
{
    char* chrPointer = null;
    int num1;
    int num2;
    fixed (string str = (string)this)
    {
        num1 = 352654597;
        num2 = num1;
        int* numPointer = chrPointer;
        for (int i = this.Length; i > 0; i = i - 4)
        {
            num1 = (num1 << 5) + num1 + (num1 >> 27) ^ numPointer;
            if (i <= 2)
            {
                break;
            }
            num2 = (num2 << 5) + num2 + (num2 >> 27) ^ numPointer + (void*)4;
            numPointer = numPointer + (void*)8;
        }
    }
    return num1 + num2 * 1566083941;
}

Here is another example from a R# generated GetHashcode override function for a sample type: 以下是R#为样本类型生成的GetHashcode覆盖函数的另一个示例:

    public override int GetHashCode()
    {
        unchecked
        {
            int result = (SomeStrId != null ? SomeStrId.GetHashCode() : 0);
            result = (result*397) ^ (Desc != null ? Desc.GetHashCode() : 0);
            result = (result*397) ^ (AnotherId != null ? AnotherId.GetHashCode() : 0);
            return result;
        }
    }

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM