What's wrong with my implementation of the KMP algorithm?

static void Main(string[] args)
{
    string str = "ABC ABCDAB ABCDABCDABDE";//We should add some text here for 
                                           //the performance tests.

    string pattern = "ABCDABD";


    List<int> shifts = new List<int>();

    Stopwatch stopWatch = new Stopwatch();

    stopWatch.Start();
    NaiveStringMatcher(shifts, str, pattern);
    stopWatch.Stop();
    Trace.WriteLine(String.Format("Naive string matcher {0}", stopWatch.Elapsed));

    foreach (int s in shifts)
    {
        Trace.WriteLine(s);
    }

    shifts.Clear();
    stopWatch.Restart();

    int[] pi = new int[pattern.Length];
    Knuth_Morris_Pratt(shifts, str, pattern, pi);
    stopWatch.Stop();
    Trace.WriteLine(String.Format("Knuth_Morris_Pratt {0}", stopWatch.Elapsed));

    foreach (int s in shifts)
    {
        Trace.WriteLine(s);
    }

    Console.ReadKey();
}

static IList<int> NaiveStringMatcher(List<int> shifts, string text, string pattern)
{
    int lengthText = text.Length;
    int lengthPattern = pattern.Length;

    for (int s = 0; s < lengthText - lengthPattern + 1; s++ )
    {
        if (text[s] == pattern[0])
        {
            int i = 0;
            while (i < lengthPattern)
            {
                if (text[s + i] == pattern[i])
                    i++;
                else break;
            }
            if (i == lengthPattern)
            {
                shifts.Add(s);                        
            }
        }
    }

    return shifts;
}

static IList<int> Knuth_Morris_Pratt(List<int> shifts, string text, string pattern, int[] pi)
{

    int patternLength = pattern.Length;
    int textLength = text.Length;            
    //ComputePrefixFunction(pattern, pi);

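    // Build the prefix (failure) table: pi[i] is the length of the longest
    // proper prefix of pattern[0..i] that is also a suffix of it.
    // pi[0] stays 0 from the array's default initialization.
    int j = 0;

    for (int i = 1; i < pi.Length; i++)
    {
        // On a mismatch, fall back to the next shorter border instead of
        // restarting from zero and skipping the current position.
        while (j > 0 && pattern[i] != pattern[j])
            j = pi[j - 1];

        if (pattern[i] == pattern[j])
            j++;

        pi[i] = j;
    }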

    int matchedSymNum = 0;

    for (int i = 0; i < textLength; i++)
    {
        while (matchedSymNum > 0 && pattern[matchedSymNum] != text[i])
            matchedSymNum = pi[matchedSymNum - 1];

        if (pattern[matchedSymNum] == text[i])
            matchedSymNum++;

        if (matchedSymNum == patternLength)
        {
            shifts.Add(i - patternLength + 1);
            matchedSymNum = pi[matchedSymNum - 1];
        }

    }

    return shifts;
}

Why is my implementation of the KMP algorithm slower than the naive string matching algorithm?

The KMP algorithm has two phases: first it builds a table, and then it does a search, directed by the contents of the table.
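To make the table-building phase concrete, here is a minimal sketch of the usual prefix (failure) table construction. The names `PrefixTableSketch` and `BuildPrefixTable` are made up for illustration and do not appear in the question's code.

```csharp
using System;

// Hypothetical helper, named for illustration only.
static class PrefixTableSketch
{
    // Returns pi, where pi[i] is the length of the longest proper prefix of
    // pattern[0..i] that is also a suffix of pattern[0..i].
    public static int[] BuildPrefixTable(string pattern)
    {
        int[] pi = new int[pattern.Length];
        int matched = 0; // length of the border currently being extended

        for (int i = 1; i < pattern.Length; i++)
        {
            // On a mismatch, fall back to the next shorter border.
            while (matched > 0 && pattern[i] != pattern[matched])
                matched = pi[matched - 1];

            if (pattern[i] == pattern[matched])
                matched++;

            pi[i] = matched;
        }

        return pi;
    }
}
```

For the pattern in the question, `BuildPrefixTable("ABCDABD")` yields `0, 0, 0, 0, 1, 2, 0`. The search phase then uses these values to decide how far to fall back after a mismatch.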

The naive algorithm has one phase: it does a search. It does that search much less efficiently in the worst case than the KMP search phase.

If KMP is slower than the naive algorithm then that is probably because building the table is taking you longer than it takes to simply search the string naively in the first place. Naive string matching is usually very fast on short strings. There is a reason why we don't use fancy-pants algorithms like KMP inside the BCL implementations of string searching. By the time you set up the table, you could have done half a dozen searches of short strings with the naive algorithm.

KMP is only a win if you have enormous strings and you are doing lots of searches that allow you to re-use an already-built table. You need to amortize away the huge cost of building the table by doing lots of searches using that table.
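One way to get that amortization is to build the table once and keep it next to the pattern. The sketch below assumes a hypothetical `KmpMatcher` class (not part of the question or of the BCL) that pays the table cost in its constructor and reuses it on every `FindAll` call.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, named for illustration only.
sealed class KmpMatcher
{
    private readonly string pattern;
    private readonly int[] pi;

    // The table is built exactly once, here.
    public KmpMatcher(string pattern)
    {
        if (string.IsNullOrEmpty(pattern))
            throw new ArgumentException("Pattern must be non-empty.", nameof(pattern));

        this.pattern = pattern;
        pi = new int[pattern.Length];
        for (int i = 1, k = 0; i < pattern.Length; i++)
        {
            while (k > 0 && pattern[i] != pattern[k]) k = pi[k - 1];
            if (pattern[i] == pattern[k]) k++;
            pi[i] = k;
        }
    }

    // Each call only runs the search phase; the table is reused as-is.
    public List<int> FindAll(string text)
    {
        var shifts = new List<int>();
        int matched = 0;

        for (int i = 0; i < text.Length; i++)
        {
            while (matched > 0 && pattern[matched] != text[i])
                matched = pi[matched - 1];

            if (pattern[matched] == text[i])
                matched++;

            if (matched == pattern.Length)
            {
                shifts.Add(i - pattern.Length + 1);
                matched = pi[matched - 1]; // keep going to find overlapping matches
            }
        }

        return shifts;
    }
}
```

With this shape, `new KmpMatcher("ABCDABD")` is constructed once and `FindAll` can then be called on as many texts as you like. Whether that beats the naive search still depends on the inputs, so measure it.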

And also, the naive algorithm only has bad performance in bizarre and unlikely scenarios. Most people are searching for words like "London" in strings like "Buckingham Palace, London, England", and not searching for strings like "BANANANANANANA" in strings like "BANAN BANBAN BANBANANA BANAN BANAN BANANAN BANANANANANANANANAN...". The naive search algorithm is optimal for the first problem and highly sub-optimal for the latter problem; but it makes sense to optimize for the former, not the latter.

Another way to put it: if the searched-for string is of length w and the searched-in string is of length n, then KMP is O(n) + O(w). The naive algorithm is worst case O(nw), best case O(n + w). But that says nothing about the "constant factor"! The constant factor of the KMP algorithm is much larger than the constant factor of the naive algorithm. The value of n has to be awfully big, and the number of sub-optimal partial matches has to be awfully large, for the KMP algorithm to win over the blazingly fast naive algorithm.
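To see the gap between the typical and the worst case, you can count character comparisons instead of timing anything. This is only a sketch: `ComparisonCountSketch` and `CountNaiveComparisons` are hypothetical names, and the loop is a plain naive search rather than the exact code from the question.

```csharp
using System;

// Hypothetical helper, named for illustration only.
static class ComparisonCountSketch
{
    // Counts the character comparisons made by a plain naive search
    // that tries every shift of the pattern against the text.
    public static long CountNaiveComparisons(string text, string pattern)
    {
        long comparisons = 0;

        for (int s = 0; s + pattern.Length <= text.Length; s++)
        {
            for (int i = 0; i < pattern.Length; i++)
            {
                comparisons++;
                if (text[s + i] != pattern[i])
                    break; // mismatch: move on to the next shift
            }
        }

        return comparisons;
    }
}
```

Searching for "London" in "Buckingham Palace, London, England" costs roughly one comparison per text character, because almost every shift fails on its first character. Searching for `new string('A', 9) + "B"` in `new string('A', 1000)` costs ten comparisons at every shift, which is the O(nw) behavior that KMP avoids.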

That deals with the algorithmic complexity issues. Your methodology is also not very good, and that might explain your results. Remember, the first time you run code, the jitter has to jit the IL into machine code. That can take longer than running the method in some cases. You really should be running the code a few hundred thousand times in a loop, discarding the first result, and taking an average of the timings of the rest.
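A rough sketch of that methodology follows. `BenchmarkSketch` and `MeasureAverage` are hypothetical names. The helper runs the delegate once untimed so the JIT cost lands in the discarded run, then averages over the remaining iterations.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper, named for illustration only.
static class BenchmarkSketch
{
    // Runs the action once to trigger JIT compilation and discards that run,
    // then returns the mean time per call over the measured iterations.
    public static TimeSpan MeasureAverage(Action action, int iterations)
    {
        if (iterations <= 0)
            throw new ArgumentOutOfRangeException(nameof(iterations));

        action(); // warm-up run, not timed

        Stopwatch stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            action();
        stopwatch.Stop();

        return TimeSpan.FromTicks(stopwatch.Elapsed.Ticks / iterations);
    }
}
```

With the question's methods this could be called as `BenchmarkSketch.MeasureAverage(() => NaiveStringMatcher(new List<int>(), str, pattern), 100000)`, and likewise for `Knuth_Morris_Pratt`. It is still a crude harness, since GC pauses and the delegate call itself add noise, so treat the numbers as indicative only.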

If you really want to know what is going on you should be using a profiler to determine what the hot spot is. Again, make sure you are measuring the post-jit run, not the run where the code is jitted, if you want to have results that are not skewed by the jit time.

Your example is too small, and it does not contain enough of the repeated partial matches where KMP avoids backtracking.

KMP can be slower than the normal search in some cases.

A simple KMPSubstringSearch implementation:

https://github.com/bharathkumarms/AlgorithmsMadeEasy/blob/master/AlgorithmsMadeEasy/KMPSubstringSearch.cs

using System;
using System.Collections.Generic;
using System.Linq;

namespace AlgorithmsMadeEasy
{
    class KMPSubstringSearch
    {
        public void KMPSubstringSearchMethod()
        {
            string text = System.Console.ReadLine();
            char[] sText = text.ToCharArray();

            string pattern = System.Console.ReadLine();
            char[] sPattern = pattern.ToCharArray();

            int forwardPointer = 1;
            int backwardPointer = 0;

            int[] tempStorage = new int[sPattern.Length];
            tempStorage[0] = 0;

            while (forwardPointer < sPattern.Length)
            {
                if (sPattern[forwardPointer].Equals(sPattern[backwardPointer]))
                {
                    tempStorage[forwardPointer] = backwardPointer + 1;
                    forwardPointer++;
                    backwardPointer++;
                }
                else
                {
                    if (backwardPointer == 0)
                    {
                        tempStorage[forwardPointer] = 0;
                        forwardPointer++;
                    }
                    else
                    {
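                        // Fall back to the border of the prefix matched so far.
                        // Indexing with backwardPointer itself can loop forever
                        // (for example on the pattern "aab").
                        backwardPointer = tempStorage[backwardPointer - 1];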
                    }

                }
            }

            int pointer = 0;
            int successPoints = sPattern.Length;
            bool success = false;
            for (int i = 0; i < sText.Length; i++)
            {
                if (sText[i].Equals(sPattern[pointer]))
                {
                    pointer++;
                }
                else
                {
                    if (pointer != 0)
                    {
                        int tempPointer = pointer - 1;
                        pointer = tempStorage[tempPointer];
                        i--;
                    }
                }

                if (successPoints == pointer)
                {
                    success = true;
                    break; // stop here: sPattern[pointer] would be out of range on the next pass
                }
            }

            if (success)
            {
                System.Console.WriteLine("TRUE");
            }
            else
            {
                System.Console.WriteLine("FALSE");
            }
            System.Console.Read();
        }
    }
}

/*
 * Sample Input
abxabcabcaby
abcaby 
*/
