简体   繁体   English

2模式字符串匹配算法

[英]Algorithm for 2 Pattern String Matching

I need to code an algorithm for longest two pattern prefix/suffix match whose time complexity is O(n+m1+m2), where n is the length of the String and m1, m2 are the lengths of pattern1 and pattern2 respectively. 我需要为最长的两个模式前缀/后缀匹配编码算法,其时间复杂度为O(n + m1 + m2),其中n是String的长度,m1,m2分别是pattern1和pattern2的长度。

Example: if the String is "OBANTAO" and Pattern1 is "BANANA" and Patten2 is "SIESTA" then the answer is the substring "BANTA" of String that consists of the prefix BAN of BANANA and the suffix TA of SIESTA. 示例:如果字符串为“OBANTAO”且Pattern1为“BANANA”且Patten2为“SIESTA”,则答案为字符串的子字符串“BANTA”,其由BANANA的前缀BAN和SIESTA的后缀TA组成。

Results from Google are: "Rabin-karp string search algorithm", "Knuth-morris-pratt algorithm" and "Boyer-moore string search algorithm". 谷歌的结果是:“Rabin-karp字符串搜索算法”,“Knuth-morris-pratt算法”和“Boyer-moore字符串搜索算法”。

I'm able to understand all the above 3 algorithms, but the problem is that, they are all based on "single pattern prefix/suffix matching". 我能够理解以上所有3种算法,但问题在于,它们都基于“单一模式前缀/后缀匹配”。 I'm unable to extend them for two pattern prefix/suffix match. 我无法为两个模式前缀/后缀匹配扩展它们。

A sample algorithm or a link towards searching it would be greatly helpful for me in developing the program. 一个示例算法或搜索它的链接对我开发程序非常有帮助。

Knuth--Morris--Pratt can be modified straightforwardly to yield, for each position in the haystack string, the length of the longest prefix of the needle string with a match ending at that position. Knuth - Morris - Pratt可以直接修改,为干草堆字符串中的每个位置产生针串的最长前缀的长度,其匹配结束于该位置。 Use KMP to compute this information for Pat1 in String and reverse(Pat2) in reverse(String), then iterate over each position in String looking for the maximum prefix/suffix length. 使用KMP以字符串形式计算Pat1的此信息,并以反向(字符串)反向(Pat2),然后遍历String中的每个位置,查找最大前缀/后缀长度。

Example with String = "OBANTAO" and Pat1 = "BANANA" and Pat2 = "SIESTA": String =“OBANTAO”和Pat1 =“BANANA”和Pat2 =“SIESTA”的示例:

"BANANA" = Pat1 in String
 O B A N T A O
^ ^ ^ ^ ^ ^ ^ ^
| | | | | | | |
| | | | | | | 0 ("")
| | | | | | 0 ("")
| | | | | 0 ("")
| | | | 3 ("BAN")
| | | 2 ("BA")
| | 1 ("B")
| 0 ("")
0 ("")

"ATSEIS" = reverse(Pat2) in reverse(String)
 O A T N A B O
^ ^ ^ ^ ^ ^ ^ ^
| | | | | | | |
| | | | | | | 0 ("")
| | | | | | 0 ("")
| | | | | 1 ("A")
| | | | 0 ("")
| | | 2 ("AT")
| | 1 ("A")
| 0 ("")
0 ("")

Reverse the second array and sum componentwise. 反转第二个数组并按分数求和。

  0 0 1 2 3 0 0 0
+ 0 0 1 0 2 1 0 0
-----------------
  0 0 2 2 5 1 0 0
          ^
          |
         max (argument = "BAN" ++ reverse("AT"))

I tried to Implement @David Eisenstat solution in Java. 我尝试用Java实现@David Eisenstat解决方案。 It is in O(2n) time and O(2(n+1)) Auxiliary Spaces 它在O(2n)时间和O(2(n + 1))辅助空间

String prefix = "BANANA";
    String suffix = "SIESTA";
    String input = "SABANANQS";

    // Prepare Inputs
    char[] prefixArray = prefix.toCharArray();
    char[] suffixArray = suffix.toCharArray();
    char[] inputArray = input.toCharArray();
    int inputLength = inputArray.length;
    int suffixLength = suffixArray.length;
    int prefixLength = prefixArray.length;
    // Auxiliary Spaces O(2(n+1))
    int[] prefixIndexs = new int[inputLength+1];
    int[] suffixIndexs = new int[inputLength+1];

    int m = 0;
    int n = 0;
    // O(1)
    for (int i = 0; i < inputLength; i++) {
        if (inputArray[i] == prefixArray[m]){
            m = m+1;
            prefixIndexs[i+1] = m;
            if (m == prefixLength) {
                m = 0;
            }
        }else{
            m = 0;
        }
        if (inputArray[inputLength-1-i] == suffixArray[suffixLength-1-n]){   // Reverse order or input and reverse oder of suffix
            n = n +1;
            suffixIndexs[i+1] = n;
            if (n == suffixLength) {
                n = 0;
            }
        }else{
            n = 0;
        }
    }

    int currmax =0;
    int mIndex = 0; // prefix Index from start
    int nIndex = 0; // suffix Index from End
    //O(1)  - Do Sum and find the max
    for (int i = 0; i < (inputLength+1); i++) {
        m = prefixIndexs[i];
        n = suffixIndexs[inputLength-i];
        if ( m != 0 && n != 0){  // if both prefix and suffix exists
            if (m+n > currmax){
                currmax = (m+n);
                mIndex = m;
                nIndex = n;
            }
        }
    }

    System.out.println("Input :"+input);
    System.out.println("prefix :"+prefix);
    System.out.println("suffix :"+suffix);
    System.out.println("max :"+currmax);
    System.out.println("mIndex :"+mIndex);
    System.out.println("nIndex :"+nIndex);
    System.out.println(prefix.substring(0,mIndex)+suffix.substring(suffix.length() - nIndex,suffix.length()));

Instead of reset m and n to 0, we can keep an another array for each to implement KMP algorithm , since the input prefix and sufix doesnt has repetitive char sequence i left it 而不是将m和n重置为0,我们可以为每个实现KMP算法保留另一个数组,因为输入前缀和sufix没有重复的char序列我离开了它

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM