
How to count matching characters in subsequent substrings in O(n)

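How can I count the matching characters in subsequent substrings in O(n)? The substrings are formed by removing one character at a time from the beginning of the string, and each one is compared with the original string from its first character.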

For example: the given string is ababcabab and the expected result is 8.

  • Substr1: babcabab Count: 0

  • Substr2: abcabab Count: 2, because the first two characters match the original string and the third does not, so the comparison stops there

  • Substr3: bcabab Count: 0

  • Substr4: cabab Count: 0

  • Substr5: abab Count: 4

  • Substr6: bab Count: 0

  • Substr7: ab Count: 2

  • Substr8: b Count: 0

Expected result: 2 + 4 + 2 = 8

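You can build a suffix array (and LCP array) in O(n), for example from a suffix tree constructed with Ukkonen's algorithm, and then find the answer with one more O(n) pass. Locate the original string in the suffix array and sum the LCP values around it, taking the running minimum as you move away from its position in either direction: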

        LCP SA  suffix
        0   9   .
        0   7   ab.
    >   2   5   abab.
    >   4   0   ababcabab.
    >   2   2   abcabab.
        0   8   b.
        1   6   bab.
        3   1   babcabab.
        1   3   bcabab.
        0   4   cabab.
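
As a rough illustration of the suffix-array idea, here is a minimal Java sketch. It is not linear time: it builds the suffix array by plain sorting and computes each LCP value by direct comparison, purely to keep the example short. The class and method names (`SuffixArrayCount`, `countMatches`, `lcp`) are made up for this sketch. The part that carries over to a real O(n) construction is the final step: starting at the rank of the whole string, walk outward in both directions while keeping a running minimum of the LCP values.

```java
import java.util.Arrays;

class SuffixArrayCount {
    // Length of the common prefix of the suffixes starting at a and b.
    static int lcp(String s, int a, int b) {
        int k = 0;
        while (a + k < s.length() && b + k < s.length()
                && s.charAt(a + k) == s.charAt(b + k)) {
            k++;
        }
        return k;
    }

    static long countMatches(String s) {
        int n = s.length();
        Integer[] sa = new Integer[n];
        for (int i = 0; i < n; i++) sa[i] = i;
        // Assumption for brevity: naive construction by sorting suffixes (not O(n)).
        Arrays.sort(sa, (a, b) -> s.substring(a).compareTo(s.substring(b)));

        // lcpArr[i] = common prefix length of the sorted suffixes i-1 and i.
        int[] lcpArr = new int[n];
        int pos = 0; // rank of the whole string (suffix 0) in the suffix array
        for (int i = 0; i < n; i++) {
            if (i > 0) lcpArr[i] = lcp(s, sa[i - 1], sa[i]);
            if (sa[i] == 0) pos = i;
        }

        long total = 0;
        // Walk toward smaller suffixes; the running minimum is the match length.
        int min = Integer.MAX_VALUE;
        for (int i = pos; i > 0 && min > 0; i--) {
            min = Math.min(min, lcpArr[i]);
            total += min;
        }
        // Walk toward larger suffixes in the same way.
        min = Integer.MAX_VALUE;
        for (int i = pos + 1; i < n && min > 0; i++) {
            min = Math.min(min, lcpArr[i]);
            total += min;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(countMatches("ababcabab"));
    }
}
```

For the example string the walk picks up 4 and 2 above the original string and 2 below it, which are the three rows marked in the table.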

Using a for loop (Java in this example). Note that this straightforward version is O(n²) in the worst case, not O(n):

    public class MatchCount {
        public static void main(String[] args) {
            String s = "ababcabab";
            int count = 0;
            // Try every proper suffix; start at 1 so the string is not compared with itself.
            for (int i = 1; i < s.length(); i++) {
                String sub = s.substring(i);
                // Stop at the first mismatch or when the end of the suffix is reached.
                for (int j = 0; j < sub.length() && sub.charAt(j) == s.charAt(j); j++) {
                    count++; // one more matching character in a row
                }
            }
            System.out.println("The count is: " + count);
        }
    }

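We can solve this in O(n) by drawing a few logical conclusions. Since all the matches are against the same thing, namely the string itself, any match starting at string index i will include all the matches (or as much of them as its length allows) that started before i. Furthermore, any match whose length is greater than its starting index will include a repetition of the start of the string up to the start of the match. We only need to fully record the matches we can find in one traversal of the string, without looking back, and then deduce the rest.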

Examples (1-based indexes):

"aaaaaa":
Starting on index 2, we have a match length 5. This match necessarily includes
a match of length 4 starting on index 3 (since index 3 is index 2 for the
substring that starts on index 2). Continuing the same logic, we add 3 + 2 + 1
for a total of 15, without needing to scan and compare more than Substr2.

"aabaabaa":
Starting on index 2, we have a match length 1.
Starting on index 4, we have a match length 5. This match necessarily includes
a match of length 1 starting on index 5 (since index 5 is index 2 for the
substring that starts on index 4). It also necessarily includes a match of 
length (5 - 3) starting on index 7 (since index 7 is index 4 for the substring
that starts on index 4), and this match implies another match of length 1, 
starting on index 8. Altogether 1 + 5 + 1 + (5 - 3) + 1 = 10. Again, the scan
was O(n).

"aabaabaabaabaa":
Starting on index 2, we have a match length 1.
Starting on index 4, we have a match length 11.
1 + 11 + 1 + (11 - 3) + 1 + (8 - 3) + 1 + (5 - 3) + 1 = 31.

"aabaaab":
Starting on index 2, we have a match length 1.
For repeated patterns in the beginning of the string, we can use a formula
rather than multiple scans, so a string like "aabaaaaaaaaaab" would have the
same complexity as the one above. The formula is: (number of times the pattern
repeats - number of times the pattern repeats in the beginning of the string)
* total length of the repeated pattern at the start of the string. We identify
a pattern if the length of the first match is a multiple of its starting
index. Identifying this pattern and using the formula also prevents
erroneously missing the correct match to record (e.g., without it, we would
have identified 'aa' and 'a' at the end as matches and missed the 'aab').
So starting on index 4, we have (3 - 2) * 2 = 2
Starting on index 5, we have a match length 3.
1 + 2 + 3 + 1 = 7

"ababcabab":
Starting on index 3, we have a match length 2.
Starting on index 6, we have a match length 4.
2 + 4 + 2 = 8
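
The reasoning above (reuse earlier matches, never rescan characters already covered) is close in spirit to the standard Z-algorithm, which computes the length of the match with the whole string for every starting position in O(n) total. The Java sketch below uses that textbook algorithm rather than the exact bookkeeping described in the examples. The names `ZFunctionCount`, `zFunction` and `countMatches` are chosen just for this illustration.

```java
class ZFunctionCount {
    // z[i] = length of the longest common prefix of s and s.substring(i), for i >= 1.
    // z[0] is left at 0 so the string is not counted as matching itself.
    static int[] zFunction(String s) {
        int n = s.length();
        int[] z = new int[n];
        int l = 0, r = 0; // [l, r) is the match window reaching furthest to the right
        for (int i = 1; i < n; i++) {
            if (i < r) {
                // Inside a known match: reuse the value from the mirrored position,
                // capped at the end of the window.
                z[i] = Math.min(r - i, z[i - l]);
            }
            // Extend by direct comparison; r only ever moves forward, so the
            // total work over all i is O(n).
            while (i + z[i] < n && s.charAt(z[i]) == s.charAt(i + z[i])) {
                z[i]++;
            }
            if (i + z[i] > r) {
                l = i;
                r = i + z[i];
            }
        }
        return z;
    }

    // Sum of the match lengths over all proper suffixes.
    static long countMatches(String s) {
        long total = 0;
        for (int v : zFunction(s)) total += v;
        return total;
    }

    public static void main(String[] args) {
        String[] samples = {"aaaaaa", "aabaabaa", "aabaabaabaabaa", "aabaaab", "ababcabab"};
        for (String s : samples) {
            System.out.println(s + " -> " + countMatches(s));
        }
    }
}
```

Run on the five example strings, this should give 15, 10, 31, 7 and 8, the same totals as worked out by hand above.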

