
Randomized algorithm for string matching

Question:

Given a text t[1...n, 1...n] and a pattern p[1...m, 1...m], with n = 2m, over the alphabet [0, Sigma-1], we say p matches t at [i,j] if t[i+k-1, j+L-1] = p[k,L] for all k, L. Design a randomized algorithm to find all matches in O(n^2) time with high probability.


Can someone help me understand what this text means? I believe it is saying that t has two words in it and the pattern also has two words, but each word in the pattern is half the length of those in t. However, from here I don't understand how the range of [i,j] comes into play. That "if" condition goes over my head.

This could also be saying that t and p are 2D arrays and you are trying to match a "box" from the pattern in the 2D array t.

Any help would be appreciated, thank you!

Answer:

The problem asks you to find a 2D pattern, i.e. the one defined by the array p, inside the array t, which is also 2D.

The most obvious randomized solution to this problem would be to generate two random indices i and j and then search for the pattern starting from (i, j).

To avoid redundant searches, you can keep track of which pairs (i, j) you have already visited; a simple 2D lookup array is enough for this.

In the worst case this checks all (n-m+1)^2 candidate positions with an O(m^2) comparison at each, so the complexity is O(n^2 m^2), which is O(n^4) for m = n/2.
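As a baseline for checking any faster method, the match definition from the question can be written down directly. This is only a sketch: it uses 0-indexed lists of lists rather than the 1-indexed arrays of the problem statement, and the function names are made up for illustration.

```python
def matches_at(t, p, i, j):
    """Return True if p occurs in t with its top-left corner at (i, j), 0-indexed."""
    m = len(p)
    return all(t[i + k][j + l] == p[k][l] for k in range(m) for l in range(m))


def brute_force_matches(t, p):
    """List every (i, j) where p occurs in t, by direct cell-by-cell comparison."""
    n, m = len(t), len(p)
    return [(i, j)
            for i in range(n - m + 1)
            for j in range(n - m + 1)
            if matches_at(t, p, i, j)]


# A 4 x 4 text and a 2 x 2 pattern, so n = 2m as in the question.
t = [
    [1, 2, 1, 2],
    [3, 4, 3, 4],
    [1, 2, 1, 2],
    [3, 4, 3, 4],
]
p = [
    [1, 2],
    [3, 4],
]
print(brute_force_matches(t, p))
```

Visiting the positions in random order instead of row by row does not change the set of matches found, only the order in which they are reported.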


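You can also use hashing to compare the strings and reduce the complexity. With per-row hashes, each candidate position costs O(m) instead of O(m^2); reaching O(n^2) overall additionally requires combining the m row hashes of a candidate into a single value that can be updated in O(1) as the window moves.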

You first need to hash the array t row by row and store the values in an array such as hashT; you can use the rolling hash algorithm for that.

You can then hash the array p with the same rolling hash algorithm and store the hashes, one per row, in the array hashP.
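A minimal sketch of how the two tables might be built, assuming a polynomial rolling hash. The modulus, the base, and the helper names are illustrative choices, not something the answer specifies; a randomized algorithm would draw the base at random rather than fix it.

```python
MOD = (1 << 61) - 1  # assumed modulus (a Mersenne prime), not from the answer
BASE = 257           # assumed base; fixed here only to keep the example reproducible


def prefix_hashes(row):
    """Return h where h[c] is the polynomial hash of row[:c], with h[0] == 0."""
    h = [0]
    for symbol in row:
        h.append((h[-1] * BASE + symbol) % MOD)
    return h


def build_hash_tables(t, p):
    """hashT holds the prefix hashes of each row of t; hashP one hash per row of p."""
    hashT = [prefix_hashes(row) for row in t]
    hashP = [prefix_hashes(row)[-1] for row in p]
    return hashT, hashP


t = [[1, 2, 1], [3, 4, 3], [1, 2, 1]]
p = [[1, 2], [3, 4]]
hashT, hashP = build_hash_tables(t, p)
print(hashT[0])  # prefix hashes of the first row of t
print(hashP)     # one hash per row of p
```

Both tables take time proportional to the number of cells, so building them costs O(n^2) in total.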

Then, when you generate the random pair (i, j), you can get the hashes of the corresponding rows of t from hashT and compare them with hashP in linear time, instead of the brute-force comparison that takes quadratic time. (Note that hashes can collide; when the hashes match, you can fall back to a brute-force comparison to be completely sure.)

To find the corresponding hash using hashT, we can do the following. Suppose the current pair (i, j) is (3, 4) and the dimensions of the array p are 2 x 3.

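Then the first row of p is compared with columns 4 to 6 of row 3 of t by checking hashT[3][6] - hashT[3][3] == hashP[1], and the second row is checked the same way against row 4 of t. Here hashT[r][c] is assumed to be the hash of the first c symbols of row r; this logic comes from the rolling hash algorithm.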
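The plain subtraction above is exact only when the prefix hash is additive. For a polynomial rolling hash, the subtracted prefix has to be scaled first. The following is a sketch of that identity, where the base b, the modulus q, and the window width w are assumed symbols that the answer does not name, and H_r(c) stands for hashT[r][c] with 1-indexed columns:

```latex
H_r(0) = 0, \qquad
H_r(c) = \bigl(H_r(c-1)\, b + t[r, c]\bigr) \bmod q

\operatorname{hash}\bigl(t[r,\; j \ldots j+w-1]\bigr)
  = \bigl(H_r(j+w-1) - H_r(j-1)\, b^{w}\bigr) \bmod q
```

Expanding H_r as a sum of t[r, x] b^{c-x} shows that the terms for columns 1 to j-1 cancel, leaving exactly the hash the window would have as a standalone string, which is what hashP stores for a row of p.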

Pseudocode for the search in linear time using hashing:

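// hashT[r][c]: prefix hash of row r of t over columns 1..c, with hashT[r][0] = 0
// hashP[k]: hash of row k of p; rows and columns are 1-indexed
// plain subtraction assumes an additive prefix hash
hashT[][], hashP[]

// random top-left corner of the candidate window
i = rand(), j = rand();

for(int k = i; k < i + lengthOfColumn(p); k++){
    if((hashT[k][j + lengthOfRow(p) - 1] - hashT[k][j-1]) != hashP[k - i + 1]){
        // pattern does not match.
        return false;
    }
}
// every row hash matched; compare directly to rule out a collision
return true;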
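One way the O(n^2) target might be reached is to hash in both directions: first hash every length-m horizontal window of each row, then run a second rolling hash down each column of those row hashes, so that every candidate position is tested in O(1). The sketch below goes beyond the per-row scheme in the answer; the modulus, the random bases, and all function names are assumptions made for illustration, and it uses 0-indexed lists of lists.

```python
import random

MOD = (1 << 61) - 1  # assumed modulus (a Mersenne prime)


def window_hashes(seq, width, base):
    """Polynomial hashes of every length-`width` window of seq, in O(len(seq))."""
    top = pow(base, width - 1, MOD)  # weight of the symbol leaving the window
    h = 0
    for x in seq[:width]:
        h = (h * base + x) % MOD
    out = [h]
    for old, new in zip(seq, seq[width:]):
        h = ((h - old * top) * base + new) % MOD
        out.append(h)
    return out


def find_matches(t, p, verify=True, rng=random):
    """Return all 0-indexed (i, j) where the m x m pattern p occurs in t."""
    n, m = len(t), len(p)
    if m == 0 or m > n:
        return []
    row_base = rng.randrange(256, MOD)  # random bases make collisions unlikely
    col_base = rng.randrange(256, MOD)

    # Step 1: hash each length-m horizontal window of every row of t.
    row_hash = [window_hashes(row, m, row_base) for row in t]
    # Step 2: hash the rows of p, then combine them from top to bottom.
    p_rows = [window_hashes(row, m, row_base)[0] for row in p]
    p_hash = window_hashes(p_rows, m, col_base)[0]

    matches = []
    for j in range(n - m + 1):
        # Step 3: slide an m-tall window down column j of the row hashes.
        column = [row_hash[r][j] for r in range(n)]
        for i, h in enumerate(window_hashes(column, m, col_base)):
            # Optional direct comparison rules out hash collisions.
            if h == p_hash and (not verify or all(
                    t[i + k][j:j + m] == p[k] for k in range(m))):
                matches.append((i, j))
    return sorted(matches)
```

With verify=False the scan does O(n^2) arithmetic operations and reports a false match only if two different windows collide under the random bases. With verify=True the result is always exact, at the price of an extra O(m^2) comparison for each position whose hash matches.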
