Randomized algorithm for string matching
Question:
Given a text t[1...n, 1...n] and a pattern p[1...m, 1...m], with n = 2m, over the alphabet [0, Sigma-1], we say p matches t at [i,j] if t[i+k-1, j+L-1] = p[k,L] for all k,L. Design a randomized algorithm to find all matches in O(n^2) time with high probability.
Can someone help me understand what this text means? I believe it is saying that t contains two words and the pattern is also two words, with each pattern word half the length of those in t. However, from there I don't understand how the range of [i,j] comes into play. That "if" condition goes over my head.
This could also be saying that t and p are 2D arrays and you are trying to match a "box", given by the pattern, inside the 2D array t.
Any help would be appreciated, thank you!
The problem asks you to find a 2D pattern, i.e. the one defined by the array p, inside the array t, which is also 2D.
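To make the definition of a match concrete, here is a minimal sketch of the check it describes. The function name `matches_at` and the sample arrays are my own illustration, and Python indexes from 0, so the position (1, 1) below corresponds to [2,2] in the 1-indexed notation of the question.

```python
def matches_at(t, p, i, j):
    """True if p occurs in t with its top-left corner at (i, j), 0-indexed."""
    m = len(p)
    # Direct translation of t[i+k-1, j+L-1] = p[k, L] for all k, L.
    return all(t[i + k][j + l] == p[k][l] for k in range(m) for l in range(m))

# Hypothetical 4 x 4 text and 2 x 2 pattern (n = 2m).
t = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 6, 7, 1],
    [2, 3, 4, 5],
]
p = [
    [6, 7],
    [6, 7],
]

print(matches_at(t, p, 1, 1))  # True: the box at rows 1-2, columns 1-2
print(matches_at(t, p, 0, 0))  # False
```

So [i,j] is just the top-left corner of the m x m box of t that is compared against p.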
The most obvious randomized solution to this problem would be to generate two random indexes i and j and then start searching for the pattern from that position (i, j).
To avoid redundant searches, you can keep track of which pairs (i, j) you have already visited; a simple 2D lookup array is enough for this.
Each check compares up to m^2 characters and there are O(n^2) candidate positions, so with n = 2m the complexity of the above is O(n^4) in the worst case.
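The random-position search with a visited table could look roughly like the sketch below. The function name and the `seed` parameter are assumptions made for illustration; the character-by-character comparison is the plain brute-force check, without hashing.

```python
import random

def find_matches_random_order(t, p, seed=0):
    """Probe top-left corners in random order; return all matches, sorted."""
    n, m = len(t), len(p)
    side = n - m + 1                      # number of valid corners per axis
    visited = [[False] * side for _ in range(side)]
    remaining = side * side
    rng = random.Random(seed)
    found = []
    while remaining:
        i, j = rng.randrange(side), rng.randrange(side)
        if visited[i][j]:
            continue                      # already checked this corner
        visited[i][j] = True
        remaining -= 1
        # Brute-force comparison: up to m * m character checks.
        if all(t[i + k][j + l] == p[k][l] for k in range(m) for l in range(m)):
            found.append((i, j))
    return sorted(found)

zeros = [[0] * 4 for _ in range(4)]
print(find_matches_random_order(zeros, [[0, 0], [0, 0]]))
```

Note that drawing corners at random until every one has been visited wastes some draws on repeats; shuffling the list of corners once would avoid that.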
You can also use hashing to compare the strings, which is what brings the complexity down towards the required O(n^2).
You first need to hash the array t row by row and store the values in an array such as hashT; you can use a rolling hash for that.
You can then hash the array p with the same rolling hash and store the hashes, one per row, in the array hashP.
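One way to build the two tables is with polynomial prefix hashes, sketched below. The modulus, the random base, the `+ 1` offset on symbols and the helper name `row_prefix_hashes` are all my assumptions; the answer only says to use a rolling hash.

```python
import random

MOD = (1 << 61) - 1                   # large prime modulus (assumed choice)
BASE = random.randrange(2, MOD - 1)   # random base: this is the randomization

def row_prefix_hashes(row):
    """Return prefix, where prefix[c] is the hash of row[:c] and prefix[0] = 0."""
    prefix = [0]
    for symbol in row:
        # Shift the hash so far by one position and append the new symbol.
        prefix.append((prefix[-1] * BASE + symbol + 1) % MOD)
    return prefix

t = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 6, 7, 1], [2, 3, 4, 5]]
p = [[6, 7], [6, 7]]

hashT = [row_prefix_hashes(row) for row in t]       # n rows, n + 1 entries each
hashP = [row_prefix_hashes(row)[-1] for row in p]   # one hash per pattern row
```

Building hashT takes O(n^2) time and hashP takes O(m^2), so the preprocessing fits within the target bound.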
Then, when you generate the random pair (i, j), you can get the hashes of the corresponding block of t from hashT and compare them with hashP in linear time, instead of doing the brute-force comparison that takes quadratic time. (Note that hashes can collide; to be completely sure, you can fall back to a brute-force comparison whenever the hashes match.)
To find the corresponding hash using hashT we can do the following. Suppose the current pair (i, j) is (3, 4) and the dimensions of the array p are 2 x 3.
Then, for the first pattern row, we can compare hashT[3][6] - hashT[3][3] * B^3 == hashP[1], where B is the base of the rolling hash and all arithmetic is done modulo the hash modulus, and do the same for row 4 of t against hashP[2]. This logic comes from the rolling hash algorithm.
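As a check on the prefix-hash lookup, here is a short derivation. It assumes a polynomial hash with base B modulo a prime q, where H_r[c] denotes the hash of the first c symbols of row r of t; this notation is mine, not the original answer's.

```latex
H_r[c] \equiv \sum_{x=1}^{c} t[r,x]\, B^{\,c-x} \pmod q, \qquad H_r[0] = 0

H_r[j+w-1] - B^{w}\, H_r[j-1]
  \equiv \sum_{x=1}^{j+w-1} t[r,x]\, B^{\,j+w-1-x}
       - \sum_{x=1}^{j-1} t[r,x]\, B^{\,j+w-1-x}
  \equiv \sum_{x=j}^{j+w-1} t[r,x]\, B^{\,j+w-1-x} \pmod q
```

The last sum is exactly the hash of the w symbols of row r starting at column j, which is what a pattern row hashed the same way is compared against. For (i, j) = (3, 4) and a pattern width of w = 3, this is H_3[6] - B^3 H_3[3]. A plain difference of two prefix hashes, without the factor B^w, is only correct for an additive (non-polynomial) hash.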
Pseudocode for search in linear time using hashing:
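// hashT[r][c]: prefix hash of row r of t up to column c (hashT[r][0] = 0)
// hashP[k]: hash of row k of p; w = lengthOfRow(p); powW = BASE^w mod Q
hashT[][], hashP[]
i = rand(1, n - m + 1), j = rand(1, n - m + 1);
for (int k = i; k < i + lengthOfColumn(p); k++) {
    window = (hashT[k][j + w - 1] - hashT[k][j - 1] * powW) mod Q;
    if (window != hashP[k - i + 1]) {
        // pattern does not match at (i, j)
        return false;
    }
}
// all row hashes agree: probable match; compare directly to rule out a collision
return true;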
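The row-by-row check above still costs O(m) per position, which gives O(n^3) over all positions. One common way to reach O(n^2) is to hash the column of row hashes with a second rolling hash, so that each position is tested in O(1). The sketch below is my own illustration of that idea, not the answerer's code; the function name, the two random bases, the modulus and the `verify` flag are assumptions.

```python
import random

MOD = (1 << 61) - 1  # large prime modulus (assumed choice)

def find_matches_2d(t, p, seed=None, verify=True):
    """Return all 0-indexed corners (i, j) where p occurs in t."""
    n, m = len(t), len(p)
    rng = random.Random(seed)
    b_row = rng.randrange(2, MOD - 1)  # base for hashing along a row
    b_col = rng.randrange(2, MOD - 1)  # base for combining row hashes

    def row_windows(row):
        """Hashes of every length-m window of row, via a rolling update."""
        lead = pow(b_row, m - 1, MOD)
        h = 0
        for x in row[:m]:
            h = (h * b_row + x + 1) % MOD
        out = [h]
        for c in range(m, len(row)):
            # Drop the leftmost symbol, shift, and append the new one.
            h = ((h - (row[c - m] + 1) * lead) * b_row + row[c] + 1) % MOD
            out.append(h)
        return out

    # Fingerprint of the whole pattern: hash of its m row hashes.
    hp = 0
    for row in p:
        hp = (hp * b_col + row_windows(row)[0]) % MOD

    rows = [row_windows(r) for r in t]  # rows[r][j] = hash of t[r][j:j+m]
    lead = pow(b_col, m - 1, MOD)
    found = []
    for j in range(n - m + 1):
        h = 0
        for r in range(m):              # fingerprint of the block at (0, j)
            h = (h * b_col + rows[r][j]) % MOD
        for i in range(n - m + 1):
            if h == hp and (not verify or
                            all(t[i + k][j:j + m] == p[k] for k in range(m))):
                found.append((i, j))
            if i + m < n:               # slide the block down by one row
                h = ((h - rows[i][j] * lead) * b_col + rows[i + m][j]) % MOD
    return sorted(found)

t = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 6, 7, 1], [2, 3, 4, 5]]
p = [[6, 7], [6, 7]]
print(find_matches_2d(t, p))
```

With `verify=True` every hash hit is confirmed by a direct comparison, which keeps the output exact but costs O(m^2) per hit, so inputs with very many matches can exceed O(n^2). With `verify=False` the running time is O(n^2) and the answer is correct with high probability over the random choice of bases.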