Finding a sub-string in an array with better time complexity (preferably in perl)
I have an array of random alphabetical strings; the array holds more than 290K entries.
Now I want to check whether any string in the array is a substring of any other string in the array.
My code:
for my $z (0..$seq_len-1)
{
    my $seq1 = $seq[$z];
    for my $y (0..$seq_len-1)
    {
        my $seq2 = $seq[$y];
        if($z != $y)
        {
            # my $anything = '.*';
            # my $pattern = $anything.$seq2.$anything;
            if($seq1 =~ m/$seq2/)
            {
                push @::uniq, $identifiers[$z];
                push @::duplicate, $identifiers[$y];
            }
        }
    }
}
The code works fine, but is there a better approach to accomplish this task?
Thanks for pointing out the unnecessary usage in the regexp; I removed that, but it still doesn't make much of a difference.
Thanks in advance.
You can use a suffix tree.
Populate the tree with all of the strings, then iterate over the collection and check whether any string is a prefix of some suffix in the tree that does not belong to that same string.
The idea: if you find a suffix that a string s is a prefix of, then s is a substring of some other string (and it is easy to find which one in this data structure).
This solution is pretty efficient in terms of asymptotic complexity, but it requires a more complex data structure.
This solution runs in O(n*|S|), where |S| is the length of a string, which is much more efficient than your O(n^2*R(|S|)), where R(|S|) is your regex complexity.
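To make the "prefix of some suffix" idea concrete, here is a minimal sketch. It deliberately swaps the suffix tree for a naive, uncompressed suffix trie built from nested hashes, so it needs space proportional to the sum of the squared string lengths and does not reach the O(n*|S|) bound quoted above. The names build_suffix_trie and containers_of are hypothetical helpers made up for this illustration.

```perl
use strict;
use warnings;

# Naive suffix trie (NOT a compressed suffix tree): every suffix of every
# string is inserted one character at a time. Each node records the indices
# of the strings that have a suffix passing through it.
sub build_suffix_trie {
    my @strings = @_;
    my %root;
    for my $i (0 .. $#strings) {
        my $s = $strings[$i];
        for my $start (0 .. length($s) - 1) {
            my $node = \%root;
            for my $ch (split //, substr($s, $start)) {
                $node = $node->{next}{$ch} //= {};
                $node->{owners}{$i} = 1;
            }
        }
    }
    return \%root;
}

# Walks the trie along $s. Returns the sorted indices of the strings, other
# than index $i itself, that contain $s as a substring.
sub containers_of {
    my ($root, $s, $i) = @_;
    my $node = $root;
    for my $ch (split //, $s) {
        my $next = $node->{next} or return;
        $node = $next->{$ch} or return;
    }
    my $owners = $node->{owners} || {};
    my @found  = sort { $a <=> $b } grep { $_ != $i } keys %$owners;
    return @found;
}

my @strings = ('abc', 'b', 'xbcx', 'zz');
my $trie    = build_suffix_trie(@strings);
for my $i (0 .. $#strings) {
    my @in = containers_of($trie, $strings[$i], $i);
    print "$strings[$i] is contained by: @strings[@in]\n" if @in;
}
```

With this sample input only 'b' is reported, as being contained by 'abc' and 'xbcx'. Identical strings stored at different indices count as containing each other here. For 290K strings a real suffix tree or suffix automaton would be needed to keep memory in check.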
For starters:
The .* wrapping is entirely irrelevant. /.*pattern.*/ will match the same things as /pattern/.
The inner ($y) loop can start at $z; just ensure you test the shorter string for being a substring of the longer one.
Suppose A is a substring of AB. Then you don't need to test individually whether ABC, ABCD, etc. match both: if they match the longer one, they match the shorter.
Whether these are worth doing depends rather on the size of your lists.
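The second point can be sketched as follows. This is only an illustration under a few assumptions: the strings are sorted by length first so that the earlier element of each pair is never the longer one, the inner loop starts at $z + 1 to skip the self-comparison, and the built-in index() replaces the regex match. contained_pairs is a hypothetical helper name.

```perl
use strict;
use warnings;

# Returns [shorter, longer] pairs where the shorter string occurs inside the
# longer one. Every unordered pair is examined exactly once.
sub contained_pairs {
    my @by_len = sort { length($a) <=> length($b) } @_;
    my @pairs;
    for my $z (0 .. $#by_len) {
        for my $y ($z + 1 .. $#by_len) {
            # $by_len[$z] is never longer than $by_len[$y], so only this
            # direction needs testing.
            push @pairs, [ $by_len[$z], $by_len[$y] ]
                if index($by_len[$y], $by_len[$z]) >= 0;
        }
    }
    return @pairs;
}

for my $pair (contained_pairs(qw(abcd bc x cx))) {
    print "$pair->[0] is a substring of $pair->[1]\n";
}
```

This halves the number of comparisons but is still quadratic in the number of strings, so on its own it is unlikely to be enough for 290K entries. Two identical strings are reported as a pair.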
The following reduces the work from N^2 regexp matches to N of them. The regexp is matched against a much longer string than before, but the savings should still be quite noticeable.
my $encoded_seqs = "\0" . join("\0", @seqs) . "\0";
for my $seq (@seqs) {
    if (
        $encoded_seqs =~ /\0 (?: \Q$seq\E [^\0]+ | [^\0]+ \Q$seq\E [^\0]* )/x
    ) {
        print("$seq is contained by another.\n");
    } else {
        print("$seq isn't contained by another.\n");
    }
}
To find one of the matches:
my $encoded_seqs = "\0" . join("\0", @seqs) . "\0";
for my $seq (@seqs) {
    if (
        my ($match) =
            $encoded_seqs =~ /\0 ( \Q$seq\E [^\0]+ | [^\0]+ \Q$seq\E [^\0]* )/x
    ) {
        print("$seq is contained by $match, and possibly others.\n");
    } else {
        print("$seq isn't contained by another.\n");
    }
}
To find all of the matches:
my $encoded_seqs = "\0" . join("\0", @seqs) . "\0";
for my $seq (@seqs) {
    if (
        my @matches =
            $encoded_seqs =~ /\0 ( \Q$seq\E [^\0]+ | [^\0]+ \Q$seq\E [^\0]* )/xg
    ) {
        print("$seq is contained by @matches\n");
    } else {
        print("$seq isn't contained by another.\n");
    }
}
Possibly a little bit faster:
$encoded_seqs =~ /\0 ( [^\0]* \Q$seq\E (?>[^\0]*) ) (?<! \0 \Q$seq\E )/xg
All of the above assume that NUL can't be in any of the sequences. If the sequences can contain any character, you can use the following instead:
# Hides "~" in a lossless way.
my @decode = qw( ! ~ );
my %encode = map { $decode[$_] => $decode[0].$_ } 0..$#decode;
sub encode(_) { return $_[0] =~ s/([!~])/$encode{$1}/gr }
sub decode(_) { return $_[0] =~ s/!(.)/$decode[$1]/sgr }
my $encoded_seqs = '~' . join('~', map encode, @seqs) . '~';
for my $seq (@seqs) {
    my $encoded_seq = encode($seq);
    # Use ~ instead of \0.
    # Use $encoded_seq instead of $seq.
    # Use decode() on the values in $match and @matches.
}
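The loop body above is left as comments, so here is one way it might be filled in, as a sketch rather than the answerer's exact code. find_containers is a hypothetical wrapper name, and encode and decode are written without prototypes for simplicity. One further assumption goes beyond the comments: in the second alternative the text before the sequence is matched as whole tokens (a plain character, or an escape pair "!0" or "!1"), so that a sequence such as "0" cannot match the second half of an escape pair.

```perl
use strict;
use warnings;
use 5.014;    # the /r substitution flag needs Perl 5.14 or later

my @decode = qw( ! ~ );
my %encode = map { $decode[$_] => $decode[0] . $_ } 0 .. $#decode;

sub encode { my ($s) = @_; return $s =~ s/([!~])/$encode{$1}/gr }
sub decode { my ($s) = @_; return $s =~ s/!(.)/$decode[$1]/sgr }

# Returns a hash ref mapping each sequence to an array ref of the other
# sequences that contain it.
sub find_containers {
    my @seqs = @_;
    my $encoded_seqs = '~' . join('~', map { encode($_) } @seqs) . '~';
    my %containers;
    for my $seq (@seqs) {
        my $encoded_seq = encode($seq);
        my @matches = $encoded_seqs =~ /
            ~ (
                \Q$encoded_seq\E [^~]+        # sequence is a proper prefix
              | (?: [^!~] | ![01] )+          # whole tokens only (assumption)
                \Q$encoded_seq\E [^~]*        # sequence starts later on
            )
        /xg;
        $containers{$seq} = [ map { decode($_) } @matches ];
    }
    return \%containers;
}

my $found = find_containers('a~b', '~', '0', 'x!', 'xa~bx');
for my $seq (sort keys %$found) {
    my @in = @{ $found->{$seq} };
    print "$seq is contained by: @in\n" if @in;
}
```

In this sample, "0" is correctly reported as not contained in "x!", even though the encoded form of "x!" is "x!0".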
You are adding complexity and runtime here:
my $anything = '.*';
my $pattern = $anything.$seq2.$anything;
if($seq1 =~ m/$pattern/)
The .* before and after $seq2 serve no purpose, because /foo/ is functionally identical to /.*foo.*/.
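A quick way to convince yourself of this is to compare the verdicts of both patterns over a handful of strings. The helper name same_verdicts and the sample strings are made up for this illustration; \Q...\E is added so that the needle is treated as literal text.

```perl
use strict;
use warnings;

# Returns 1 if the plain and the .*-wrapped pattern agree (match or no match)
# on every haystack, and 0 otherwise.
sub same_verdicts {
    my ($needle, @haystacks) = @_;
    for my $h (@haystacks) {
        my $plain   = $h =~ /\Q$needle\E/     ? 1 : 0;
        my $wrapped = $h =~ /.*\Q$needle\E.*/ ? 1 : 0;
        return 0 if $plain != $wrapped;
    }
    return 1;
}

print same_verdicts('foo', 'xfoox', 'bar', 'foo', "a\nfoo\nb")
    ? "both patterns agree\n"
    : "patterns disagree\n";
```

Because an unanchored pattern may start matching at any position, the surrounding .* only gives the regex engine extra work. For a plain substring test, index($haystack, $needle) >= 0 avoids the regex engine altogether.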