
Finding a sub-string in an array with better time complexity (preferably in perl)

I have an array of random alphabetical strings; the length of the array is 290K+.

Now I want to check whether any string in the array is a substring of any other string in the array.

My code:

for my $z (0..$seq_len-1)
{
    my $seq1 = $seq[$z];

    for my $y (0..$seq_len-1)
    {
        my $seq2 = $seq[$y];

        if($z != $y)
        {
#           my $anything = '.*';
#           my $pattern = $anything.$seq2.$anything;
            if($seq1 =~ m/$seq2/)
            {
                push @::uniq, $identifiers[$z];
                push @::duplicate, $identifiers[$y];
            }
        }
    }
}

The code works fine, but is there a better approach to accomplish this task?

Edit

Thanks for pointing out the unnecessary usage in the regexp; I removed it, but it still makes little difference.

Thanks in advance.

You can use a suffix tree.

Populate the tree with all of the strings, then iterate over the collection and check whether each string is a prefix of some suffix in the tree that does not come only from that string itself.
The idea: if a string s is a prefix of some suffix of another string, then s is a substring of that string (and it is easy to find which one in this data structure).

This solution is quite efficient in terms of asymptotic complexity, but it requires a more complex data structure.

This solution runs in O(n*|S|), where |S| is the length of a string, which is much more efficient than your O(n^2*R(|S|)), where R(|S|) is your regex complexity.
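To make the lookup idea concrete, here is a minimal sketch. Note that it is not a real suffix tree: it uses a naive, uncompressed suffix trie built from nested hashes, which costs O(|S|^2) time and memory per string instead of the linear bound quoted above, so treat it as an illustration only. The names build_suffix_trie and find_container, and the reserved "\0ids" key, are made up for this example.

```perl
use strict;
use warnings;

# Build a naive (uncompressed) suffix trie over all strings.
# Each node is a hash ref: single characters map to child nodes, and the
# reserved multi-character key "\0ids" holds up to two distinct string
# indexes whose suffixes pass through that node. Two are enough to know
# whether a string OTHER than the one being queried reaches the node.
sub build_suffix_trie {
    my @strings = @_;
    my $root = {};
    for my $i (0 .. $#strings) {
        my $s = $strings[$i];
        for my $start (0 .. length($s) - 1) {
            my $node = $root;
            for my $ch (split //, substr($s, $start)) {
                $node = $node->{$ch} //= {};
                my $ids = $node->{"\0ids"} //= [];
                push @$ids, $i if @$ids < 2 && !grep { $_ == $i } @$ids;
            }
        }
    }
    return $root;
}

# Return the index of some other string that contains $strings->[$i],
# or undef if there is none.
sub find_container {
    my ($root, $strings, $i) = @_;
    my $node = $root;
    for my $ch (split //, $strings->[$i]) {
        $node = $node->{$ch} or return;
    }
    my ($other) = grep { $_ != $i } @{ $node->{"\0ids"} // [] };
    return $other;
}

my @seqs = qw(ABCD BC XYZ CD);
my $trie = build_suffix_trie(@seqs);
for my $i (0 .. $#seqs) {
    my $j = find_container($trie, \@seqs, $i);
    print defined $j
        ? "$seqs[$i] is contained in $seqs[$j]\n"
        : "$seqs[$i] is not contained in another string\n";
}
```

For 290K+ strings the memory use of this naive trie may well be prohibitive; a compressed suffix tree or a suffix array would be the structure to reach for in practice.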

For starters:

  • You're being inefficient with your pattern. The .* wrapping is entirely irrelevant. /.*pattern.*/ will match the same things as /pattern/ .
  • You're making pointless comparisons: you don't need to compare bidirectionally at all, because a longer string cannot be a substring of a shorter one. So you can shorten your 'for' loops so that the inner ( $y ) loop starts after $z, and just make sure you test whether the shorter string is a substring of the longer.
  • You might find that compiling a regular expression for each element (and reusing it) improves things; otherwise you're 'restarting' the regular expression engine each time. (see http://perldoc.perl.org/perlop.html#Regexp-Quote-Like-Operators )
  • You should also be able to chain matches: A is a substring of AB, which means you don't need to test individually whether ABC, ABCD, etc. contain both. If a string contains the longer one, it also contains the shorter.

Whether these are worth doing depends rather on the size of your lists.
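A rough sketch of the second point, assuming the strings are plain text: sort the indexes by string length and only test each shorter string against the longer ones that follow it. It swaps the regex for the built-in index(), which does a literal substring search and so avoids the pattern engine altogether. The sub name find_contained_pairs is invented for this example, and the loop is still quadratic; it just does about half the comparisons.

```perl
use strict;
use warnings;

# Return a list of [short_index, long_index] pairs, where the string at
# short_index occurs inside the string at long_index.
sub find_contained_pairs {
    my @seqs = @_;

    # Indexes ordered by string length, shortest first.
    my @order = sort { length($seqs[$a]) <=> length($seqs[$b]) } 0 .. $#seqs;

    my @pairs;
    for my $p (0 .. $#order) {
        my $short = $seqs[ $order[$p] ];
        # Only strings at least as long as $short can contain it.
        for my $q ($p + 1 .. $#order) {
            my $long = $seqs[ $order[$q] ];
            push @pairs, [ $order[$p], $order[$q] ]
                if index($long, $short) >= 0;
        }
    }
    return @pairs;
}

my @seqs = qw(ABCD BC XYZ B);
for my $pair (find_contained_pairs(@seqs)) {
    my ($short, $long) = @$pair;
    print "$seqs[$short] is a substring of $seqs[$long]\n";
}
```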

The following reduces the work from N^2 regexp matches to N of them. Each regexp is matched against a much longer string than before, but the savings should still be quite noticeable.

my $encoded_seqs = "\0" . join("\0", @seqs) . "\0";
for my $seq (@seqs) {
   if (
      $encoded_seqs =~ /\0 (?: \Q$seq\E [^\0]+ | [^\0]+ \Q$seq\E [^\0]* )/x
   ) {
      print("$seq is contained by another.\n");
   } else {
      print("$seq isn't contained by another.\n");
   }
}

To find one of the matches:

my $encoded_seqs = "\0" . join("\0", @seqs) . "\0";
for my $seq (@seqs) {
   if (
      my ($match) =
         $encoded_seqs =~ /\0 ( \Q$seq\E [^\0]+ | [^\0]+ \Q$seq\E [^\0]* )/x
   ) {
      print("$seq is contained by $match, and possibly others.\n");
   } else {
      print("$seq isn't contained by another.\n");
   }
}

To find all of the matches:

my $encoded_seqs = "\0" . join("\0", @seqs) . "\0";
for my $seq (@seqs) {
   if (
      my @matches =
         $encoded_seqs =~ /\0 ( \Q$seq\E [^\0]+ | [^\0]+ \Q$seq\E [^\0]* )/xg
   ) {
      print("$seq is contained by @matches\n");
   } else {
      print("$seq isn't contained by another.\n");
   }
}

Possibly a little bit faster:

$encoded_seqs =~ /\0 ( (?>[^\0]*) \Q$seq\E (?>[^\0]*) ) (?<! \0 \Q$seq\E )/xg

All of the above assume that NUL can't appear in any of the sequences. If the sequences can contain any character, you can use the following instead:

# Hides "~" in a lossless way.
my @decode = qw( ! ~ );
my %encode = map { $decode[$_] => $decode[0].$_ } 0..$#decode;
sub encode(_) { return $_[0] =~ s/([!~])/$encode{$1}/gr }
sub decode(_) { return $_[0] =~ s/!(.)/$decode[$1]/sgr }

my $encoded_seqs = '~' . join('~', map encode, @seqs) . '~';
for my $seq (@seqs) {
   my $encoded_seq = encode($seq);

   # Use ~ instead of \0.
   # Use $encoded_seq instead of $seq.
   # Use decode() on the values in $match and @matches.
}

You are adding complexity and runtime here:

    my $anything = '.*';
    my $pattern = $anything.$seq2.$anything;
    if($seq1 =~ m/$pattern/)

The .* before and after $seq2 serve no purpose, because /foo/ is functionally identical to /.*foo.*/ .
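If you want to convince yourself, a quick check along these lines should do. The three helper subs are hypothetical and exist only for the comparison; the third uses index(), which is a plain substring search with no regex at all.

```perl
use strict;
use warnings;

# Hypothetical helpers: each returns 1 if $needle occurs in $hay, else 0.
sub matches_plain {
    my ($hay, $needle) = @_;
    return $hay =~ /\Q$needle\E/ ? 1 : 0;
}

sub matches_wrapped {
    my ($hay, $needle) = @_;
    return $hay =~ /.*\Q$needle\E.*/ ? 1 : 0;
}

sub matches_index {
    my ($hay, $needle) = @_;
    return index($hay, $needle) >= 0 ? 1 : 0;
}

for my $case ([ 'xxfooyy', 'foo' ], [ 'foo', 'foo' ], [ 'fo', 'foo' ]) {
    my ($hay, $needle) = @$case;
    printf "%-8s %-4s plain=%d wrapped=%d index=%d\n",
        $hay, $needle,
        matches_plain($hay, $needle),
        matches_wrapped($hay, $needle),
        matches_index($hay, $needle);
}
```

All three columns should agree on every row; the wrapped form just gives the regex engine extra work to do.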
