简体   繁体   中英

perl: string matching to find longest substring

$string1 = "peachbananaapplepear";
$string2 = "juicenaapplewatermelonpear";

I want to know what's the longest common substring containing the word "apple".

$string2 =~ m/.+apple.+/;
print $string2;

So I use the match operator, and .+ for matching any character before and after the keyword "apple". When I print $string2 it doesn't return naapple but returns the original $string2 instead.

Here is one approach. First get the locations where 'apple' appears in the strings. And for each of those locations in string1, look at all locations in string2. Look to the left and right to see how far the commonality extends from the initial location.

$string1 = "peachbananaapplepear12345applegrapeapplebcdefghijk";
$string2 = "juicenaapplewatermelonpearkiwi12345applebcdefghijkberryapple";

my $SearchFor="apple";
my $SearchStrLen = length($SearchFor);

# Get locations in first string where the search term appears
my @FirstPositions =  getPostions($string1);
# Get locations in second string where the search term appears
my @SecondPositions =  getPostions($string2);

CheckForMaxMatch();

sub getPostions
{
  my $GivenString = shift;
  my @Positions;
  my $j=0;
  for (my $i=0; $i < length($GivenString); $i += ($SearchStrLen+1) )
  {
    $j = index($GivenString, $SearchFor, $i);
    if ($j == -1) {
     last;
    }
    push (@Positions, $j);
    $i = $j;
  }

  return @Positions;
}

sub CheckForMaxMatch
{
  my $MaxLeft=0;
  # From the location of 'apple', look to the left and right
  # to see how far the characters are same
  for my $i (@FirstPositions) {
    for my $j (@SecondPositions) {
      my $LeftMatchPos = getMaxMatch($i, $j, -1);
      my $RightMatchPos = getMaxMatch($i, $j, 1);

      if ( ($RightMatchPos - $LeftMatchPos) > ($MaxRight - $MaxLeft) ) {
        $MaxLeft = $LeftMatchPos;
        $MaxRight = $RightMatchPos;
      }
    }
  }

  my $LongestSubString = substr($string1, $MaxLeft, $MaxRight-$MaxLeft);
  print "Longest common substring is: $LongestSubString\n";
  print "It begins at $MaxLeft and ends at $MaxRight in string1\n";
}

sub getMaxMatch
{
  my $i= shift;
  my $j= shift;
  my $direction= shift;

  my $k = ($direction >= 1 ? $SearchStrLen : 0);

  my $FirstChar = substr($string1, $i+($k * $direction), 1);
  my $SecondChar = substr($string2, $j+($k * $direction), 1);

  for ( ; $FirstChar && $SecondChar; $k++ )
  {
    $FirstChar = substr($string1, $i+($k * $direction), 1);
    $SecondChar = substr($string2, $j+($k * $direction), 1);
    if ( $FirstChar ne $SecondChar ) {
      $direction < 1 ? $k-- : "";
      my $pos = ($k ? ($i + $k * $direction) : $i);
      return $pos;
    }
  }

  return $i;
}

The =~ operator is not going to reassign the value of $string2. Try this:

$string2 =~ m/(.+apple.+)/;
my $match = $1;
print $match

Based on the general algorithm , but tracks not only the length of the current run ( @l ), but whether it includes the keyword ( @k ). Only runs that include the keyword are considered for longest run.

use strict;
use warnings;
use feature qw( say );

sub find_substrs {
   our $s;   local *s   = \shift;
   our $key; local *key = \shift;

   my @positions;
   my $position = -1;
   while (1) {
      $position = index($s, $key, $position+1);
      last if $position < 0;

      push @positions, $position;
   }

   return @positions;
}

sub lcsubstr_which_include {
   our $s1;  local *s1  = \shift;
   our $s2;  local *s2  = \shift;
   our $key; local *key = \shift;

   my @key_starts1 = find_substrs($s1, $key)
      or return;

   my @key_starts2 = find_substrs($s2, $key)
      or return;

   my @is_key_start1;  $is_key_start1[$_] = 1 for @key_starts1;
   my @is_key_start2;  $is_key_start2[$_] = 1 for @key_starts2;

   my @s1 = split(//, $s1);
   my @s2 = split(//, $s2);

   my $length = 0;
   my @rv;
   my @l = ( 0 ) x ( @s1 + 1 );  # Last ele is read when $i1==0.
   my @k = ( 0 ) x ( @s1 + 1 );  # Same.
   for my $i2 (0..$#s2) {
      for my $i1 (reverse 0..$#s1) {
         if ($s1[$i1] eq $s2[$i2]) {
            $l[$i1] = $l[$i1-1] + 1;
            $k[$i1] = $k[$i1-1] || ( $is_key_start1[$i1] && $is_key_start2[$i2] );

            if ($k[$i1]) {
               if ($l[$i1] > $length) {
                  $length = $l[$i1];
                  @rv = [ $i1, $i2, $length ];
               }
               elsif ($l[$i1] == $length) {
                  push @rv, [ $i1, $i2, $length ];
               }
            }
         } else {
            $l[$i1] = 0;
            $k[$i1] = 0;
         }
      }
   }

   for (@rv) {
      $_->[0] -= $length;
      $_->[1] -= $length;
   }

   return @rv;
}

{
   my $s1 = "peachbananaapplepear";
   my $s2 = "juicenaapplewatermelonpear";
   my $key = "apple";

   for (lcsubstr_which_include($s1, $s2, $key)) {
      my ($s1_pos, $s2_pos, $length) = @$_;
      say substr($s1, $s1_pos, $length);
   }
}

This solution in O(NM), meaning it scales amazingly well (for what it does).

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM