
Find the common ordered characters between two strings

Given two strings, find the common characters that appear in the same order, from left to right, in both strings.

Example 1

string_1 = 'hcarry'
string_2 = 'sallyc'

Output - 'ay'

Example 2

string_1 = 'jenny'
string_2 = 'ydjeu'

Output - 'je'

Explanation for Example 1 -

The common characters between string_1 and string_2 are c, a, y. But since c comes before ay in string_1 and after ay in string_2, we won't consider character c in the output. The order of the common characters must be maintained and must be the same in both strings.

Explanation for Example 2 -

The common characters between string_1 and string_2 are j, e, y. But since y comes before je in string_2 and after je in string_1, we won't consider character y in the output. The order of the common characters must be maintained and must be the same in both strings.
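The ordering rule in the two explanations can be checked mechanically: a valid output must be a subsequence of both strings. The helper below is only an illustrative sketch; the name is_subsequence is an assumption and does not come from the question.

```python
def is_subsequence(candidate, s):
    # Hypothetical helper: True if the characters of candidate appear
    # in s in the same left-to-right order (not necessarily adjacent).
    remaining = iter(s)
    # "ch in remaining" advances the iterator, so each character must be
    # found after the position where the previous one was found.
    return all(ch in remaining for ch in candidate)


print(is_subsequence("ay", "hcarry"), is_subsequence("ay", "sallyc"))
print(is_subsequence("cay", "hcarry"), is_subsequence("cay", "sallyc"))
```

In Example 1, 'ay' passes the check for both strings, while 'cay' passes only for string_1, which is why c is dropped.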

My approach -

  1. Find the common characters between the strings and then, for each individual string, store them in their original order in another variable.

Example - 

string_1 = 'hcarry'
string_2 = 'sallyc'

Common_characters = c,a,y

string_1_com = cay
string_2_com = ayc

I used sorted, Counter and enumerate to get string_1_com and string_2_com in Python.

  2. Now find the longest common subsequence of string_1_com and string_2_com. That subsequence is the output.

This is the brute force solution.
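Step 1 can be sketched roughly as follows. This is not the asker's actual code (which used sorted, Counter and enumerate and is not shown); keep_common is a hypothetical name, and the sketch assumes every occurrence of a shared character should be kept.

```python
def keep_common(s, other):
    # Hypothetical helper: characters present in both strings.
    common = set(s) & set(other)
    # Keep only those characters, preserving their order in s.
    return "".join(ch for ch in s if ch in common)


string_1_com = keep_common("hcarry", "sallyc")
string_2_com = keep_common("sallyc", "hcarry")
print(string_1_com, string_2_com)
```

For the first example this gives cay and ayc, the same intermediate strings as listed above.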

What is the optimal solution for this?

The algorithm for this is just called string matching in my book; it is the classic dynamic programming solution to the longest common subsequence problem. It runs in O(mn), where m and n are the word lengths. It might as well run on the full words; which is more efficient depends on the expected number of common letters and on how the sorting and filtering is performed. I will explain it for the common-letter strings, as that is easier.

The idea is to look at a directed acyclic graph of (m+1) * (n+1) nodes. Each path through this graph, from the upper-left to the lower-right node, represents a unique way of matching the words. We want to match the strings, inserting blanks (-) into the words so that they align with the highest number of common letters. For example, the end state of cay and ayc would be

cay-
-ayc

Each node stores the highest number of matches for the partial matching which it represents, and at the end of the algorithm the end node will give us the highest number of matches.
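To make the idea of an alignment with blanks concrete, here is a rough sketch that produces one such end state. It jumps ahead slightly: it fills the score table described in the rest of this answer and then walks back from the end node. The function name align and the tie-breaking rule are assumptions made for illustration only.

```python
def align(a, b):
    # T[i][j] = highest number of matches between a[:i] and b[:j].
    T = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                T[i][j] = T[i - 1][j - 1] + 1
            else:
                T[i][j] = max(T[i - 1][j], T[i][j - 1])

    # Walk back from the end node, emitting one aligned column per step.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif j == 0 or (i > 0 and T[i - 1][j] >= T[i][j - 1]):
            top.append(a[i - 1]); bottom.append("-")  # blank inserted in b
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])  # blank inserted in a
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


for line in align("cay", "ayc"):
    print(line)
```

For cay and ayc this prints the cay- / -ayc alignment shown above. Other inputs may have several equally good alignments, in which case the sketch returns just one of them.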

We start at the upper left corner where nothing is matched with nothing and so we have 0 matching letters here (score 0).

    c a y
  0 . . .
a . . . .
y . . . .
c . . . .

We are to walk through this graph and for each node calculate the highest number of matching letters, by using the data from previous nodes.

The nodes are connected left->right, up->down and diagonally left-up->right-down.

  • Moving right represents consuming one letter from cay and matching the letter we arrive at with a - inserted in ayc.
  • Moving down represents the opposite (consuming from ayc and inserting - into cay).
  • Moving diagonally represents consuming one letter from each word and matching those.

Looking at the first node to the right of our starting node it represents the matching

c
-

and this node can (obviously) only be reached from the starting node.

All nodes in the first row and first column will be 0, since they all represent matching one or more letters with an equal number of -.

We get the graph

    c a y
  0 0 0 0
a 0 . . .
y 0 . . .
c 0 . . .
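The setup so far could be written down like this. It is a minimal sketch with made-up helper names (empty_grid, show); None stands for a node that has not been evaluated yet and is printed as a dot.

```python
def empty_grid(top_word, side_word):
    # First row and first column are 0; every other node is unevaluated.
    return [[0 if r == 0 or c == 0 else None
             for c in range(len(top_word) + 1)]
            for r in range(len(side_word) + 1)]


def show(grid, top_word, side_word):
    # Render the grid in the same layout as the diagrams in this answer.
    lines = ["    " + " ".join(top_word)]
    for label, row in zip(" " + side_word, grid):
        cells = " ".join("." if v is None else str(v) for v in row)
        lines.append(f"{label} {cells}")
    return "\n".join(lines)


print(show(empty_grid("cay", "ayc"), "cay", "ayc"))
```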

That was the setup, now the interesting part begins.

Looking at the first unevaluated node, which represents matching the substrings c with a, we want to decide how we can get there with the largest number of matching letters.

  • Alternative 1: We can get there from the node to the left. The node to the left represents the matching

-
a

so by choosing this path to get to our current node we arrive at

-c
a-

Matching c with - gives us no correct match, and thus the score for this path is 0 (taken from the last node) plus 0 (the score for the match c/- just made). So 0 + 0 = 0 for this path.

  • Alternative 2: We can get to this node from above. This path represents moving from

c   ->    c-
-         -a

which also gives us 0 extra points. The score for this path is 0.

  • Alternative 3: We can get to this node from the upper left. This is moving from the starting node (nothing at all) to consuming one character from each word. That is matching

c
a

Since c and a are different letters, we get 0 + 0 = 0 for this path as well.

    c a y
  0 0 0 0
a 0 0 . .
y 0 . . .
c 0 . . .

But for the next node it looks better. We still have the three alternatives to look at. Alternatives 1 and 2 always give us 0 extra points, as they always represent matching a letter with -, so those paths give us score 0. Let's move on to alternative 3.

For our current node moving diagonally means going from

c   ->   ca
-        -a

IT'S A MATCH!

That means there is a path to this node that gives us 1 in score. We throw away the 0s and save the 1.

    c a y
  0 0 0 0
a 0 0 1 .
y 0 . . .
c 0 . . .
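The three alternatives boil down to one small rule per node. As a sketch (cell_score is a hypothetical name, and the arguments are the scores of the left, upper and upper-left neighbours):

```python
def cell_score(left, up, diag, letter_1, letter_2):
    # Alternatives 1 and 2 carry the neighbour's score over unchanged;
    # alternative 3 adds one point when the two letters match.
    bonus = 1 if letter_1 == letter_2 else 0
    return max(left, up, diag + bonus)


print(cell_score(0, 0, 0, "c", "a"))  # first node evaluated above
print(cell_score(0, 0, 0, "a", "a"))  # the node where a matches a
```

The first call corresponds to the node where all three alternatives scored 0, the second to the node where the diagonal move found the match.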

For the last node on this row we look at our three alternatives and realize we won't get any new points (new matches), but we can get to the node by using our previous 1 point path:

ca   ->   cay
-a        -a-

So this node is also 1 in score.

Doing this for all nodes we get the following complete graph

    c a y
  0 0 0 0
a 0 0 1 1
y 0 0 1 2
c 0 1 1 2

where the only increases in score come from

c   ->   ca   |   ca   ->   cay   |   --   ->   --c
-        -a   |   -a        -ay   |   ay        ayc
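Filling in every node with the same three alternatives can be sketched as below. The name score_table is an assumption; top_word labels the columns and side_word the rows, as in the diagrams.

```python
def score_table(top_word, side_word):
    rows, cols = len(side_word) + 1, len(top_word) + 1
    # First row and first column stay 0.
    table = [[0] * cols for _ in range(rows)]
    for r in range(1, rows):
        for c in range(1, cols):
            bonus = 1 if top_word[c - 1] == side_word[r - 1] else 0
            table[r][c] = max(table[r][c - 1],              # from the left
                              table[r - 1][c],              # from above
                              table[r - 1][c - 1] + bonus)  # diagonally
    return table


for row in score_table("cay", "ayc"):
    print(row)
```

For cay and ayc the rows printed are the same numbers as in the complete graph above, with 2 in the end node.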

And so the end node tells us the maximal match is 2 letters. Since in your case you want to know the longest path itself (the one with score 2), you need to track, for each node, the path taken as well.

This graph is easily implemented as a matrix (or an array of arrays).

I would suggest using a tuple as each element, with one score element and one path element; in the path element you just store the aligning letters. The elements of the final matrix will then be

    c      a        y
  0 0      0        0
a 0 0      (1, a)   (1, a)
y 0 0      (1, a)   (2, ay)
c 0 (1, c) (1, a/c) (2, ay)

In one place I noted a/c; this is because the strings ca and ayc have two different subsequences of maximum length. You need to decide what to do in those cases: either just go with one or save both.
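If you decide to save both, one possible sketch is to let the path element be a set of strings instead of a single string. The function name all_longest_common is hypothetical. Note that the number of tied subsequences can grow very quickly for longer inputs, so this is only reasonable for short strings.

```python
def all_longest_common(string_1, string_2):
    rows, cols = len(string_2) + 1, len(string_1) + 1
    # Each element is (score, set of paths); the empty path has score 0.
    m = [[(0, {""}) for _ in range(cols)] for _ in range(rows)]
    for r in range(1, rows):
        for c in range(1, cols):
            candidates = [m[r][c - 1], m[r - 1][c]]  # from left, from above
            if string_1[c - 1] == string_2[r - 1]:
                score, paths = m[r - 1][c - 1]
                letter = string_1[c - 1]
                candidates.append((score + 1, {p + letter for p in paths}))
            best = max(score for score, _ in candidates)
            merged = set()
            for score, paths in candidates:
                if score == best:
                    merged |= paths  # keep every path that reaches the best score
            m[r][c] = (best, merged)
    return m[-1][-1][1]


print(sorted(all_longest_common("ca", "ayc")))
print(sorted(all_longest_common("cay", "ayc")))
```

For ca and ayc this keeps both a and c, matching the a/c element in the matrix above, while the full strings cay and ayc still give only ay.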

EDIT:

Here's an implementation for this solution.

def longest_common(string_1, string_2):
    len_1 = len(string_1)
    len_2 = len(string_2)
    
    m = [[(0,"") for _ in range(len_1 + 1)] for _ in range(len_2 + 1)] # initialize matrix
    
    for row in range(1, len_2+1):
        for col in range(1, len_1+1):
            diag = 0
            match = ""
            if string_1[col-1] == string_2[row-1]: # score increase with one if letters match in diagonal move
                diag = 1
                match = string_1[col - 1]
            # find best alternative
            if m[row][col-1][0] >= m[row-1][col][0] and m[row][col-1][0] >= m[row-1][col-1][0]+diag:
                m[row][col] = m[row][col-1] # path from left is best
            elif m[row-1][col][0] >= m[row-1][col-1][0]+diag:
                m[row][col] = m[row-1][col] # path from above is best
            else:
                m[row][col] = (m[row-1][col-1][0]+diag, m[row-1][col-1][1]+match) # path diagonally is best

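    return m[len_2][len_1][1]

>>> print(longest_common("hcarry", "sallyc"))
ay
>>> print(longest_common("cay", "ayc"))
ay

For the second call, the matrix m built inside the function (it is a local variable, so print it just before the return statement to inspect it) is:
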
[[(0, ''), (0, ''), (0, ''), (0, '')],
 [(0, ''), (0, ''), (1, 'a'), (1, 'a')],
 [(0, ''), (0, ''), (1, 'a'), (2, 'ay')],
 [(0, ''), (1, 'c'), (1, 'c'), (2, 'ay')]]

Here is a simple, dynamic programming based implementation for the problem:

def lcs(X, Y):
    m, n = len(X), len(Y)

    # Using a 2D matrix for dynamic programming:
    # L[i][j] stores the length of the longest common subsequence
    # of X[0:i] and Y[0:j]. Row 0 and column 0 stay 0.
    L = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i-1] == Y[j-1]:
                L[i][j] = L[i-1][j-1] + 1
            else:
                L[i][j] = max(L[i-1][j], L[i][j-1])

    # Start from the bottom-right corner and collect the characters
    # of the common subsequence one by one, in reverse order
    chars = []
    i, j = m, n
    while i > 0 and j > 0:
        # If the current characters in X and Y are the same,
        # the character is part of the LCS
        if X[i-1] == Y[j-1]:
            chars.append(X[i-1])
            i -= 1
            j -= 1
        # If not, go in the direction of the larger value
        elif L[i-1][j] > L[i][j-1]:
            i -= 1
        else:
            j -= 1

    # Return the result instead of printing it, so callers can reuse it
    return "".join(reversed(chars))

But you already know the term "longest common subsequence", and you can find numerous descriptions of the dynamic programming algorithm for it, for example in the Wikipedia article on the longest common subsequence problem.

Pseudocode:

function LCSLength(X[1..m], Y[1..n])
    C = array(0..m, 0..n)
    for i := 0..m
        C[i,0] := 0
    for j := 0..n
        C[0,j] := 0
    for i := 1..m
        for j := 1..n
            if X[i] = Y[j]  // use X[i-1] and Y[j-1] if X and Y are indexed from zero
                C[i,j] := C[i-1,j-1] + 1
            else
                C[i,j] := max(C[i,j-1], C[i-1,j])
    return C[m,n]

function backtrack(C[0..m,0..n], X[1..m], Y[1..n], i, j)
    if i = 0 or j = 0
        return ""
    if  X[i] = Y[j]
        return backtrack(C, X, Y, i-1, j-1) + X[i]
    if C[i,j-1] > C[i-1,j]
        return backtrack(C, X, Y, i, j-1)
    return backtrack(C, X, Y, i-1, j)
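A fairly direct Python rendering of the pseudocode might look like this. The function names are my own, and indices are shifted by one because Python strings start at zero. The recursive backtrack is fine for short inputs such as the examples here, but it could hit Python's recursion limit on very long strings.

```python
def lcs_table(x, y):
    m, n = len(x), len(y)
    # c[i][j] = LCS length of x[:i] and y[:j]; row 0 and column 0 are 0.
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i][j - 1], c[i - 1][j])
    return c


def backtrack(c, x, y, i, j):
    # Rebuild one longest common subsequence from the filled table.
    if i == 0 or j == 0:
        return ""
    if x[i - 1] == y[j - 1]:
        return backtrack(c, x, y, i - 1, j - 1) + x[i - 1]
    if c[i][j - 1] > c[i - 1][j]:
        return backtrack(c, x, y, i, j - 1)
    return backtrack(c, x, y, i - 1, j)


def lcs_from_pseudocode(x, y):
    return backtrack(lcs_table(x, y), x, y, len(x), len(y))


print(lcs_from_pseudocode("hcarry", "sallyc"))
print(lcs_from_pseudocode("jenny", "ydjeu"))
```

The two calls print ay and je, the expected outputs from the question.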

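Much easier solution:

def f(s, s1):
    # Keep only the characters common to both strings, in their original order
    cc = set(s) & set(s1)
    ns = ''.join(ch for ch in s if ch in cc)
    ns1 = ''.join(ch for ch in s1 if ch in cc)
    # Collect each adjacent pair of ns that also occurs contiguously in ns1.
    # Note: this only finds two-character matches and returns a list of them,
    # so it is not a general longest common subsequence solution.
    found = []
    for b, e in zip(ns, ns[1:]):  # also safe when ns is empty
        cs = b + e
        if cs in ns1:
            found.append(cs)
    return found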
