Given two strings, find the characters common to both strings that appear in the same order from left to right.
Example 1
string_1 = 'hcarry'
string_2 = 'sallyc'
Output - 'ay'
Example 2
string_1 = 'jenny'
string_2 = 'ydjeu'
Output - 'je'
Explanation for Example 1 -
Common characters between string_1 and string_2 are c, a, y. But since c comes before ay in string_1 and after ay in string_2, we won't consider character c in the output. The order of the common characters must be maintained and must be the same in both strings.
Explanation for Example 2 -
Common characters between string_1 and string_2 are j, e, y. But since y comes before je in string_2 and after je in string_1, we won't consider character y in the output. The order of the common characters must be maintained and must be the same in both strings.
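The ordering rule in both explanations can be checked with a small helper. This is only a sketch: the name is_subsequence is hypothetical and does not come from the question.

```python
def is_subsequence(candidate, text):
    """Return True if the characters of candidate occur in text in the same left-to-right order."""
    remaining = iter(text)
    # "ch in remaining" consumes the iterator, so every match must come after the previous one
    return all(ch in remaining for ch in candidate)


# 'ay' keeps its order in both strings, 'cay' only in string_1
print(is_subsequence("ay", "hcarry"), is_subsequence("ay", "sallyc"))    # True True
print(is_subsequence("cay", "hcarry"), is_subsequence("cay", "sallyc"))  # True False
```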
My approach -
Example -
string_1 = 'hcarry'
string_2 = 'sallyc'
Common_characters = c,a,y
string_1_com = cay
string_2_com = ayc
I used the sorted, Counter and enumerate functions to get string_1_com and string_2_com in Python.
Then find the longest common subsequence of string_1_com and string_2_com. You get the output as the result. This is the brute force solution.
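The filtering step can be sketched as follows. This is not the code from the question (which used sorted, Counter and enumerate); it is a simplified version based on a set intersection, and the name keep_common is made up for illustration.

```python
def keep_common(string_1, string_2):
    """Reduce both strings to the characters they share, preserving each string's own order."""
    common = set(string_1) & set(string_2)
    string_1_com = "".join(ch for ch in string_1 if ch in common)
    string_2_com = "".join(ch for ch in string_2 if ch in common)
    return string_1_com, string_2_com


print(keep_common("hcarry", "sallyc"))  # ('cay', 'ayc')
print(keep_common("jenny", "ydjeu"))    # ('jey', 'yje')
```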
What is the optimal solution for this?
The algorithm for this is just called string matching in my book. It runs in O(mn), where m and n are the word lengths. It might as well run on the full words; which is most efficient depends on the expected number of common letters and on how the sorting and filtering are performed. I will explain it for the common-letter strings, as that's easier.
The idea is that you look at a directed acyclic graph of (m+1) * (n+1) nodes. Each path (from upper left to lower right) through this graph represents a unique way of matching the words. We want to match the strings, and additionally put in blanks (-) in the words so that they align with the highest number of common letters. For example, the end state of cay and ayc would be
cay-
-ayc
Each node stores the highest number of matches for the partial matching which it represents, and at the end of the algorithm the end node will give us the highest number of matches.
We start at the upper left corner where nothing is matched with nothing and so we have 0 matching letters here (score 0).
    c a y
  0 . . .
a . . . .
y . . . .
c . . . .
We are to walk through this graph and for each node calculate the highest number of matching letters, by using the data from previous nodes.
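The nodes are connected left->right, up->down and diagonally left-up->right-down.
Moving left->right means consuming one letter from cay and matching the letter we arrive at with a - inserted in ayc.
Moving up->down means the opposite (consuming one letter from ayc and inserting - to cay).
Moving diagonally means consuming one letter from each word and matching those two letters with each other.
Looking at the first node to the right of our starting node, it represents the matching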
c
-
and this node can (obviously) only be reached from the starting node.
All nodes in the first row and first column will be 0, since they all represent matching one or more letters with an equal number of -.
We get the graph
    c a y
  0 0 0 0
a 0 . . .
y 0 . . .
c 0 . . .
That was the setup, now the interesting part begins.
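Looking at the first unevaluated node, which represents matching the substrings c with a, we want to decide how we can get there with the highest number of matching letters.
Alternative 1: we can get there from the left. The node to the left represents the matching
-
a
so by choosing this path to get to our current node we arrive at
-c
a-
Matching c with - gives us no correct match, and thus the score for this path is 0 (taken from the last node) plus 0 (score for the match c/- just made). So 0 + 0 = 0 for this path.
Alternative 2: we can get there from above. This path represents moving from
c -> c-
-    -a
which also gives us 0 extra points. Score for this is 0.
Alternative 3: we can get there diagonally from the upper left, which here is the starting node. This path represents the matching
c
a
Since c and a are different letters, we get 0 + 0 = 0 for this path as well.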
    c a y
  0 0 0 0
a 0 0 . .
y 0 . . .
c 0 . . .
But for the next node it looks better. We still have the three alternatives to look at. Alternatives 1 & 2 always give us 0 extra points, as they always represent matching a letter with -, so those paths will give us score 0. Let's move on to alternative 3.
For our current node, moving diagonally means going from
c -> ca
-    -a
IT'S A MATCH!
That means there is a path to this node that gives us 1 in score. We throw away the 0s and save the 1.
    c a y
  0 0 0 0
a 0 0 1 .
y 0 . . .
c 0 . . .
For the last node on this row we look at our three alternatives and realize we won't get any new points (new matches), but we can get to the node by using our previous 1 point path:
ca -> cay
-a    -a-
So this node is also 1 in score.
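The evaluation of a single node from its three alternatives can be sketched like this. The function name node_score and its parameter names are hypothetical, chosen only for this illustration.

```python
def node_score(left, above, upper_left, letter_top, letter_side):
    """Best score for one node, given the scores of its three predecessor nodes."""
    # Alternatives 1 and 2 match a letter with '-', so they never add a point.
    # Alternative 3 (the diagonal) adds one point only if the two letters are equal.
    gain = 1 if letter_top == letter_side else 0
    return max(left, above, upper_left + gain)


print(node_score(0, 0, 0, "c", "a"))  # 0: c and a differ
print(node_score(0, 0, 0, "a", "a"))  # 1: diagonal match
print(node_score(1, 0, 0, "y", "a"))  # 1: carried over from the node to the left
```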
Doing this for all nodes we get the following complete graph
    c a y
  0 0 0 0
a 0 0 1 1
y 0 0 1 2
c 0 1 1 2
where the only increases in score come from
c -> ca | ca -> cay | -  -> -c
-    -a | -a    -ay | y     yc
And so the end node tells us the maximal match is 2 letters. Since in your case you wish to know the longest path with score 2, you need to track, for each node, the path taken as well.
This graph is easily implemented as a matrix (or an array of arrays).
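As a rough sketch, the score-only version of the grid could be filled in like this. The function name score_grid is hypothetical; columns follow the first word and rows follow the second, as in the diagrams.

```python
def score_grid(top, side):
    """Fill the (len(side)+1) x (len(top)+1) grid with the best match count for each node."""
    grid = [[0] * (len(top) + 1) for _ in range(len(side) + 1)]  # first row and column stay 0
    for row in range(1, len(side) + 1):
        for col in range(1, len(top) + 1):
            gain = 1 if top[col - 1] == side[row - 1] else 0
            grid[row][col] = max(
                grid[row][col - 1],             # from the left
                grid[row - 1][col],             # from above
                grid[row - 1][col - 1] + gain,  # diagonally
            )
    return grid


for line in score_grid("cay", "ayc"):
    print(line)
# [0, 0, 0, 0]
# [0, 0, 1, 1]
# [0, 0, 1, 2]
# [0, 1, 1, 2]
```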
I would suggest that you use a tuple as elements, with one score element and one path element, and in the path element you just store the aligning letters. Then the elements of the final matrix will be
    c       a         y
  0 0       0         0
a 0 0       (1, a)    (1, a)
y 0 0       (1, a)    (2, ay)
c 0 (1, c)  (1, a/c)  (2, ay)
At one place I noted a/c; this is because the strings ca and ayc have two different subsequences of maximum length. You need to decide what to do in those cases: either just go with one or save both.
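If you choose to save both, one possible sketch is to store a set of paths in each node instead of a single string. The function name all_longest_common is an assumption made for this example, and keeping every path can get expensive when there are many ties.

```python
def all_longest_common(string_1, string_2):
    """Return every distinct longest common subsequence of the two strings."""
    rows, cols = len(string_2), len(string_1)
    # each node holds (score, set of paths that reach that score)
    m = [[(0, {""}) for _ in range(cols + 1)] for _ in range(rows + 1)]
    for row in range(1, rows + 1):
        for col in range(1, cols + 1):
            candidates = [m[row][col - 1], m[row - 1][col]]  # from the left, from above
            if string_1[col - 1] == string_2[row - 1]:
                score, paths = m[row - 1][col - 1]
                letter = string_1[col - 1]
                candidates.append((score + 1, {p + letter for p in paths}))  # diagonal match
            best = max(score for score, _ in candidates)
            merged = set()
            for score, paths in candidates:
                if score == best:
                    merged |= paths  # keep the paths of every tied alternative
            m[row][col] = (best, merged)
    return m[rows][cols][1]


print(sorted(all_longest_common("ca", "ayc")))   # ['a', 'c']
print(sorted(all_longest_common("cay", "ayc")))  # ['ay']
```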
EDIT:
Here's an implementation for this solution.
def longest_common(string_1, string_2):
    len_1 = len(string_1)
    len_2 = len(string_2)
    m = [[(0,"") for _ in range(len_1 + 1)] for _ in range(len_2 + 1)] # initialize matrix
    for row in range(1, len_2+1):
        for col in range(1, len_1+1):
            diag = 0
            match = ""
            if string_1[col-1] == string_2[row-1]: # score increases by one if letters match in diagonal move
                diag = 1
                match = string_1[col - 1]
            # find best alternative
            if m[row][col-1][0] >= m[row-1][col][0] and m[row][col-1][0] >= m[row-1][col-1][0]+diag:
                m[row][col] = m[row][col-1] # path from left is best
            elif m[row-1][col][0] >= m[row-1][col-1][0]+diag:
                m[row][col] = m[row-1][col] # path from above is best
            else:
                m[row][col] = (m[row-1][col-1][0]+diag, m[row-1][col-1][1]+match) # path diagonally is best
    return m[len_2][len_1][1]
>>> print(longest_common("hcarry", "sallyc"))
ay
>>> print(longest_common("cay", "ayc"))
ay
>>> m  # the matrix as seen inside the function for the second call
[[(0, ''), (0, ''), (0, ''), (0, '')],
 [(0, ''), (0, ''), (1, 'a'), (1, 'a')],
 [(0, ''), (0, ''), (1, 'a'), (2, 'ay')],
 [(0, ''), (1, 'c'), (1, 'c'), (2, 'ay')]]
Here is a simple, dynamic programming based implementation for the problem:
def lcs(X, Y):
    m, n = len(X), len(Y)
    # using a 2D matrix for dynamic programming
    # L[i][j] stores the length of the longest common subsequence of X[0:i] and Y[0:j]
    L = [[0 for _ in range(n+1)] for _ in range(m+1)]
    for i in range(m+1):
        for j in range(n+1):
            if i == 0 or j == 0:
                L[i][j] = 0
            elif X[i-1] == Y[j-1]:
                L[i][j] = L[i-1][j-1] + 1
            else:
                L[i][j] = max(L[i-1][j], L[i][j-1])
    # Following code is used to find the common string
    index = L[m][n]
    # Create a character array to store the lcs string
    lcs_chars = [""] * (index+1)
    lcs_chars[index] = ""
    # Start from the right-most-bottom-most corner and
    # one by one store characters in lcs_chars[]
    i = m
    j = n
    while i > 0 and j > 0:
        # If current characters in X and Y are the same, then
        # the current character is part of the LCS
        if X[i-1] == Y[j-1]:
            lcs_chars[index-1] = X[i-1]
            i -= 1
            j -= 1
            index -= 1
        # If not the same, then find the larger of the two and
        # go in the direction of the larger value
        elif L[i-1][j] > L[i][j-1]:
            i -= 1
        else:
            j -= 1
    print("".join(lcs_chars))
But you already know the term "longest common subsequence" and can find numerous descriptions of the dynamic programming algorithm.
Wiki link
pseudocode
function LCSLength(X[1..m], Y[1..n])
    C = array(0..m, 0..n)
    for i := 0..m
        C[i,0] = 0
    for j := 0..n
        C[0,j] = 0
    for i := 1..m
        for j := 1..n
            if X[i] = Y[j] //i-1 and j-1 if reading X & Y from zero
                C[i,j] := C[i-1,j-1] + 1
            else
                C[i,j] := max(C[i,j-1], C[i-1,j])
    return C[m,n]

function backtrack(C[0..m,0..n], X[1..m], Y[1..n], i, j)
    if i = 0 or j = 0
        return ""
    if X[i] = Y[j]
        return backtrack(C, X, Y, i-1, j-1) + X[i]
    if C[i,j-1] > C[i-1,j]
        return backtrack(C, X, Y, i, j-1)
    return backtrack(C, X, Y, i-1, j)
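A direct Python translation of the pseudocode might look like the sketch below. It differs from the pseudocode in two ways that are my own choices: strings are indexed from zero (hence X[i - 1] and Y[j - 1]), and the length function, renamed lcs_length_table here, returns the whole table so that backtrack can use it.

```python
def lcs_length_table(X, Y):
    """Build the table C where C[i][j] is the LCS length of X[:i] and Y[:j]."""
    m, n = len(X), len(Y)
    C = [[0] * (n + 1) for _ in range(m + 1)]  # row 0 and column 0 stay 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                C[i][j] = C[i - 1][j - 1] + 1
            else:
                C[i][j] = max(C[i][j - 1], C[i - 1][j])
    return C


def backtrack(C, X, Y, i, j):
    """Read one longest common subsequence of X[:i] and Y[:j] back out of the table."""
    if i == 0 or j == 0:
        return ""
    if X[i - 1] == Y[j - 1]:
        return backtrack(C, X, Y, i - 1, j - 1) + X[i - 1]
    if C[i][j - 1] > C[i - 1][j]:
        return backtrack(C, X, Y, i, j - 1)
    return backtrack(C, X, Y, i - 1, j)


X, Y = "hcarry", "sallyc"
C = lcs_length_table(X, Y)
print(C[len(X)][len(Y)], backtrack(C, X, Y, len(X), len(Y)))  # 2 ay
```

The recursion depth of backtrack grows with the combined length of the inputs, so for very long strings an iterative loop would be the safer choice.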
Much easier solution ----- Thank you!
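def f(s, s1):
    cc = set(s) & set(s1)  # characters common to both strings
    ns = ''.join([S for S in s if S in cc])    # s reduced to the common characters
    ns1 = ''.join([S for S in s1 if S in cc])  # s1 reduced to the common characters
    found = []
    if not ns:  # no common characters, nothing to compare
        return found
    # Note: this only collects pairs of neighbouring characters of ns
    # that also appear side by side in ns1, not longer subsequences.
    b = ns[0]
    for e in ns[1:]:
        cs = b+e
        if cs in ns1:
            found.append(cs)
        b = e
    return found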