Find the common ordered characters between two strings

Given two strings, find the common characters between the two strings that appear in the same order from left to right.

Example 1

string_1 = 'hcarry'
string_2 = 'sallyc'

Output - 'ay'

Example 2

string_1 = 'jenny'
string_2 = 'ydjeu'

Output - 'je'

Explanation for Example 1 -

The common characters between string_1 and string_2 are c, a, y. But since c comes before ay in string_1 and after ay in string_2, we won't consider character c in the output. The order of the common characters must be maintained and must be the same in both strings.

Explanation for Example 2 -

The common characters between string_1 and string_2 are j, e, y. But since y comes before je in string_2 and after je in string_1, we won't consider character y in the output. The order of the common characters must be maintained and must be the same in both strings.

My approach -

  1. Find the common characters between the strings and, for each string, store them in a separate variable in the order they appear in that string.

Example - 

string_1 = 'hcarry'
string_2 = 'sallyc'

Common_characters = c,a,y

string_1_com = cay
string_2_com = ayc

I used the sorted, counter and enumerate functions to get string_1_com and string_2_com in Python.

  2. Now find the longest common sub-sequence between string_1_com and string_2_com. That gives the output.

This is the brute force solution.
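The two steps above can be sketched roughly as follows. This is only an illustration of the approach described, not the original code: the helper names keep_common and longest_common_subsequence are made up for this sketch, and the subsequence step uses a memoized recursion rather than whatever was actually written.

```python
from functools import lru_cache


def keep_common(s, other):
    """Step 1: keep only the characters of s that also occur in other (order preserved)."""
    common = set(s) & set(other)
    return "".join(ch for ch in s if ch in common)


def longest_common_subsequence(a, b):
    """Step 2: one longest common subsequence of a and b, via memoized recursion."""
    @lru_cache(maxsize=None)
    def go(i, j):
        if i == len(a) or j == len(b):
            return ""
        if a[i] == b[j]:
            return a[i] + go(i + 1, j + 1)
        skip_a, skip_b = go(i + 1, j), go(i, j + 1)
        return skip_a if len(skip_a) >= len(skip_b) else skip_b

    return go(0, 0)


string_1_com = keep_common("hcarry", "sallyc")  # 'cay'
string_2_com = keep_common("sallyc", "hcarry")  # 'ayc'
print(longest_common_subsequence(string_1_com, string_2_com))  # ay
```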

What is the optimal solution for this?

In my book, the algorithm for this is simply called string matching. It runs in O(mn), where m and n are the word lengths. It might as well run on the full words; which is most efficient depends on the expected number of common letters and on how the sorting and filtering are performed. I will explain it for the common-letter strings, as that's easier.

The idea is to look at a directed acyclic graph of (m+1) * (n+1) nodes. Each path (from upper left to lower right) through this graph represents a unique way of matching the words. We want to match the strings, inserting blanks ( - ) into the words so that they align with the highest number of common letters. For example, the end state for cay and ayc would be

cay-
-ayc

Each node stores the highest number of matches for the partial matching it represents, and at the end of the algorithm the end node gives us the highest number of matches overall.

We start at the upper left corner, where nothing is matched with nothing, so we have 0 matching letters here (score 0).

    c a y
  0 . . .
a . . . .
y . . . .
c . . . .

We walk through this graph and, for each node, calculate the highest number of matching letters using the data from previous nodes.

The nodes are connected left->right, up->down and diagonally upper-left->lower-right.

  • Moving right represents consuming one letter from cay and matching it with a - inserted in ayc.
  • Moving down represents the opposite (consuming one letter from ayc and matching it with a - inserted in cay).
  • Moving diagonally represents consuming one letter from each word and matching those two letters.
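To make the three moves concrete, here is a small sketch that turns a path into the aligned pair of strings. The function name align and the move letters "R" (right), "D" (down) and "X" (diagonal) are conventions invented for this example only.

```python
def align(top, left, moves):
    """Build the two aligned rows for a path through the graph.

    top   -- the word written along the top of the grid (e.g. "cay")
    left  -- the word written down the left side (e.g. "ayc")
    moves -- a string of "R" (right), "D" (down) and "X" (diagonal) steps
    """
    top_row, left_row = [], []
    i = j = 0  # letters consumed so far from top and left
    for move in moves:
        if move == "R":    # consume from top, insert a blank in left
            top_row.append(top[i])
            left_row.append("-")
            i += 1
        elif move == "D":  # consume from left, insert a blank in top
            top_row.append("-")
            left_row.append(left[j])
            j += 1
        else:              # "X": consume one letter from each word
            top_row.append(top[i])
            left_row.append(left[j])
            i += 1
            j += 1
    return "".join(top_row), "".join(left_row)


print(align("cay", "ayc", "RXXD"))  # ('cay-', '-ayc')
```

The path right, diagonal, diagonal, down reproduces the end state shown earlier for cay and ayc.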

The first node to the right of our starting node represents the matching

c
-

and this node can (obviously) only be reached from the starting node.

All nodes in the first row and first column will be 0, since they all represent matching one or more letters with an equal number of - .

We get the graph

    c a y
  0 0 0 0
a 0 . . .
y 0 . . .
c 0 . . .
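As a sketch of this setup step, the grid can be held as a list of lists with the first row and column preset to 0. The names empty_grid and show are hypothetical helpers for this illustration, and None stands in for a node that has not been evaluated yet.

```python
def empty_grid(top, left):
    """Grid of (len(left)+1) x (len(top)+1) nodes; first row/column are 0, the rest unevaluated."""
    return [
        [0 if r == 0 or c == 0 else None for c in range(len(top) + 1)]
        for r in range(len(left) + 1)
    ]


def show(grid, top, left):
    """Render the grid in the same layout as the diagrams, with '.' for unevaluated nodes."""
    lines = ["    " + " ".join(top)]
    for label, row in zip(" " + left, grid):
        cells = " ".join("." if value is None else str(value) for value in row)
        lines.append(f"{label} {cells}")
    return "\n".join(lines)


print(show(empty_grid("cay", "ayc"), "cay", "ayc"))
```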

That was the setup; now the interesting part begins.

Looking at the first unevaluated node, which represents matching the substrings c and a, we want to decide how to get there with the highest number of matching letters.

  • Alternative 1: We can get there from the node to the left. The node to the left represents the matching

-
a

so by choosing this path to get to our current node we arrive at

-c
a-

Matching c with - gives us no correct match, so the score for this path is 0 (taken from the previous node) plus 0 (the score for the c/- match just made). So 0 + 0 = 0 for this path.

  • Alternative 2: We can get to this node from above. This path represents moving from

c   ->    c-
-         -a

which also gives us 0 extra points. The score for this path is 0.

  • Alternative 3: We can get to this node from the upper left. This is moving from the starting node (nothing at all) to consuming one character from each word. That is matching

c
a

Since c and a are different letters, we get 0 + 0 = 0 for this path as well.

    c a y
  0 0 0 0
a 0 0 . .
y 0 . . .
c 0 . . .

But the next node looks better. We still have the three alternatives to look at. Alternatives 1 and 2 always give us 0 extra points, as they always represent matching a letter with - , so here those paths give us score 0. Let's move on to alternative 3.

For our current node, moving diagonally means going from

c   ->   ca
-        -a

IT'S A MATCH!

That means there is a path to this node that gives us a score of 1. We throw away the 0s and save the 1.

    c a y
  0 0 0 0
a 0 0 1 .
y 0 . . .
c 0 . . .

For the last node on this row we look at our three alternatives and realize we won't get any new points (new matches), but we can get to the node by using our previous 1-point path:

ca   ->   cay
-a        -a-

So this node also has score 1.

Doing this for all nodes, we get the following complete graph

    c a y
  0 0 0 0
a 0 0 1 1
y 0 0 1 2
c 0 1 1 2

where the only increases in score come from

c   ->   ca   |   ca   ->   cay   |   --   ->   --c
-        -a   |   -a        -ay   |   ay        ayc
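The node-by-node evaluation described above can be sketched as a short loop. The function name score_grid is made up for this example; each node simply takes the best of the three alternatives (from the left, from above, or diagonally plus one if the letters match).

```python
def score_grid(top, left):
    """Fill in the highest number of matching letters for every node of the graph."""
    rows, cols = len(left) + 1, len(top) + 1
    grid = [[0] * cols for _ in range(rows)]  # first row and column stay 0
    for r in range(1, rows):
        for c in range(1, cols):
            match = 1 if top[c - 1] == left[r - 1] else 0
            grid[r][c] = max(
                grid[r][c - 1],              # alternative 1: from the left
                grid[r - 1][c],              # alternative 2: from above
                grid[r - 1][c - 1] + match,  # alternative 3: diagonally
            )
    return grid


for row in score_grid("cay", "ayc"):
    print(row)
# [0, 0, 0, 0]
# [0, 0, 1, 1]
# [0, 0, 1, 2]
# [0, 1, 1, 2]
```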

And so the end node tells us that the maximal match is 2 letters. Since in your case you want to know the actual longest path with score 2, you also need to track, for each node, the path taken.

This graph is easily implemented as a matrix (or an array of arrays).

I would suggest using a tuple as each element, with one score element and one path element, where the path element just stores the aligned letters. The elements of the final matrix will then be

    c      a        y
  0 0      0        0
a 0 0      (1, a)   (1, a)
y 0 0      (1, a)   (2, ay)
c 0 (1, c) (1, a/c) (2, ay)

In one place I wrote a/c; this is because the strings ca and ayc have two different sub-sequences of maximum length. You need to decide what to do in those cases: either go with just one or save both.
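If you choose to save both, one possible sketch is to let the path element be a set of strings instead of a single string. The function name all_longest_common is invented for this illustration, and this is only one way to do it; note that the number of distinct longest sub-sequences can grow very quickly for long inputs.

```python
def all_longest_common(top, left):
    """Return the set of all distinct longest common sub-sequences of top and left."""
    rows, cols = len(left) + 1, len(top) + 1
    # each element is a tuple (score, set of paths reaching that score)
    grid = [[(0, {""}) for _ in range(cols)] for _ in range(rows)]
    for r in range(1, rows):
        for c in range(1, cols):
            candidates = [grid[r][c - 1], grid[r - 1][c]]  # from the left, from above
            if top[c - 1] == left[r - 1]:                  # diagonal move with a match
                diag_score, diag_paths = grid[r - 1][c - 1]
                candidates.append((diag_score + 1, {p + top[c - 1] for p in diag_paths}))
            best = max(cand_score for cand_score, _ in candidates)
            merged = set()
            for cand_score, cand_paths in candidates:
                if cand_score == best:  # keep every path that ties for the best score
                    merged |= cand_paths
            grid[r][c] = (best, merged)
    return grid[-1][-1][1]


print(sorted(all_longest_common("ca", "ayc")))   # ['a', 'c']
print(sorted(all_longest_common("cay", "ayc")))  # ['ay']
```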

EDIT:

Here's an implementation of this solution.

def longest_common(string_1, string_2):
    len_1 = len(string_1)
    len_2 = len(string_2)
    
    m = [[(0,"") for _ in range(len_1 + 1)] for _ in range(len_2 + 1)] # initialize matrix
    
    for row in range(1, len_2+1):
        for col in range(1, len_1+1):
            diag = 0
            match = ""
            if string_1[col-1] == string_2[row-1]: # score increases by one if the letters match on a diagonal move
                diag = 1
                match = string_1[col - 1]
            # find best alternative
            if m[row][col-1][0] >= m[row-1][col][0] and m[row][col-1][0] >= m[row-1][col-1][0]+diag:
                m[row][col] = m[row][col-1] # path from left is best
            elif m[row-1][col][0] >= m[row-1][col-1][0]+diag:
                m[row][col] = m[row-1][col] # path from above is best
            else:
                m[row][col] = (m[row-1][col-1][0]+diag, m[row-1][col-1][1]+match) # path diagonally is best

    return m[len_2][len_1][1]

>>> print(longest_common("hcarry", "sallyc"))
ay
>>> print(longest_common("cay", "ayc"))
ay
>>> m  # the matrix built inside longest_common("cay", "ayc"), shown for illustration
[[(0, ''), (0, ''), (0, ''), (0, '')],
 [(0, ''), (0, ''), (1, 'a'), (1, 'a')],
 [(0, ''), (0, ''), (1, 'a'), (2, 'ay')],
 [(0, ''), (1, 'c'), (1, 'c'), (2, 'ay')]]

Here is a simple, dynamic programming based implementation for the problem:

def lcs(X, Y):
    m, n = len(X), len(Y)

    # 2D matrix for dynamic programming:
    # L[i][j] stores the length of the longest common subsequence of X[0:i] and Y[0:j]
    # (range replaces the Python 2 only xrange; row 0 and column 0 stay 0)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i-1] == Y[j-1]:
                L[i][j] = L[i-1][j-1] + 1
            else:
                L[i][j] = max(L[i-1][j], L[i][j-1])

    # The following code recovers the common subsequence itself
    index = L[m][n]

    # Character list that will hold the LCS string
    chars = [""] * index

    # Start from the bottom-right corner and store the characters one by one, back to front
    i, j = m, n
    while i > 0 and j > 0:
        # If the current characters of X and Y are the same, the character is part of the LCS
        if X[i-1] == Y[j-1]:
            chars[index-1] = X[i-1]
            i -= 1
            j -= 1
            index -= 1
        # If not, go in the direction of the larger value
        elif L[i-1][j] > L[i][j-1]:
            i -= 1
        else:
            j -= 1

    print("".join(chars))

But you already know the term "longest common subsequence", and you can find numerous descriptions of the dynamic programming algorithm, for example in the Wikipedia article on the longest common subsequence problem.

Pseudocode:

function LCSLength(X[1..m], Y[1..n])
    C = array(0..m, 0..n)
    for i := 0..m
        C[i,0] = 0
    for j := 0..n
        C[0,j] = 0
    for i := 1..m
        for j := 1..n
            if X[i] = Y[j] //i-1 and j-1 if reading X & Y from zero
                C[i,j] := C[i-1,j-1] + 1
            else
                C[i,j] := max(C[i,j-1], C[i-1,j])
    return C[m,n]

function backtrack(C[0..m,0..n], X[1..m], Y[1..n], i, j)
    if i = 0 or j = 0
        return ""
    if  X[i] = Y[j]
        return backtrack(C, X, Y, i-1, j-1) + X[i]
    if C[i,j-1] > C[i-1,j]
        return backtrack(C, X, Y, i, j-1)
    return backtrack(C, X, Y, i-1, j)
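A fairly direct Python rendering of the pseudocode above might look like this. It is a sketch, not a tuned implementation: the function names lcs_table and backtrack are chosen for this example, indices are shifted by one because Python strings start at zero, and the recursive backtrack can hit Python's recursion limit for very long inputs.

```python
def lcs_table(x, y):
    """Build the table c where c[i][j] is the LCS length of x[:i] and y[:j]."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i][j - 1], c[i - 1][j])
    return c


def backtrack(c, x, y, i, j):
    """Read one longest common subsequence back out of the table."""
    if i == 0 or j == 0:
        return ""
    if x[i - 1] == y[j - 1]:
        return backtrack(c, x, y, i - 1, j - 1) + x[i - 1]
    if c[i][j - 1] > c[i - 1][j]:
        return backtrack(c, x, y, i, j - 1)
    return backtrack(c, x, y, i - 1, j)


x, y = "hcarry", "sallyc"
table = lcs_table(x, y)
print(backtrack(table, x, y, len(x), len(y)))  # ay
```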

A much easier solution. Thank you!

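def f(s, s1):
    # characters that occur in both strings
    cc = set(s) & set(s1)
    # each string reduced to its common characters, original order kept
    ns = ''.join([S for S in s if S in cc])
    ns1 = ''.join([S for S in s1 if S in cc])
    found = []
    if not ns:  # no common characters at all, nothing to pair up
        return found
    # collect every adjacent pair of ns that also occurs contiguously in ns1
    b = ns[0]
    for e in ns[1:]:
        cs = b+e
        if cs in ns1:
            found.append(cs)
        b = e
    return found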


 