How to find the longest common substring between two strings using Python?

I want to write Python code that computes the longest common substring between two input strings.

Example:

word1 = input('Give 1. word: xlaqseabcitt')
word2 = input('Give 2. word: peoritabcpeor')

Wanted output:

abc

I have code like this so far:

word1 = input("Give 1. word: ") 
word2 = input("Give 2. word: ")

longestSegment = "" 
tempSegment = ""

for i in range(len(word1)): 
    if word1[i] == word2[i]:
        tempSegment += word1[i] 
    else: 
        tempSegment = ""

if len(tempSegment) > len(longestSegment):
    longestSegment = tempSegment

print(longestSegment)

I end up with an IndexError when word2 is shorter than word1, and it does not give me the common substring.

EDIT: I found this solution:

string1 = input('Give 1. word: ')
string2 = input('Give 2. word: ')
answer = ""
len1, len2 = len(string1), len(string2)
for i in range(len1):
    for j in range(len2):
        lcs_temp=0
        match=''
        while ((i+lcs_temp < len1) and (j+lcs_temp<len2) and   string1[i+lcs_temp] == string2[j+lcs_temp]):
            match += string2[j+lcs_temp]
            lcs_temp+=1
        if (len(match) > len(answer)):
            answer = match

print(answer)

However, I would like to see a library function call that could be used to compute the longest common substring between two strings.

Alternatively, please suggest more concise code to achieve the same.

You can build a dictionary from the first string that maps each character to the positions where it occurs. Then go through the second string and, for each character that also occurs in the first string, compare the rest of the second string from that position with the rest of the first string from each recorded position, keeping the longest common prefix:

# extract common prefix
def common(A,B) :
    firstDiff = (i for i,(a,b) in enumerate(zip(A,B)) if a!=b) # 1st difference
    commonLen = next(firstDiff,min(len(A),len(B)))             # common length
    return A[:commonLen]

word1 = "xlaqseabcitt"
word2 = "peoritabcpeor"

# position(s) of each character in word1
sub1 = dict()  
for i,c in enumerate(word1): sub1.setdefault(c,[]).append(i)

# maximum (by length) of common prefixes from matching first characters
# default="" avoids a ValueError when the two words share no character
maxSub = max((common(word2[i:],word1[j:]) 
                       for i,c in enumerate(word2) 
                                 for j in sub1.get(c,[])),key=len,default="")


print(maxSub) # abc

For me, the solution that works is the suffix_trees package:

from suffix_trees import STree

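# the question's lowercase words; "ABC" and "abc" would only match if comparison ignored case
a = ["xlaqseabcitt", "peoritabcpeor"]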
st = STree.STree(a)
print(st.lcs()) # "abc"

Using difflib:

from difflib import SequenceMatcher

string1 = "apple pie available"
string2 = "come have some apple pies"
match = SequenceMatcher(None, string1, string2).find_longest_match(0, len(string1), 0, len(string2))
print(string1[match.a:match.a + match.size])
print(string2[match.b:match.b + match.size])

>>>
apple pie
apple pie
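
The difflib call above can be wrapped in a small helper so that it reads like a single function call. This is only a sketch: the helper name `longest_common_substring` is hypothetical, and passing `autojunk=False` is an added precaution (it disables difflib's heuristic that may treat very frequent characters as junk once a sequence has 200 or more items), not something the original answer used.

```python
from difflib import SequenceMatcher


def longest_common_substring(a: str, b: str) -> str:
    """Return one longest common substring of a and b ("" if there is none)."""
    # autojunk=False turns off the "popular element" heuristic that can
    # otherwise skip frequent characters in long inputs.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    # match.a is the start index in a, match.size the length of the block.
    return a[match.a:match.a + match.size]


print(longest_common_substring("xlaqseabcitt", "peoritabcpeor"))
```

If several substrings share the maximum length, only one of them is returned.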

"pylcs is a super fast C++ library which adopts dynamic programming (DP) algorithm to solve two classic LCS problems."

Usage:

import pylcs
A = 'xlaqseabcitt'
B = 'peoritabcpeor'

res = pylcs.lcs_string_idx(A, B)
substring = ''.join([B[i] for i in res if i != -1])

print(substring)  # 'abc'

Unfortunately, it's not a single function call, but it's still pretty short.

res maps each character in A to the index of the matching character in B within the longest common substring (-1 indicates that this character in A is NOT part of the longest common substring):

[-1, -1, -1, -1, -1, -1, 6, 7, 8, -1, -1, -1]
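
To make the mapping concrete, here is a sketch that reads the substring back out of an index list of that shape without calling pylcs at all. The list is copied from the output shown above; the helper name `substring_from_mapping` is hypothetical and not part of pylcs.

```python
def substring_from_mapping(a: str, res: list) -> str:
    """Slice a at the positions whose mapped index is not -1."""
    # Positions in a that belong to the longest common substring.
    hits = [k for k, idx in enumerate(res) if idx != -1]
    if not hits:
        return ""
    # The substring is contiguous, so first and last hit delimit it.
    return a[hits[0]:hits[-1] + 1]


A = 'xlaqseabcitt'
B = 'peoritabcpeor'
res = [-1, -1, -1, -1, -1, -1, 6, 7, 8, -1, -1, -1]  # copied from above

from_a = substring_from_mapping(A, res)
from_b = ''.join(B[i] for i in res if i != -1)
print(from_a, from_b)
```

Both ways of reading the list give the same text: slicing A at the non-negative positions, or indexing B with the stored values.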

Here is an answer in case you later want to handle any number of strings. It should return the longest common substring, and it worked with the different tests I gave it (as long as you don't use the '§' character). It is not a library, but you can still import the functions into your code just like a library, and you can use the same logic in your own code for just two strings. Do so as follows (put both files in the same directory for the sake of simplicity). I am assuming you will name the file findmatch.py.

import findmatch
longest_substring = findmatch.prep(['list', 'of', 'strings'])

Here is the code that should be in 'findmatch.py'.

def main(words,first):
    nextreference = first
    reference = first
    for word in words:
        foundsub = False
        print('reference : ',reference)
        print('word : ', word)
        num_of_substring = 0
        length_longest_substring = 0
        for i in range(len(word)):
            print('nextreference : ', nextreference)
            letter = word[i]
            print('letter : ', letter)
            if word[i] in reference:
                foundsub = True
                num_of_substring += 1
                # Use a plain local variable: writing through locals() inside a
                # function is not reliable (Python 3.13+ returns a fresh snapshot).
                substring = word[i]
                print('substring : ', substring)
                for j in range(len(reference)-i):
                    if word[i:i+j+1] in reference:
                        substring = word[i:i+j+1]
                        print('long_sub : ', substring)
                print('new : ', len(substring))
                print('next : ', len(nextreference))
                print('ref : ', len(reference))
                longer = (len(reference) < len(substring))
                longer2 = (len(nextreference) < len(substring))
                if (num_of_substring==1) or longer or longer2:
                    nextreference = substring
        if not foundsub:
            for i in range(len(words)):
                words[i] = words[i].replace(reference, '§')
                #§ should not be used in any of the strings, put a character you don't use here
            print(words)
            try:
                nextreference = main(words, first)
            except Exception as e:
                return None
        reference = nextreference
    return reference

def prep(words):
    first = words[0]
    words.remove(first)
    answer = main(words, first)
    return answer

if __name__ == '__main__':
    words = ['azerty','azertydqse','fghertqdfqf','ert','sazjjjjjjjjjjjert']
    #just some absurd examples any word in here
    substring = prep(words)
    print('answer : ',substring)

It is basically creating your own library.

I hope this answer helps someone.

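Here is a recursive solution. Note that it returns the length of the longest common subsequence (characters in order but not necessarily contiguous), not the longest common substring itself:

def lcs(X, Y, m, n):
    # m and n are the lengths of the prefixes of X and Y being compared
    if m == 0 or n == 0:
        return 0
    elif X[m - 1] == Y[n - 1]:
        return 1 + lcs(X, Y, m - 1, n - 1)
    else:
        return max(lcs(X, Y, m, n - 1), lcs(X, Y, m - 1, n))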
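
On a mismatch, the recursion above drops one character from either string and keeps going, which is the recurrence for a subsequence. For a contiguous substring, a run has to reset to zero on a mismatch. A minimal dynamic-programming sketch of that idea follows; the function name `longest_common_substring_dp` is hypothetical and this is a swapped-in technique, not the original answer's code.

```python
def longest_common_substring_dp(x: str, y: str) -> str:
    """Return one longest common substring of x and y ("" if there is none)."""
    best_len = 0   # length of the best run found so far
    best_end = 0   # end index (exclusive) of that run in x
    # prev[j] = length of the common run ending at x[i-2] and y[j-1]
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        curr = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                # Extend the run that ended one character earlier in both.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
            # On a mismatch curr[j] stays 0: the run is broken.
        prev = curr
    return x[best_end - best_len:best_end]


print(longest_common_substring_dp("xlaqseabcitt", "peoritabcpeor"))
```

This runs in O(len(x) * len(y)) time and keeps only two rows of the table in memory.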

Since someone asked for a multiple-word solution, here's one:

def multi_lcs(words):
    words.sort(key=lambda x:len(x))
    search = words.pop(0)
    s_len = len(search)
    for ln in range(s_len, 0, -1):
        for start in range(0, s_len-ln+1):
            cand = search[start:start+ln]
            for word in words:
                if cand not in word:
                    break
            else:
                return cand
    return False

>>> multi_lcs(['xlaqseabcitt', 'peoritabcpeor'])
'abc'

>>> multi_lcs(['xlaqseabcitt', 'peoritabcpeor', 'visontatlasrab'])
'ab'
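
One thing to be aware of: multi_lcs sorts the caller's list in place and pops its first element, so the list is changed after the call. A variant that leaves its input untouched and returns an empty string instead of False might look like the sketch below; the name `multi_lcs_pure` is hypothetical.

```python
def multi_lcs_pure(words):
    """Longest substring common to every string in words ("" if none)."""
    if not words:
        return ""
    # Any common substring must fit inside the shortest word.
    shortest = min(words, key=len)
    # Try candidate lengths from longest to shortest, left to right.
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            cand = shortest[start:start + length]
            if all(cand in word for word in words):
                return cand
    return ""


words = ['xlaqseabcitt', 'peoritabcpeor', 'visontatlasrab']
result = multi_lcs_pure(words)
print(result, len(words))
```

Because `min` only reads the list, `words` still holds all three strings afterwards.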

For small strings, copy this into a file in your project, for example string_utils.py:

def find_longest_common_substring(string1, string2):

    s1 = string1
    s2 = string2

    longest_substring = ""
    longest_substring_i1 = None
    longest_substring_i2 = None

    # iterate through every index (i1) of s1
    for i1, c1 in enumerate(s1):
        # for each index (i2) of s2 that matches s1[i1]
        for i2, c2 in enumerate(s2):
            # if start of substring
            if c1 == c2:
                delta = 1
                # make sure we aren't running past the end of either string
                while i1 + delta < len(s1) and i2 + delta < len(s2):
                    
                    # if end of substring
                    if s2[i2 + delta] != s1[i1 + delta]:
                        break
                    
                    # still matching characters move to the next character in both strings
                    delta += 1
                
                substring = s1[i1:(i1 + delta)]
                # print(f'substring candidate: {substring}')
                # replace longest_substring if newly found substring is longer
                if len(substring) > len(longest_substring):
                    longest_substring = substring
                    longest_substring_i1 = i1
                    longest_substring_i2 = i2

    return (longest_substring, longest_substring_i1, longest_substring_i2)

Then it can be used as follows:

import string_utils

print(f"""(longest substring, index of string1, index of string2): 
    { string_utils.find_longest_common_substring("stackoverflow.com", "tackerflow")}""")

For anyone curious, this is what the print statement outputs when uncommented:

substring candidate: tack
substring candidate: ack
substring candidate: ck
substring candidate: o
substring candidate: erflow
substring candidate: rflow
substring candidate: flow
substring candidate: low
substring candidate: ow
substring candidate: w
substring candidate: c
substring candidate: o

(longest substring, index of string1, index of string2): 
('erflow', 7, 4)

Here is a solution that is naive in terms of time complexity but simple enough to understand:

def longest_common_substring(a, b):
    """Find longest common substring between two strings A and B."""
    if len(a) > len(b):
        a, b = b, a
    for i in range(len(a), 0, -1):
        for j in range(len(a) - i + 1):
            if a[j:j + i] in b:
                return a[j:j + i]
    return ''
