
Longest common sequence between many sub-sequences

Fancy title :) I have a file that contains the following:

>sequence_40
ABCDABDCABCDBACDBACDBACDBACDABDCDC
ACDCCDCABDCADCADBCACBDCABD
>sequence_41
DCBACDBACDADCDCDCABCDCACBDCBDACBDC
BCDBABABBABACDCDBCACDBACDBACDBACDC
BCDB
...

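Then I have a function that returns a dictionary (called dict) with the sequence names as keys and the strings (joined onto one line) as the values. The sequences range from 40 to 59. I want to take the dictionary of sequences and return the longest common sub-sequence found in all of them. I managed to find some help on Stack Overflow and put together code that only compares the last two strings in the dictionary, not all of them :). Here is the code: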

def longest_common_sequence(s1, s2):
    m = [[0] * (1 + len(s2)) for i in range(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest = m[x][y]
                    x_longest = x
            else:
                m[x][y] = 0
    return s1[x_longest - longest: x_longest]

for i in range(40,59):
    s1=str(dictionar['sequence_'+str(i)])
    s2=str(dictionar['sequence_'+str(i+1)])
longest_common_sequence(s1,s2)

How can I modify it to get the common sub-sequence among all the sequences in the dictionary? Thanks!

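Edit: As @lmcarreiro pointed out, there is a relevant difference between substrings (or subarrays/sublists) and subsequences. As I understand it, we are all talking about substrings here, so I will use that term in my answer.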
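To make that distinction concrete, here is a minimal sketch (the helper names `is_substring` and `is_subsequence` are made up for illustration): a substring has to be contiguous, while a subsequence only has to keep the characters in order.

```python
def is_substring(sub, text):
    # A substring must appear as one contiguous run of characters.
    return sub in text


def is_subsequence(sub, text):
    # A subsequence keeps the order of the characters but may skip some.
    # `ch in remaining` advances the iterator until it finds `ch`.
    remaining = iter(text)
    return all(ch in remaining for ch in sub)


print(is_substring('ACD', 'ABCD'))    # False: 'ACD' is not contiguous in 'ABCD'
print(is_subsequence('ACD', 'ABCD'))  # True: A, C, D appear in that order
```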

Guillaume's answer can be improved:

def eachPossibleSubstring(string):
  for size in range(len(string), 0, -1):  # longest first; size len+1 would yield nothing
    for start in range(len(string) - size + 1):
      yield string[start:start+size]

def findLongestCommonSubstring(strings):
  shortestString = min(strings, key=len)
  for substring in eachPossibleSubstring(shortestString):
    if all(substring in string
        for string in strings if string != shortestString):
      return substring

print(findLongestCommonSubstring([
  'ABCDABDCABCDBACDBACDBACDBACDABDCDCACDCCDCABDCADCADBCACBDCABD',
  'DCBACDBACDADCDCDCABCDCACBDCBDACBDCBCDBABABBABACDCDBCACDBACDBACDBACDCBCDB',
]))

This prints:

ACDBACDBACDBACD

This is faster because I search from longest to shortest and return the first match found.

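The basic idea is this: take every possible substring of the shortest string (in order from longest to shortest) and check whether it can be found in all the other strings. If so, return it; otherwise try the next substring.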
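As a rough Python 3 sketch of this idea applied to a dictionary like the one in the question: the snake_case function names and the three short sequences are invented for illustration, and the real dictionary would come from the file-reading function.

```python
def each_possible_substring(string):
    # Yield every substring of `string`, longest first.
    for size in range(len(string), 0, -1):
        for start in range(len(string) - size + 1):
            yield string[start:start + size]


def find_longest_common_substring(strings):
    strings = list(strings)  # accept dict views and other iterables
    shortest = min(strings, key=len)
    for candidate in each_possible_substring(shortest):
        if all(candidate in s for s in strings):
            return candidate
    return ''  # the strings have no character in common


# Hypothetical stand-in for the dictionary built from the file.
dictionar = {
    'sequence_40': 'ABCDABDC',
    'sequence_41': 'DDABDCAA',
    'sequence_42': 'CABDCBBB',
}

print(find_longest_common_substring(dictionar.values()))  # ABDC
```

Passing `dictionar.values()` compares every sequence at once, so no loop over the key numbers is needed.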

You need to understand generators. Try this:

for substring in eachPossibleSubstring('abcd'):
  print(substring)

or

print(list(eachPossibleSubstring('abcd')))
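A small sketch of why the generator matters here: it produces substrings one at a time, so the search can stop at the first match without building the whole list. The function is repeated (with the size range starting at `len(string)`) so that the snippet runs on its own, and `itertools.islice` is used only to take a few values at a time.

```python
from itertools import islice


def eachPossibleSubstring(string):
    for size in range(len(string), 0, -1):
        for start in range(len(string) - size + 1):
            yield string[start:start + size]


gen = eachPossibleSubstring('abcd')
print(next(gen))             # abcd
print(list(islice(gen, 2)))  # ['abc', 'bcd']
print(list(gen))             # ['ab', 'bc', 'cd', 'a', 'b', 'c', 'd']
```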

I would first define a function that returns all possible subsequences of a given sequence:

from itertools import combinations
def subsequences(sequence):
    "returns all possible (contiguous) subsequences of a given sequence"
    # stop has to be able to reach len(sequence); otherwise slices that end
    # at the last character are never produced
    for start, stop in combinations(range(len(sequence) + 1), 2):
        yield sequence[start:stop]

Then I would use another function to check whether a given subsequence is present in all the given sequences:

def is_common_subsequence(sub, sequences):
    "returns True if <sub> is a common subsequence in all <sequences>"
    return all(sub in sequence for sequence in sequences)

With the two functions above it is easy to get all the common subsequences in a given set of sequences:

def common_sequences(sequences):
    "return all subsequences common in sequences"
    shortest_seq = min(sequences, key=len)
    return set(subsequence for subsequence in subsequences(shortest_seq) \
       if is_common_subsequence(subsequence, sequences))

...and extract the longest one:

def longuest_common_subsequence(sequences):
    "returns the longest subsequence common to all sequences"
    return max(common_sequences(sequences), key=len)

Results:

sequences = {
    41: 'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
    42: '123ABCDEFGHIJKLMNOPQRSTUVW',
    43: '123456ABCDEFGHIJKLMNOPQRST'
}

sequences2 = {
    0: 'ABCDEFGHIJ',
    1: 'DHSABCDFKDDSA',
    2: 'SGABCEIDEFJRNF'
}

print(longuest_common_subsequence(sequences.values()))
>>> ABCDEFGHIJKLMNOPQRST

print(longuest_common_subsequence(sequences2.values()))
>>> ABC

Here is one possible approach. First, let's define a function that returns the longest common substring of two strings:

def longest_substring(s1, s2):
    t = [[0]*(1+len(s2)) for i in range(1+len(s1))]
    l, xl = 0, 0
    for x in range(1,1+len(s1)):
        for y in range(1,1+len(s2)):
            if s1[x-1] == s2[y-1]:
                t[x][y] = t[x-1][y-1] + 1
                if t[x][y]>l:
                    l = t[x][y]
                    xl  = x
            else:
                t[x][y] = 0
    return s1[xl-l: xl]

Now I'll create a random dict of sequences for the example:

import random
import string

d = {i : ''.join(random.choice(string.ascii_uppercase) for _ in range(50)) for i in range(10)}

print(d)

{0: 'ASCUCEVJNIGWVMWMBBQQBZYBBNGQAJRYXACGFEIFWHMBCNYRGL', 1: 'HKUKZOJJUCRTSBLNZXCIBARLPNAPAABRBZEVGVILJAFCGWGQVV', 2: 'MMHCYPKECRJFEWTGYITMHZSNHAFEZVFYDAVILRYRKIDDBEFRVX', 3: 'DGBULRFJINFZEELDASRFBIRSADWMRAYMGCDAOJDKQIMXIRLTEI', 4: 'VDUFWZSXLRGOIMAHOAMZAIWDPTHDVDXUACRBASJMCUHREDORRH', 5: 'RFGAVHOWNKRZMYMSFSSNUGCKEWUNVETCDWJXSPBJHKSTPFNSJO', 6: 'HFMLMHCFSOEXBXWFAROIRGJNPRTKRWCEPLFOKGMXNUPCPWREWX', 7: 'CNPGSHGVIRLDXAADXUVWCTJCXUHQLALBUOJMXQBKXWHKGSJHEH', 8: 'UWDXXTRCFNCBUBEYGYTDWTPLNTRHYQWKTHPRVCBAWIMNGHULDC', 9: 'OOCJRXBZKJIGHZEJOOIKWKMQKIEQVPEDTFPJQAUQKJQVLOMGJB'}

Finally, we take the longest substring found across all pairs of sequences:

import itertools
max([longest_substring(i,j) for i,j in itertools.combinations(d.values(), 2)], key=len)

Output:

'VIL'    
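Note that a maximum over pairs gives a substring shared by at least one pair, which is not necessarily present in every sequence. If the substring has to occur in all of them, one alternative (not the method above, just a sketch with made-up function names) is to binary search on the length and intersect the sets of substrings of that length:

```python
def substrings_of_length(s, k):
    # All distinct substrings of s that are exactly k characters long.
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def common_substring_of_length(strings, k):
    # Return one substring of length k found in every string, or None.
    common = substrings_of_length(strings[0], k)
    for s in strings[1:]:
        common &= substrings_of_length(s, k)
        if not common:
            return None
    return min(common)  # pick deterministically among ties


def longest_common_substring_all(strings):
    strings = list(strings)
    if not strings:
        return ''
    best = ''
    lo, hi = 1, min(len(s) for s in strings)
    # If a common substring of length k exists, one of length k - 1 does too,
    # so the feasible lengths can be binary searched.
    while lo <= hi:
        mid = (lo + hi) // 2
        found = common_substring_of_length(strings, mid)
        if found is not None:
            best, lo = found, mid + 1
        else:
            hi = mid - 1
    return best


print(longest_common_substring_all(
    ['ABCDEFGHIJ', 'DHSABCDFKDDSA', 'SGABCEIDEFJRNF']))  # ABC
```

With the random 50-letter strings above, the result of this version will usually be very short or empty, since ten random sequences rarely share anything longer than a character or two.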

