
Longest common sequence between many sub-sequences

Fancy title :) I have a file that contains the following:

>sequence_40
ABCDABDCABCDBACDBACDBACDBACDABDCDC
ACDCCDCABDCADCADBCACBDCABD
>sequence_41
DCBACDBACDADCDCDCABCDCACBDCBDACBDC
BCDBABABBABACDCDBCACDBACDBACDBACDC
BCDB
...

Then, I have a function that returns a dictionary (called dict) with the sequence names as keys and the strings (joined onto one line) as values. The sequences range from 40 to 59. I want to take a dictionary of sequences and return the longest common sub-sequence found in ALL the sequences. I managed to find some help here on Stack Overflow and wrote code that only compares the LAST TWO strings in that dictionary, not all of them :). This is the code:

def longest_common_sequence(s1, s2):
    m = [[0] * (1 + len(s2)) for i in range(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest = m[x][y]
                    x_longest = x
            else:
                m[x][y] = 0
    return s1[x_longest - longest: x_longest]

for i in range(40,59):
    s1=str(dictionar['sequence_'+str(i)])
    s2=str(dictionar['sequence_'+str(i+1)])
longest_common_sequence(s1,s2)

How can I modify it to get the common subsequence among ALL sequences in the dictionary? Thanks!
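For anyone reproducing the setup, the dictionary described above might be built along these lines. This is only a sketch: read_sequences is a hypothetical name (the question does not show the real function), and it assumes every record starts with a header line beginning with ">" as in the sample file.

```python
def read_sequences(lines):
    """Map each '>name' header to its following lines joined into one string."""
    sequences = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            name = line[1:]  # drop the leading '>'
            sequences[name] = ""
        elif name is not None:
            sequences[name] += line  # continuation of the current record
    return sequences


# Shortened, made-up sample in the same layout as the file in the question.
example = """>sequence_40
ABCDABDC
ACDC
>sequence_41
DCBA
BCDB
""".splitlines()

print(read_sequences(example))
```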

EDIT: As @lmcarreiro pointed out, there is a relevant difference between substrings (or subarrays or sublists) and subsequences. To my understanding we are all talking about substrings here, so I will use this term in my answer.
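To make the distinction concrete, here is a small sketch. The two helper names are made up for illustration: a substring must be contiguous, while a subsequence only has to keep the characters in order.

```python
def is_substring(needle, haystack):
    """True if needle occurs as one contiguous run inside haystack."""
    return needle in haystack


def is_subsequence(needle, haystack):
    """True if the characters of needle appear in haystack in order, gaps allowed."""
    remaining = iter(haystack)
    # 'ch in remaining' consumes the iterator up to the first match,
    # so each character must be found after the previous one.
    return all(ch in remaining for ch in needle)


print(is_substring("ACD", "ABCD"))    # False: 'B' sits between 'A' and 'C'
print(is_subsequence("ACD", "ABCD"))  # True: A, C, D appear in that order
```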

Guillaume's answer can be improved:

def eachPossibleSubstring(string):
  # Yield substrings from the longest (the whole string) down to single characters.
  for size in range(len(string), 0, -1):
    for start in range(len(string) - size + 1):
      yield string[start:start+size]

def findLongestCommonSubstring(strings):
  shortestString = min(strings, key=len)
  for substring in eachPossibleSubstring(shortestString):
    if all(substring in string
        for string in strings if string != shortestString):
      return substring

print(findLongestCommonSubstring([
  'ABCDABDCABCDBACDBACDBACDBACDABDCDCACDCCDCABDCADCADBCACBDCABD',
  'DCBACDBACDADCDCDCABCDCACBDCBDACBDCBCDBABABBABACDCDBCACDBACDBACDBACDCBCDB',
]))

This prints:

ACDBACDBACDBACD

This is faster because it searches from longest to shortest and returns the first match it finds.

The basic idea is this: Take each possible substring of the shortest of your strings (in order from the longest to the shortest) and see if this substring can be found in all other strings. If so, return it, otherwise try the next substring.
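Applied to a dictionary like the one in the question, the idea might look like the sketch below. The function name and the three short strings are invented for the example, and returning an empty string when nothing is shared is an assumption, not something stated in the answer.

```python
def longest_common_substring(strings):
    """Return the longest substring present in every string, or '' if none."""
    strings = list(strings)  # allow dict views and other iterables
    if not strings:
        return ""
    shortest = min(strings, key=len)
    # Try candidate lengths from longest to shortest and stop at the first hit.
    for size in range(len(shortest), 0, -1):
        for start in range(len(shortest) - size + 1):
            candidate = shortest[start:start + size]
            if all(candidate in s for s in strings):
                return candidate
    return ""


# Made-up data, keyed like the dictionary in the question.
sequences = {
    "sequence_40": "ABCDABDC",
    "sequence_41": "XXBCDAYY",
    "sequence_42": "ZBCDAZZZ",
}
print(longest_common_substring(sequences.values()))  # BCDA
```

Passing the dictionary's values means all sequences are checked at once, rather than only neighbouring pairs.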

You need to understand generators. Try e.g. this:

for substring in eachPossibleSubstring('abcd'):
  print(substring)

or

print(list(eachPossibleSubstring('abcd')))
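The point of the generator is that substrings are produced one at a time, so the search can stop early without building the whole list. A small sketch, with the generator rewritten under a snake_case name so the snippet runs on its own:

```python
def each_possible_substring(string):
    """Lazily yield substrings of string, longest first."""
    for size in range(len(string), 0, -1):
        for start in range(len(string) - size + 1):
            yield string[start:start + size]


gen = each_possible_substring("abcd")
print(next(gen))  # abcd  (only this one substring has been computed so far)
print(next(gen))  # abc
print(list(gen))  # the rest: ['bcd', 'ab', 'bc', 'cd', 'a', 'b', 'c', 'd']
```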

I'd start by defining a function to return all possible subsequences of a given sequence:

from itertools import combinations
def subsequences(sequence):
    "returns all possible subsequences of a given sequence"
    # stop runs up to len(sequence) so slices ending at the last character are included;
    # combinations() yields pairs in increasing order, so start < stop always holds
    for start, stop in combinations(range(len(sequence) + 1), 2):
        yield sequence[start:stop]

then I'd make another method to check if a given subsequence is present in all given sequences:

def is_common_subsequence(sub, sequences):
    "returns True if <sub> is a common subsequence in all <sequences>"
    return all(sub in sequence for sequence in sequences)

then using the 2 methods above it is pretty easy to get all common subsequences in a given set of sequences:

def common_sequences(sequences):
    "return all subsequences common in sequences"
    shortest_seq = min(sequences, key=len)
    return set(subsequence for subsequence in subsequences(shortest_seq) \
       if is_common_subsequence(subsequence, sequences))

... and extracting the longest sequence:

def longuest_common_subsequence(sequences):
    "returns the longest subsequence in sequences"
    return max(common_sequences(sequences), key=len)

Result:

sequences = {
    41: 'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
    42: '123ABCDEFGHIJKLMNOPQRSTUVW',
    43: '123456ABCDEFGHIJKLMNOPQRST'
}

sequences2 = {
    0: 'ABCDEFGHIJ',
    1: 'DHSABCDFKDDSA',
    2: 'SGABCEIDEFJRNF'
}

print(longuest_common_subsequence(sequences.values()))
>>> ABCDEFGHIJKLMNOPQRST

print(longuest_common_subsequence(sequences2.values()))
>>> ABC
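Building the full set of common substrings can get slow for long inputs. One optional refinement, which is not part of the answer above, relies on the fact that if a common substring of length k exists, then one of length k - 1 exists too, so the length can be found by bisection. The function names below are hypothetical, and this is a sketch rather than a tuned implementation.

```python
def common_substring_of_length(strings, size):
    """Return some substring of the given size found in every string, else None."""
    shortest = min(strings, key=len)
    for start in range(len(shortest) - size + 1):
        candidate = shortest[start:start + size]
        if all(candidate in s for s in strings):
            return candidate
    return None


def longest_common_substring_bisect(strings):
    """Binary-search the length of the longest substring shared by all strings."""
    strings = list(strings)
    if not strings:
        return ""
    lo, hi = 0, min(len(s) for s in strings)
    best = ""
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        found = common_substring_of_length(strings, mid)
        if found is not None:
            best, lo = found, mid  # a match of this size exists, try longer
        else:
            hi = mid - 1           # no match of this size, try shorter
    return best


print(longest_common_substring_bisect(
    ["ABCDEFGHIJ", "DHSABCDFKDDSA", "SGABCEIDEFJRNF"]))  # ABC
```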

Here you have a possible approach. First let's define a function that returns the longest substring between two strings:

def longest_substring(s1, s2):
    t = [[0]*(1+len(s2)) for i in range(1+len(s1))]
    l, xl = 0, 0
    for x in range(1,1+len(s1)):
        for y in range(1,1+len(s2)):
            if s1[x-1] == s2[y-1]:
                t[x][y] = t[x-1][y-1] + 1
                if t[x][y]>l:
                    l = t[x][y]
                    xl  = x
            else:
                t[x][y] = 0
    return s1[xl-l: xl]

Now I'll create a random dict of sequences for the example:

import random
import string

d = {i : ''.join(random.choice(string.ascii_uppercase) for _ in range(50)) for i in range(10)}

print(d)

{0: 'ASCUCEVJNIGWVMWMBBQQBZYBBNGQAJRYXACGFEIFWHMBCNYRGL', 1: 'HKUKZOJJUCRTSBLNZXCIBARLPNAPAABRBZEVGVILJAFCGWGQVV', 2: 'MMHCYPKECRJFEWTGYITMHZSNHAFEZVFYDAVILRYRKIDDBEFRVX', 3: 'DGBULRFJINFZEELDASRFBIRSADWMRAYMGCDAOJDKQIMXIRLTEI', 4: 'VDUFWZSXLRGOIMAHOAMZAIWDPTHDVDXUACRBASJMCUHREDORRH', 5: 'RFGAVHOWNKRZMYMSFSSNUGCKEWUNVETCDWJXSPBJHKSTPFNSJO', 6: 'HFMLMHCFSOEXBXWFAROIRGJNPRTKRWCEPLFOKGMXNUPCPWREWX', 7: 'CNPGSHGVIRLDXAADXUVWCTJCXUHQLALBUOJMXQBKXWHKGSJHEH', 8: 'UWDXXTRCFNCBUBEYGYTDWTPLNTRHYQWKTHPRVCBAWIMNGHULDC', 9: 'OOCJRXBZKJIGHZEJOOIKWKMQKIEQVPEDTFPJQAUQKJQVLOMGJB'}

Finally, we take the longest result over every pair of sequences. Note that this is the longest substring shared by some pair, which is not necessarily present in all sequences:

import itertools
max([longest_substring(i,j) for i,j in itertools.combinations(d.values(), 2)], key=len)

Output:

'VIL'    
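A pairwise maximum like this is worth checking against every sequence before treating it as common to all of them. A minimal sketch, where is_common_to_all is a made-up helper and the three strings are invented for the example:

```python
def is_common_to_all(candidate, strings):
    """True only if candidate occurs in every one of the strings."""
    return all(candidate in s for s in strings)


strings = ["XXABCDYY", "ZZABCDWW", "QQABQQQQ"]

# 'ABCD' is the longest match between the first two strings only.
print(is_common_to_all("ABCD", strings))  # False
# 'AB' is shorter, but it really does occur in all three.
print(is_common_to_all("AB", strings))    # True
```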
