簡體   English   中英

許多子序列之間最長的共同序列

[英]Longest common sequence between many sub-sequences

花哨的標題:)我有一個包含以下內容的文件:

>sequence_40
ABCDABDCABCDBACDBACDBACDBACDABDCDC
ACDCCDCABDCADCADBCACBDCABD
>sequence_41
DCBACDBACDADCDCDCABCDCACBDCBDACBDC
BCDBABABBABACDCDBCACDBACDBACDBACDC
BCDB
...

然后,我有一個函數返回一個字典(稱為dict),它返回序列作為鍵和字符串(組合在一行上)作為鍵的值。 序列范圍從40到59.我想獲取序列字典並返回在所有序列中找到的最長公共子序列。 管理在stackoverflow上找到一些幫助,並制作了一個代碼,只比較該字典中的最后兩個字符串,而不是所有字符串:)。 這是代碼

def longest_common_sequence(s1, s2):
    m = [[0] * (1 + len(s2)) for i in range(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest = m[x][y]
                    x_longest = x
            else:
                m[x][y] = 0
    return s1[x_longest - longest: x_longest]

for i in range(40,59):
    s1=str(dictionar['sequence_'+str(i)])
    s2=str(dictionar['sequence_'+str(i+1)])
longest_common_sequence(s1,s2)

如何修改它以獲得字典中所有序列之間的公共子序列? 謝謝!

編輯:正如@lmcarreiro所指出的, 子串 (或子陣列子列表 )和 序列之間存在相關差異。 根據我的理解,我們都在談論這里的子串 ,所以我將在我的答案中使用這個術語。

Guillaumes答案可以改進:

def eachPossibleSubstring(string):
  for size in range(len(string) + 1, 0, -1):
    for start in range(len(string) - size + 1):
      yield string[start:start+size]

def findLongestCommonSubstring(strings):
  shortestString = min(strings, key=len)
  for substring in eachPossibleSubstring(shortestString):
    if all(substring in string
        for string in strings if string != shortestString):
      return substring

print findLongestCommonSubstring([
  'ABCDABDCABCDBACDBACDBACDBACDABDCDCACDCCDCABDCADCADBCACBDCABD',
  'DCBACDBACDADCDCDCABCDCACBDCBDACBDCBCDBABABBABACDCDBCACDBACDBACDBACDCBCDB',
])

這打印:

ACDBACDBACDBACD

這更快,因為我返回第一個找到並搜索從最長到最短。

基本思路是這樣:取最短字符串的每個可能的子字符串(按照從最長到最短的順序),看看是否可以在所有其他字符串中找到此子字符串。 如果是這樣,返回它,否則嘗試下一個子串。

你需要了解發電機 試試這樣:

for substring in eachPossibleSubstring('abcd'):
  print substring

要么

print list(eachPossibleSubstring('abcd'))

我首先定義一個函數來返回給定序列的所有可能的子序列:

from itertools import combinations_with_replacement
def subsequences(sequence):
    "returns all possible subquences of a given sequence"
    for start, stop in combinations_with_replacement(range(len(sequence)), 2):
        if start < stop:
            yield sequence[start:stop]

然后我會用另一種方法檢查所有給定序列中是否存在給定的子序列:

def is_common_subsequence(sub, sequences):
    "returns True if <sub> is a common subsequence in all <sequences>"
    return all(sub in sequence for sequence in sequences)

然后使用上面的兩種方法很容易得到給定序列集中的所有常見子序列:

def common_sequences(sequences):
    "return all subsequences common in sequences"
    shortest_seq = min(sequences, key=len)
    return set(subsequence for subsequence in subsequences(shortest_seq) \
       if is_common_subsequence(subsequence, sequences))

...並提取最長的序列:

def longuest_common_subsequence(sequences):
    "returns the longuest subsequence in sequences"
    return max(common_sequences(sequences), key=len)

結果:

sequences = {
    41: 'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
    42: '123ABCDEFGHIJKLMNOPQRSTUVW',
    43: '123456ABCDEFGHIJKLMNOPQRST'
}

sequences2 = {
    0: 'ABCDEFGHIJ',
    1: 'DHSABCDFKDDSA',
    2: 'SGABCEIDEFJRNF'
}

print(longuest_common_subsequence(sequences.values()))
>>> ABCDEFGHIJKLMNOPQRST

print(longuest_common_subsequence(sequences2.values()))
>>> ABC

在這里你有一個可能的方法。 首先讓我們定義一個返回兩個字符串之間最長子串的函數:

def longest_substring(s1, s2):
    t = [[0]*(1+len(s2)) for i in range(1+len(s1))]
    l, xl = 0, 0
    for x in range(1,1+len(s1)):
        for y in range(1,1+len(s2)):
            if s1[x-1] == s2[y-1]:
                t[x][y] = t[x-1][y-1] + 1
                if t[x][y]>l:
                    l = t[x][y]
                    xl  = x
            else:
                t[x][y] = 0
    return s1[xl-l: xl]

現在我將為該示例創建一個序列的隨機dict

import random
import string

d = {i : ''.join(random.choice(string.ascii_uppercase) for _ in range(50)) for i in range(10)}

print d

{0: 'ASCUCEVJNIGWVMWMBBQQBZYBBNGQAJRYXACGFEIFWHMBCNYRGL', 1: 'HKUKZOJJUCRTSBLNZXCIBARLPNAPAABRBZEVGVILJAFCGWGQVV', 2: 'MMHCYPKECRJFEWTGYITMHZSNHAFEZVFYDAVILRYRKIDDBEFRVX', 3: 'DGBULRFJINFZEELDASRFBIRSADWMRAYMGCDAOJDKQIMXIRLTEI', 4: 'VDUFWZSXLRGOIMAHOAMZAIWDPTHDVDXUACRBASJMCUHREDORRH', 5: 'RFGAVHOWNKRZMYMSFSSNUGCKEWUNVETCDWJXSPBJHKSTPFNSJO', 6: 'HFMLMHCFSOEXBXWFAROIRGJNPRTKRWCEPLFOKGMXNUPCPWREWX', 7: 'CNPGSHGVIRLDXAADXUVWCTJCXUHQLALBUOJMXQBKXWHKGSJHEH', 8: 'UWDXXTRCFNCBUBEYGYTDWTPLNTRHYQWKTHPRVCBAWIMNGHULDC', 9: 'OOCJRXBZKJIGHZEJOOIKWKMQKIEQVPEDTFPJQAUQKJQVLOMGJB'}

最后,我們需要找到所有序列之間最長的子序列:

import itertools
max([longest_substring(i,j) for i,j in itertools.combinations(d.values(), 2)], key=len)

輸出:

'VIL'    

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM