Comparing one element of a list with all elements of another list

Question

I have a list containing various sequences of letters.

sequences = ['AAGTAAA', 'AAATGAT', 'AAAGTTT', 'TTTTCCC', 'AATTCGC', 'CGCTCCC']

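I want to check whether the last 3 letters of each sequence in the list match the first 3 letters of any of the other sequences. When they do, I want to know the indexes of the two sequences.

I am basically trying to generate an adjacency list. Here is a sample input: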

>Sample_0
AAGTAAA
>Sample_1
AAATGAT
>Sample_2
AAAGTTT
>Sample_3
TTTTCCC
>Sample_4
AATTCGC
>Sample_5
CGCTCCC

And the output:

>Sample_0 >Sample_1
>Sample_0 >Sample_2
>Sample_2 >Sample_3
>Sample_4 >Sample_5

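So far, I have tried building two separate lists, one holding all the prefixes and one holding all the suffixes, but I don't know whether that helps or how to use them to solve my problem.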

file = open("rosalind_grph2.txt", "r")

gene_names, sequences, = [], []
seq = ""

for line in file:
    if line[0] == ">":
        gene_names.append(line.strip())
        if seq == "":
            continue
        sequences.append(seq)
        seq = ""
    if line[0] in "ATCG":
        seq = seq + line.strip()
sequences.append(seq)

#So far I put all I needed into a list

prefix = [i[0:3] for i in sequences]
suffix = [i[len(i)-3:] for i in sequences]

#Now, all suffixes and prefixes are in lists as well
#but what now?  

print(suffix)
print(prefix)
print(sequences)
file.close()

Answer 1

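If I understood your question correctly, this code enumerates the list twice. It compares the last 3 letters of the first element with the first 3 letters of the second element and, if they match, prints the indexes of both elements. If this is not what you want, please provide feedback/clarification. This is O(n^2), and it could probably be sped up by making an initial pass and storing the indexes in a structure such as a dictionary.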


for index1, sequence1 in enumerate(sequences):
    for index2, sequence2 in enumerate(sequences):
        if index1 != index2:
            if sequence1[-3:] == sequence2[0:3]:
                print(sequence1[-3:], index1, sequence2[0:3], index2)
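
The dictionary idea mentioned above could look roughly like the sketch below. It is only one possible way to do it: the function name overlap_pairs and the use of collections.defaultdict are my own assumptions, not part of the original answer. A first pass groups the indexes by their first 3 letters, so each suffix is then looked up directly instead of being compared against every sequence.

```python
from collections import defaultdict


def overlap_pairs(sequences, length=3):
    """Return (i, j) pairs where sequences[i] ends with the start of sequences[j].

    Hypothetical helper, written only to illustrate the dictionary approach.
    """
    # Initial pass: group the indexes of the sequences by their prefix.
    by_prefix = defaultdict(list)
    for j, seq in enumerate(sequences):
        by_prefix[seq[:length]].append(j)

    pairs = []
    for i, seq in enumerate(sequences):
        # .get() avoids creating empty entries for suffixes that match nothing.
        for j in by_prefix.get(seq[-length:], []):
            if i != j:  # skip comparing a sequence with itself
                pairs.append((i, j))
    return pairs


sequences = ['AAGTAAA', 'AAATGAT', 'AAAGTTT', 'TTTTCCC', 'AATTCGC', 'CGCTCCC']
print(overlap_pairs(sequences))
# [(0, 1), (0, 2), (2, 3), (4, 5)]
```

The lookups themselves are no longer quadratic, although the number of reported pairs can still grow quadratically when many sequences share the same 3 letters.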

Answer 2

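If I understand correctly, what you want is to connect different elements of sequences, where a connection means that the beginning of one string matches the end of another.

One way to do this with a dict is the following function match_head_tail():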

def match_head_tail(items, length=3):
    result = {}
    for x in items:
        v = [y for y in items if y[:length] == x[-length:]]
        if v:
            result[x] = v
    return result

sequences = ['AAGTAAA', 'AAATGAT', 'AAAGTTT', 'TTTTCCC', 'AATTCGC', 'CGCTCCC']

print(match_head_tail(sequences))
# {'AAGTAAA': ['AAATGAT', 'AAAGTTT'], 'AAAGTTT': ['TTTTCCC'], 'AATTCGC': ['CGCTCCC']}

If you also want to include the sequences without a match, you can use the following function match_head_tail_all():

def match_head_tail_all(items, length=3):
    return {x: [y for y in items if y[:length] == x[-length:]] for x in items}

sequences = ['AAGTAAA', 'AAATGAT', 'AAAGTTT', 'TTTTCCC', 'AATTCGC', 'CGCTCCC']

print(match_head_tail_all(sequences))
# {'AAGTAAA': ['AAATGAT', 'AAAGTTT'], 'AAATGAT': [], 'AAAGTTT': ['TTTTCCC'], 'TTTTCCC': [], 'AATTCGC': ['CGCTCCC'], 'CGCTCCC': []}

Edit 1

If you really want the indexes, combine the above with enumerate() to get them, e.g.:

def match_head_tail_all_indexes(items, length=3):
    return {
        i: [j for j, y in enumerate(items) if y[:length] == x[-length:]]
        for i, x in enumerate(items)}


sequences = ['AAGTAAA', 'AAATGAT', 'AAAGTTT', 'TTTTCCC', 'AATTCGC', 'CGCTCCC']

print(match_head_tail_all_indexes(sequences))
# {0: [1, 2], 1: [], 2: [3], 3: [], 4: [5], 5: []}
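
To get from that index dictionary to the ">Sample_0 >Sample_1" lines shown in the question, something like the following sketch might work. The helper name format_adjacency is hypothetical, gene_names is assumed to hold the header lines in the same order as sequences (as in the question's parsing code), and the adjacency dictionary is simply the output printed above.

```python
def format_adjacency(adjacency, names):
    """Turn {index: [indexes]} into 'name_i name_j' lines (hypothetical helper)."""
    lines = []
    for i, targets in adjacency.items():
        for j in targets:
            if i != j:  # the question only asks about *other* sequences
                lines.append(f"{names[i]} {names[j]}")
    return lines


# Assumed to match the headers read by the question's parsing code.
gene_names = ['>Sample_0', '>Sample_1', '>Sample_2',
              '>Sample_3', '>Sample_4', '>Sample_5']
# The result of match_head_tail_all_indexes(sequences) shown above.
adjacency = {0: [1, 2], 1: [], 2: [3], 3: [], 4: [5], 5: []}

for line in format_adjacency(adjacency, gene_names):
    print(line)
# >Sample_0 >Sample_1
# >Sample_0 >Sample_2
# >Sample_2 >Sample_3
# >Sample_4 >Sample_5
```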

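Edit 2

If your input contains many sequences with the same ending, you may want to consider implementing some caching mechanism to improve computational efficiency (at the expense of memory efficiency), e.g.: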

def match_head_tail_cached(items, length=3, caching=True):
    result = {}
    cached = {}  # maps a suffix to the list of items that start with it
    for x in items:
        tail = x[-length:]
        if caching and tail in cached:
            v = cached[tail]
        else:
            v = [y for y in items if y[:length] == tail]
            if caching:
                cached[tail] = v  # store the result so later lookups can reuse it
        if v:
            result[x] = v
    return result


sequences = ['AAGTAAA', 'AAATGAT', 'AAAGTTT', 'TTTTCCC', 'AATTCGC', 'CGCTCCC']

print(match_head_tail_cached(sequences))
# {'AAGTAAA': ['AAATGAT', 'AAAGTTT'], 'AAAGTTT': ['TTTTCCC'], 'AATTCGC': ['CGCTCCC']}

Edit 3

All of this can also be implemented using only list, e.g.:

def match_head_tail_list(items, length=3):
    result = []
    for x in items:
        v = [y for y in items if y[:length] == x[-length:]]
        if v:
            result.append([x, v])
    return result


sequences = ['AAGTAAA', 'AAATGAT', 'AAAGTTT', 'TTTTCCC', 'AATTCGC', 'CGCTCCC']

print(match_head_tail_list(sequences))
# [['AAGTAAA', ['AAATGAT', 'AAAGTTT']], ['AAAGTTT', ['TTTTCCC']], ['AATTCGC', ['CGCTCCC']]]

Or with even less nesting:

def match_head_tail_flat(items, length=3):
    result = []
    for x in items:
        for y in items:
            if y[:length] == x[-length:]:
                result.append([x, y])
    return result


sequences = ['AAGTAAA', 'AAATGAT', 'AAAGTTT', 'TTTTCCC', 'AATTCGC', 'CGCTCCC']

print(match_head_tail_flat(sequences))
# [['AAGTAAA', 'AAATGAT'], ['AAGTAAA', 'AAAGTTT'], ['AAAGTTT', 'TTTTCCC'], ['AATTCGC', 'CGCTCCC']]
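
One caveat worth noting: the functions in this answer compare every item with every item, including itself, so a sequence whose last 3 letters equal its own first 3 letters would be paired with itself. The sample data contains no such sequence, which is why the outputs above are unaffected. If that matters, a variant along these lines might help; it is only a sketch, and the name match_head_tail_flat_no_self and the two-element test list are made up for illustration.

```python
def match_head_tail_flat_no_self(items, length=3):
    """Like match_head_tail_flat(), but never pairs an item with itself."""
    result = []
    for i, x in enumerate(items):
        for j, y in enumerate(items):
            # Compare positions, not values, so duplicate strings still match.
            if i != j and y[:length] == x[-length:]:
                result.append([x, y])
    return result


# 'AAACAAA' both starts and ends with 'AAA' (made-up example data).
sequences = ['AAACAAA', 'AAATTTT']

print(match_head_tail_flat_no_self(sequences))
# [['AAACAAA', 'AAATTTT']]
```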

Answer 1
1 2019-10-21 13:23:33

Answer 2
0 2019-10-21 15:03:04
