Looping over list multiple times

Is it possible to iterate through a list multiple times? Basically, I have a list of strings and I am looking for the longest superstring. The strings in the list are all the same size, and each one overlaps with another by at least half of its length. I want to check whether the superstring I am building starts with or ends with part of each sequence in the list. When I find a match, I want to add that element to my superstring, delete the element from the list, and then loop over the list again and again until it is empty.

sequences=['ATTAGACCTG','CCTGCCGGAA','AGACCTGCCG','GCCGGAATAC']
halfway= len(sequences[0])//2
genome=sequences[0]     # this is the string that will be added onto throughout the loop
sequences.remove(sequences[0]) 


for j in range(len(sequences)):
    for sequence in sequences:
        front=[]
        back=[]
        for i in range(halfway,len(sequence)):

            if genome.endswith(sequence[:i]):
                genome=genome+sequence[i:] 
                sequences.remove(sequence)

            elif genome.startswith(sequence[-i:]):
                genome=sequence[:i]+genome  
                sequences.remove(sequence)
'''
            elif not genome.startswith(sequence[-i:]) or not genome.endswith(sequence[:i]):

                sequences.remove(sequence)      # this doesnt seem to work want to get rid of 
                                                #sequences that are in the middle of the string and 
                                                 #already accounted for 
'''

This works when I don't use the final elif statement, and it gives me the correct answer, ATTAGACCTGCCGGAATAC. However, when I do this with a larger list of strings, I am still left with strings in the list, which I expected to be empty. Also, is the last loop even necessary if I am only looking for strings to add onto the front and back of the superstring (genome in my code)?

Try this:

sequences=['ATTAGACCTG','CCTGCCGGAA','AGACCTGCCG','GCCGGAATAC']
sequences.reverse()
genome = sequences.pop(-1)     # this is the string that will be added onto throughout the loop

unrelated = []

while(sequences):
    sequence = sequences.pop(-1)
    if sequence in genome: continue
    found=False
    for i in range(3,len(sequence)):
        if genome.endswith(sequence[:i]):
            genome=genome+sequence[i:]
            found = True
            break
        elif genome.startswith(sequence[-i:]):
            genome=sequence[:-i]+genome   # prepend only the part before the overlap
            found = True
            break
    if not found:
        unrelated.append(sequence)

print(genome)
#ATTAGACCTGCCGGAATAC
print(sequences)
#[]
print(unrelated)
#[]

I do not know whether you are guaranteed not to have unrelated sequences in the same batch, so I allowed for them with the unrelated list. If that is not necessary, feel free to remove it.

Deleting from the front of a Python list is inefficient, because every remaining element has to shift down, so I reversed the list and pull from the back. The reversal might not be necessary depending on the full data (it is necessary with your example data).
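As an alternative to reversing the list, here is a minimal sketch that uses collections.deque from the standard library, which supports cheap removal from the front. This is a swapped-in technique, not what the code above does, and the variable name order is only for illustration.

```python
from collections import deque

# A deque supports O(1) removal from the left end, so no reversal is needed.
sequences = deque(['ATTAGACCTG', 'CCTGCCGGAA', 'AGACCTGCCG', 'GCCGGAATAC'])

genome = sequences.popleft()   # first sequence seeds the superstring

order = []                     # illustrative: records the order of processing
while sequences:
    order.append(sequences.popleft())
```

The sequences come out in their original order, so the rest of the loop body can stay the same.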

I pop from the sequences list while there are sequences available, which avoids removing elements from a list while iterating through it. I then check whether the sequence is already contained in the genome. If it is not, I go on to the endswith / startswith checks. If a match is found, I slice it into genome, set the found flag, and break out of the for loop.
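To see why removing elements while iterating leaves items behind, here is a small illustration with made-up data (the names items, visited, pending and seen are not from the question). The for loop advances by index, so every removal makes it skip the next element.

```python
# Removing during iteration: the loop index moves on while the list shrinks.
items = ['a', 'b', 'c', 'd']
visited = []
for item in items:
    visited.append(item)
    items.remove(item)

print(visited)   # ['a', 'c']  ('b' and 'd' were never visited)
print(items)     # ['b', 'd']  (still left in the list)

# Popping until the list is empty visits every element exactly once.
pending = ['a', 'b', 'c', 'd']
seen = []
while pending:
    seen.append(pending.pop())

print(seen)      # ['d', 'c', 'b', 'a']
```

This is the same effect that leaves strings in the list in the original nested loops.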

If the sequence is not already contained and no partial match is found, it gets put into unrelated.
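If the sequences can arrive in any order, a sequence that finds no match on the first pass may match later, once the genome has grown. The following is a rough sketch of retrying until a full pass makes no progress. The helpers merge and assemble and the min_overlap parameter are hypothetical names, and trying the longest overlap first is my own choice rather than something from the code above.

```python
def merge(genome, sequence, min_overlap):
    """Return genome extended by sequence, or None if they do not overlap."""
    # Try the longest overlap first to avoid short accidental matches.
    for i in range(len(sequence) - 1, min_overlap - 1, -1):
        if genome.endswith(sequence[:i]):
            return genome + sequence[i:]      # append the non-overlapping tail
        if genome.startswith(sequence[-i:]):
            return sequence[:-i] + genome     # prepend the non-overlapping head
    return None


def assemble(sequences, min_overlap=3):
    """Grow a superstring from the first sequence, retrying leftovers."""
    pending = list(sequences)                 # work on a copy
    genome = pending.pop(0)
    progress = True
    while pending and progress:
        progress = False
        still_pending = []
        for sequence in pending:
            if sequence in genome:            # already covered, drop it
                progress = True
                continue
            merged = merge(genome, sequence, min_overlap)
            if merged is None:
                still_pending.append(sequence)   # retry on the next pass
            else:
                genome = merged
                progress = True
        pending = still_pending
    return genome, pending


genome, unrelated = assemble(
    ['ATTAGACCTG', 'GCCGGAATAC', 'AGACCTGCCG', 'CCTGCCGGAA'])
print(genome)      # ATTAGACCTGCCGGAATAC
print(unrelated)   # []
```

Whatever is still in the second return value after the loop stops plays the role of the unrelated list.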

This is how I ended up solving it. I realized that all you need to do is find out which string is the start of the superstring. Since we know that the sequences overlap by 1/2 or more, I looked for the sequence whose first half isn't contained in any of the other sequences. From there I looped over the list a number of times equal to the length of the list and looked for sequences whose beginning matched the ending of the genome. When I found one, I added the sequence onto the genome (superstring), removed it from the list, and continued iterating through the list. When working with a list of 50 sequences that each have a length of 1000, this code takes around 0.806441 (presumably seconds) to run.

def moveFirstSeq(seqList): # move the first sequence in genome to the end of list 
    d={}
    for seq in seqList:
        count=0
        for seq1 in seqList:

            if seq==seq1:
                continue    # do not compare a sequence with itself
            if seq[0:len(seq)//2] not in seq1:
                count+=1
                d[seq]= count

    sorted_values=sorted(d.values())
    first_sequence=''
    for k,v in d.items():
        if v==sorted_values[-1]:
            first_sequence=k
            seqList.remove(first_sequence)

            seqList.append(first_sequence)

    return seqList


seq= moveFirstSeq(sequences)  
genome = seq.pop(-1)   # added first sequence to genome and removed from list 

for j in range(len(sequences)):   # loop over the list as many times as it has sequences
    for sequence in list(sequences):   # iterate over a copy so removing is safe

        for i in range(len(sequence)//2,len(sequence)):

            if genome.endswith(sequence[:i]):
                genome=genome+sequence[i:]  # adding onto the superstring and
                sequences.remove(sequence) #removing it from the sequence list
                break   # this sequence is used up, move on to the next one

print(genome, seq)
