
Extend list of lists only if first item of new list is unique

I'm parsing an output file from an NCBI BLAST search for a bioinformatics application. Essentially, the search takes a template genetic sequence and finds a series of sequences (contigs) with significant similarity to the template sequence.

In order to extract the many matches for contigs, my goal is to create a list of lists with the following format:

[(contig #), (frame #), (first character # of the subject ("Sbjct")), (last character # of the subject ("Sbjct"))]

For example, the output sublist for a section with contig #1568, frame = -1, starting on character #5509 of the subject and ending on character #3914 of the subject is:

[1568,-1,5509,3914]

In this question I've left off the final item of the sublists. My challenge is that because there are multiple readout files, sometimes containing the same contig as other files, the list of lists that I'm creating sometimes gets extended with the same contig twice. Let me explain.

As shown in the code block below, I tried to add a new sublist only if it was unique (not already present). I think the problem is that the membership test compares every item of the new sublist against every item of the existing sublists. This produces duplicates: the contig # is the same, but the other parameters differ, so the sublists are not considered equal. I want to keep only the first sublist with a given contig #, regardless of the other parameters.

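import re
from linecache import getline  # assumption: getline is linecache.getline

# x, contents, and file are defined earlier (not shown)
for ind, line in enumerate(contents, 1):
    if re.search("(.*)>(.*)", line):
        # contig number: the text between '[' and ']' on the header line
        c1 = line.split('[')
        c2 = c1[1].split(']')
        c3 = c2[0]
        # frame: read from the line five lines below the header
        my_line = getline(file.name, ind + 5)
        f1 = my_line.split('= ')
        if '+' in f1[1]:
            f2 = f1[1].split('+')
            f3 = f2[1].split('\n')[0]
        else:
            f3 = f1[1].split('\n')[0]
        my_line2 = getline(file.name, ind + 7)
        q1 = my_line2.split(' ')[2]
        my_line3 = getline(file.name, ind - 3)
        l1 = [c3, f3, q1]
        # compares the whole sublist, not just the contig number
        if l1 not in x:
            x.extend([l1])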

Here is what I received for my actual output:

[['1568', '-1', '12'], ['0003', '1', '12'], ['0130', '3', '12'], ['0097', '1', '20'], ['0512', '3', '11'], ['0315', '-1', '296'], ['0118', '-2', '52'], ['0308', '-3', '488'], ['1568', '-1', '1'], ['0003', '1', '1'], ['0130', '3', '4'], ['0097', '1', '28'], ['0512', '3', '23'], ['0315', '-1', '21'], ['0118', '-2', '39'], ['0102', '-3', '293'], ['0495', '-1', '146'], ['0386', '-3', '146']]

And here is what I expected:

[['1568', '-1', '12'], ['0003', '1', '12'], ['0130', '3', '12'], ['0097', '1', '20'], ['0512', '3', '11'], ['0315', '-1', '296'], ['0118', '-2', '52'], ['0308', '-3', '488'], ['0102', '-3', '293'], ['0495', '-1', '146'], ['0386', '-3', '146']]

How might I only add a sublist if the first item of the new sublist isn't in any of the other sublists? Please help!

This might be a quick fix. Replace the line:

if l1 not in x:

With:

if not any(c3 == temp[0] for temp in x):

This checks whether c3 (the first element of the l1 sublist) equals the first element of any sublist already in x. The new sublist is added only when no existing sublist starts with the same contig number.
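If x grows large, scanning every sublist on each insertion can get slow. A possible alternative, sketched here under the assumption that contig numbers are hashable strings as in the output above, is to keep a separate set of contig numbers already seen. The names add_if_new_contig and seen_contigs are made up for this illustration and are not part of the original code.

```python
def add_if_new_contig(rows, seen, row):
    """Append row to rows only if its first item (the contig number) is new."""
    contig = row[0]
    if contig in seen:
        return False  # a sublist with this contig number was already kept
    seen.add(contig)
    rows.append(row)
    return True


x = []
seen_contigs = set()  # hypothetical helper set, maintained alongside x

# Sample sublists taken from the output shown in the question
candidates = [
    ['1568', '-1', '12'],
    ['0003', '1', '12'],
    ['1568', '-1', '1'],   # same contig as the first entry, so it is skipped
    ['0102', '-3', '293'],
]
for candidate in candidates:
    add_if_new_contig(x, seen_contigs, candidate)

print(x)
# [['1568', '-1', '12'], ['0003', '1', '12'], ['0102', '-3', '293']]
```

In the parsing loop this would replace the `if l1 not in x:` check and the `x.extend([l1])` call with a single `add_if_new_contig(x, seen_contigs, l1)`. The first sublist seen for each contig is the one that is kept, and the original order is preserved.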

