
python: iterate through dynamic list

From a set of sequences (strings), I want to generate a dictionary of subsets. Each sequence is a key, and its value should be a list of sequences that match in at most "match" (e.g. 1) positions. That limit has to hold both against the original sequence (the key) and against every value already in the subset at that time.

For example, considering all sequences of length 3 consisting of "A", "C", "G" and "T", one possible key, value pair is the one shown under "What I want" below (there is more than one way to select such a set).
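The set of all length-3 sequences does not have to be typed out by hand. A minimal sketch, assuming the standard library's itertools.product; the names ALPHABET and make_superset are illustrative and do not come from the original post:

```python
from itertools import product

ALPHABET = "ACGT"  # assumed alphabet, taken from the example above

def make_superset(length, alphabet=ALPHABET):
    """Return every sequence of the given length over the alphabet."""
    return [''.join(chars) for chars in product(alphabet, repeat=length)]

superset = make_superset(3)
print(len(superset))   # 64, i.e. 4 ** 3
print(superset[:4])    # ['AAA', 'AAC', 'AAG', 'AAT']
```

The order of the generated list follows the order of the characters in the alphabet string.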

This is the definition I came up with:

def pick(seq,superset):
    subset = [seq]
    for seq in subset:
        count = 0
        for item in superset:
            if len([i for i, j in zip(list(seq),list(item)) if i==j])==match:
                count += 1
                if len(subset)==count:
                    subset += [''.join(item)]
    return subset

What I get:

{'AAA': ['AAA', 'ACC', 'ACG', 'ACT', 'AGC', 'AGG', 'AGT', 'ATC', 'ATG', 'ATT', 'CAC', 'CAG',
'CAT', 'CCA', 'CGA', 'CTA', 'GAC', 'GAG', 'GAT', 'GCA', 'GGA', 'GTA', 'TAC', 'TAG', 'TAT',
'TCA', 'TGA', 'TTA']}

What I want:

{'AAA': ['CCC','GGG','TTT','ACG','CGT','GTA','TAC']}

The issue I run into is that the subset I generate only guarantees that each value matches the key in no more than one position. The values themselves still match other values in the subset in more than one position. Does anyone have an (elegant) solution to this problem?
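To make the requirement concrete, here is a small check that tests whether a key and its values satisfy the limit pairwise. It is only a sketch of the condition described above; the helper names similarity and is_valid_subset are hypothetical:

```python
from itertools import combinations

def similarity(a, b):
    """Count the positions at which two equal-length sequences agree."""
    return sum(1 for p, q in zip(a, b) if p == q)

def is_valid_subset(key, values, match):
    """True if the key and all values pairwise agree in at most `match` positions."""
    members = [key] + list(values)
    return all(similarity(a, b) <= match for a, b in combinations(members, 2))

# 'ACC' and 'ACG' each share one position with 'AAA' but two with each other.
print(is_valid_subset('AAA', ['ACC', 'ACG'], 1))   # False

# The subset from "What I want" satisfies the limit for every pair.
wanted = ['CCC', 'GGG', 'TTT', 'ACG', 'CGT', 'GTA', 'TAC']
print(is_valid_subset('AAA', wanted, 1))           # True
```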

I'm interpreting your question as "I want to get a list of all items in superset that have between 0 and match matching characters with seq. But right now my function returns a list of all items that have exactly match matching characters. Also, the first element of the returned list is equal to seq, which I don't want."

The first problem occurs because you use "==" when comparing to match, instead of "<=". The second problem occurs because you initialize subset to contain seq even though you don't need to. It's also unnecessary to have two for loops: a single pass over superset is enough. Also, consider using append instead of += when adding a single item to a list, since += with a one-element list builds a temporary list on every call.

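def pick(seq, superset, match):
    # Keep every item that agrees with seq in at most `match` positions.
    subset = []
    for item in superset:
        # Strings can be zipped directly; no need to convert them to lists.
        matches = sum(1 for i, j in zip(seq, item) if i == j)
        if matches <= match:
            subset.append(item)
    return subset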

superset = [
    'GGG', 'GGC', 'GGA', 'GGT', 'GCG', 'GCC', 'GCA', 'GCT', 'GAG', 'GAC', 'GAA', 'GAT', 'GTG', 'GTC', 'GTA', 'GTT', 
    'CGG', 'CGC', 'CGA', 'CGT', 'CCG', 'CCC', 'CCA', 'CCT', 'CAG', 'CAC', 'CAA', 'CAT', 'CTG', 'CTC', 'CTA', 'CTT', 
    'AGG', 'AGC', 'AGA', 'AGT', 'ACG', 'ACC', 'ACA', 'ACT', 'AAG', 'AAC', 'AAA', 'AAT', 'ATG', 'ATC', 'ATA', 'ATT', 
    'TGG', 'TGC', 'TGA', 'TGT', 'TCG', 'TCC', 'TCA', 'TCT', 'TAG', 'TAC', 'TAA', 'TAT', 'TTG', 'TTC', 'TTA', 'TTT'
]

seq = "AAA"

print(pick(seq, superset, 1))

Result (newlines added by me for clarity):

['GGG', 'GGC', 'GGA', 'GGT', 'GCG', 'GCC', 'GCA', 'GCT', 'GAG', 'GAC', 'GAT', 'GTG', 'GTC', 'GTA', 'GTT', 
'CGG', 'CGC', 'CGA', 'CGT', 'CCG', 'CCC', 'CCA', 'CCT', 'CAG', 'CAC', 'CAT', 'CTG', 'CTC', 'CTA', 'CTT', 
'AGG', 'AGC', 'AGT', 'ACG', 'ACC', 'ACT', 'ATG', 'ATC', 'ATT', 
'TGG', 'TGC', 'TGA', 'TGT', 'TCG', 'TCC', 'TCA', 'TCT', 'TAG', 'TAC', 'TAT', 'TTG', 'TTC', 'TTA', 'TTT']

Edit: if each potential item must also satisfy the matching criteria with every other existing element of the subset, you can check this using all and a generator expression. Note that the value returned by the function will depend on the order of superset, since the loop is greedy and there are multiple different local maxima that could satisfy the criteria.

def similarity(a,b):
    return sum(1 for p,q in zip(a,b) if p==q)

def pick(seq, superset, match):
    subset = []
    for item in superset:
        if similarity(item, seq) <= match and all(similarity(item, x) <= match for x in subset):
            subset.append(item)
    return subset

superset = [
    'GGG', 'GGC', 'GGA', 'GGT', 'GCG', 'GCC', 'GCA', 'GCT', 'GAG', 'GAC', 'GAA', 'GAT', 'GTG', 'GTC', 'GTA', 'GTT', 
    'CGG', 'CGC', 'CGA', 'CGT', 'CCG', 'CCC', 'CCA', 'CCT', 'CAG', 'CAC', 'CAA', 'CAT', 'CTG', 'CTC', 'CTA', 'CTT', 
    'AGG', 'AGC', 'AGA', 'AGT', 'ACG', 'ACC', 'ACA', 'ACT', 'AAG', 'AAC', 'AAA', 'AAT', 'ATG', 'ATC', 'ATA', 'ATT', 
    'TGG', 'TGC', 'TGA', 'TGT', 'TCG', 'TCC', 'TCA', 'TCT', 'TAG', 'TAC', 'TAA', 'TAT', 'TTG', 'TTC', 'TTA', 'TTT'
]

seq = "AAA"

print(pick(seq, superset, 1))

Result:

['GGG', 'GCC', 'GAT', 'GTA', 'CGC', 'CCG', 'CTT', 'AGT', 'ATG', 'TGA', 'TCT', 'TAG', 'TTC']
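The question asked for a dictionary with one subset per key, which can be built by calling pick once per sequence. The sketch below repeats the two functions so it runs on its own, and assumes itertools.product with the alphabet order "GCAT" to reproduce the superset listed above. It also reverses the superset to illustrate the order dependence:

```python
from itertools import product

def similarity(a, b):
    return sum(1 for p, q in zip(a, b) if p == q)

def pick(seq, superset, match):
    subset = []
    for item in superset:
        if similarity(item, seq) <= match and all(similarity(item, x) <= match for x in subset):
            subset.append(item)
    return subset

# "GCAT" ordering yields 'GGG', 'GGC', 'GGA', ... as in the list above.
superset = [''.join(chars) for chars in product('GCAT', repeat=3)]

# One greedy subset per key.
subsets = {seq: pick(seq, superset, 1) for seq in superset}
print(subsets['AAA'])

# Same key, reversed iteration order: the greedy pass starts from 'TTT'
# instead of 'GGG' and ends up with a different list.
print(pick('AAA', superset[::-1], 1))
```

Each call is greedy, so none of these subsets is guaranteed to be the largest one possible.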


 