
Python: iterate over a list and select only strings made up of certain characters

I want to iterate over the kmers list and select only the items that consist entirely of the characters A, T, G and C.

kmers=["AL","AT","GC","AA","AP"]

for kmer in kmers:       
    for letter in kmer:
        if letter not in ["A","T","G","C"]:
            pass
        else:
            DNA_kmers.append(kmer)
            print("DNA_kmers",DNA_kmers)

output:

DNA_kmers ['AL', 'AT', 'AT', 'GC', 'GC', 'AA', 'AA', 'AP']

desired output:

DNA_kmers=["AT","GC","AA"]

The only method I know is

if "B" in kmer or "D" in kmer or "E" in kmer or "F" in kmer or "H" in kmer or "I" in kmer or "J" in kmer or "K" in kmer or "L" in kmer or "M" in kmer or "N" in kmer or "O" in kmer or "P" in kmer or "Q" in kmer or "R" in kmer or "S" in kmer or "U" in kmer or "V" in kmer or "W" in kmer or "X" in kmer or "Y" in kmer or "Z" in kmer:
   pass

Your code currently appends an item once for every character that matches, so an item is added even when only one of its characters is valid. We can adjust it to add only the items where all characters match:

kmers=["AL","AT","GC","AA","AP"]
DNA_kmers =[]

for kmer in kmers:       
    for letter in kmer:
        if letter not in ["A","T","G","C"]:
            break
    else:
        DNA_kmers.append(kmer)

print("DNA_kmers",DNA_kmers)

If you aren't familiar with it, this uses Python's else clause on the for loop, which isn't available in all languages. The else block runs if and only if the loop completes all of its iterations without hitting break.
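
To make the for/else behaviour concrete, here is a minimal sketch. The helper name first_invalid and its alphabet parameter are made up for illustration and are not part of the question's code:

```python
def first_invalid(kmer, alphabet="ATGC"):
    """Return the first letter not in alphabet, or None if every letter is valid."""
    for letter in kmer:
        if letter not in alphabet:
            found = letter
            break          # leaving the loop early skips the else clause
    else:
        found = None       # reached only when the loop never hit break
    return found

print(first_invalid("AT"))  # None
print(first_invalid("AL"))  # L
print(first_invalid(""))    # None (an empty loop also runs the else clause)
```

Note the last case: a loop over an empty string never breaks, so the else block still runs.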

There are much simpler ways to do this. For example, a list comprehension combined with all() gets the job done:

kmers=["AL","AT","GC","AA","AP"]

allowed = set("AGCT")
print([k for k in kmers if all(c in allowed for c in k)])
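
An equivalent way to express the same test is a set comparison: a kmer is valid when its set of letters is a subset of the allowed set. This is only a sketch of that alternative; is_dna and dna_kmers are illustrative names, not something from the question:

```python
kmers = ["AL", "AT", "GC", "AA", "AP"]
allowed = set("AGCT")

def is_dna(kmer):
    # True when every character of kmer is in allowed
    return allowed.issuperset(kmer)

dna_kmers = [k for k in kmers if is_dna(k)]
print(dna_kmers)  # ['AT', 'GC', 'AA']
```

Like the all() version, this treats an empty string as valid.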

A general-purpose solution that is usually faster is to use a regular expression:

import re

kmers=["AL","AT","GC","AA","AP"]
r = re.compile("^[ATGC]*$")
print([k for k in kmers if r.match(k)])
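
Two edge cases are worth knowing about with the pattern above: it accepts the empty string (because of *), and $ also matches just before a trailing newline. If either matters for your data, one possible variant is fullmatch with +, sketched below. The names loose and strict are just labels for this comparison:

```python
import re

kmers = ["AL", "AT", "GC", "AA", "AP"]

loose = re.compile("^[ATGC]*$")   # the pattern used above
strict = re.compile("[ATGC]+")    # used with fullmatch, so no anchors are needed

print([k for k in kmers if strict.fullmatch(k)])  # ['AT', 'GC', 'AA']

# Edge cases where the two differ
print(bool(loose.match("AT\n")), bool(strict.fullmatch("AT\n")))  # True False
print(bool(loose.match("")), bool(strict.fullmatch("")))          # True False
```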

If we limit the problem to k-mers where k=2, we can optimize further. The regex should get slightly faster when matching a fixed-length string, for example by using [AGCT]{2} in place of [ATGC]* in the pattern. We can also use itertools.product to build a set of every valid 2-mer, which gives constant-time lookups:

import itertools

kmers=["AL","AT","GC","AA","AP"]

allowed = {a+b for a,b in itertools.product("AGCT", repeat=2)}
print([k for k in kmers if k in allowed])
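
The same idea extends to other fixed values of k. The sketch below assumes a hypothetical helper all_dna_kmers and uses made-up 3-mers as sample input; keep in mind that the set holds 4**k strings, so this only pays off for small k:

```python
import itertools

def all_dna_kmers(k, alphabet="ACGT"):
    """Build the set of every string of length k over alphabet."""
    return {"".join(p) for p in itertools.product(alphabet, repeat=k)}

kmers = ["ALT", "ATG", "GCC", "AAP"]  # sample 3-mers for illustration
allowed = all_dna_kmers(3)

print(len(allowed))                        # 64
print([x for x in kmers if x in allowed])  # ['ATG', 'GCC']
```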


 