
Regular expression to find certain bases in a sequence

In my code, what I'm trying to do is clean up a FastA file by only including the letters A,C,T,G,N, and U in the output string. I'm trying to do this through a regular expression, which looks like this:

newFastA = re.findall(r'A,T,G,C,U,N', self.fastAsequence)  # trying to extract all of the listed bases from my fastAsequence
print(newFastA)

However, I am not getting all the occurrences of the bases in order. I think the format of my regular expression is incorrect, so if you could let me know what mistake I've made, that would be great.

I'd avoid regex entirely. You can use str.translate to remove the characters you don't want.

from string import ascii_letters

removechars = ''.join(set(ascii_letters) - set('ACTGNU'))

newFastA = self.fastAsequence.translate(None, removechars)

demo:

dna = 'ACTAGAGAUACCACG this will be removed GNUGNUGNU'

dna.translate(None, removechars)
Out[6]: 'ACTAGAGAUACCACG     GNUGNUGNU'

If you want to remove whitespace too, you can toss string.whitespace into removechars.

Side note: the above only works in Python 2. In Python 3 there's an additional step, because str.translate takes a translation table built with str.maketrans:

from string import ascii_letters, punctuation, whitespace

#showing how to remove whitespace and punctuation too in this example
removechars = ''.join(set(ascii_letters + punctuation + whitespace) - set('ACTGNU'))

trans = str.maketrans('', '', removechars)

dna.translate(trans)
Out[11]: 'ACTAGAGAUACCACGGNUGNUGNU'

import re

# Remove every character that is not one of A, C, T, G, N, U.
print(re.sub("[^ACTGNU]", "", fastA_string))

to go with the million other answers you'll get

or without re

print("".join(filter(lambda character: character in set("ACTGUN"), fastA_string)))
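A minimal sketch of the same non-regex idea, written for Python 3. The function name keep_bases and the constant ALLOWED are made up for illustration, and the sample string is borrowed from the demo in the first answer. Building the set once, rather than inside the lambda, avoids recreating it for every character.

```python
# Hypothetical helper names; only the allowed-letter set comes from the question.
ALLOWED = frozenset("ACTGUN")

def keep_bases(fasta_string):
    """Return fasta_string with every character outside ALLOWED dropped."""
    return "".join(ch for ch in fasta_string if ch in ALLOWED)

sample = "ACTAGAGAUACCACG this will be removed GNUGNUGNU"
print(keep_bases(sample))
```

Note that this is case-sensitive, so lowercase a, c, t, g, n and u are dropped as well.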

You need to use a character set.

re.findall(r"[ATGCUN]", self.fastAsequence)

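Your code looks for the LITERAL string "A,T,G,C,U,N" (commas included) and outputs every occurrence of that exact text. A character set tells the regex engine to match any one of the characters A, T, G, C, U, N rather than the whole string "A,T,G,C,U,N". Note that re.findall returns a list of matches, so join the results if you want a single string.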
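To make the difference concrete, here is a small sketch comparing the two patterns. The sample string is invented for illustration and deliberately contains the literal text "A,T,G,C,U,N" so that the first pattern has something to match.

```python
import re

# Hypothetical sample input, not taken from the question.
sequence = "ACTG xyz A,T,G,C,U,N GNU"

# Literal pattern: matches only the exact text "A,T,G,C,U,N".
literal_matches = re.findall(r"A,T,G,C,U,N", sequence)

# Character set: matches any single one of the six letters.
base_matches = re.findall(r"[ATGCUN]", sequence)

# findall returns a list of one-character strings; join them into one string.
cleaned = "".join(base_matches)

print(literal_matches)  # ['A,T,G,C,U,N']
print(cleaned)          # ACTGATGCUNGNU
```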
