I am trying to write a program that loops through a string of RNA bases, finds the start codon ('AUG'), groups the following bases into codons of three (e.g. 'GAA', 'ACC'), looks up the corresponding amino acid in a dictionary, builds a string of the resulting amino acids, and keeps going until it hits a stop codon ('UAA', 'UGA', 'UAG'). RNA is read in groups of three, starting from a start codon and ending at a stop codon.
The problem is that the check for the three stop codons does not work when I list all three in the same if statement. The stop codon is then looked up in the dictionary, treated as an unknown ( .get(codon, 'X') ), and listed as an 'X' in the protein:
a_seq = 'AAAAUGGAAUAAACC'
kmer_size = 3
for start in range (0,len(a_seq)- kmer_size+1,1):
    kmer = a_seq[start:start+kmer_size]
    if kmer == 'AUG':
        start_codon = a_seq.index(kmer)
        new_seq = a_seq[start_codon:]
        last_codon_start = len(new_seq) - 2
        dictionary = {'AUG':'M',
                      'GAA':'E',
                      'ACC':'T'}
        protein = ''
        for start in range(0, last_codon_start, 3):
            codon = new_seq[start:start+3]
            print(codon)
            if codon != 'UAA' or codon != 'UGA' or codon != 'UAG':
                amino_acid = dictionary.get(codon,'X')
                protein += amino_acid
            else:
                break
        print(protein)
        break
Output:
AUG
GAA
UAA
ACC
MEXT
If I only list a single stop codon, then it works:
if codon != 'UAA':
AUG
GAA
UAA
ME
Both proteins should be 'ME'. I expect it to stop as soon as it hits any of the three stop codons. What is wrong with my if statement?
Changing or to and corrects that one line:
if codon != 'UAA' and codon != 'UGA' and codon != 'UAG':
If you say "not equal to x or not equal to y", the condition will always be true. Simplifying a bit:
if x != 1 or x != 2:
No matter what x is, this condition is always true. No number is equal to both 1 and 2, so at least one of the two comparisons always holds, even when x is 1 or 2.
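As a quick check of that reasoning, here is a minimal sketch that evaluates both forms of the condition for a few codons. The sample list and the names or_results and and_results are only illustrative and do not come from the question.

```python
samples = ['AUG', 'GAA', 'UAA', 'UGA', 'UAG']

# Chained with `or`: a codon can equal at most one of the three stop codons,
# so at least two comparisons are True and the whole expression is always True.
or_results = [c != 'UAA' or c != 'UGA' or c != 'UAG' for c in samples]

# Chained with `and`: False as soon as the codon matches any stop codon.
and_results = [c != 'UAA' and c != 'UGA' and c != 'UAG' for c in samples]

for codon, with_or, with_and in zip(samples, or_results, and_results):
    print(codon, with_or, with_and)
```

The or column is True on every row, which is why the loop in the question never reaches its else branch.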
But the clearest way to write this line is:
if codon not in ('UAA', 'UGA', 'UAG'):
One final thought: you could add the stop codons to your dictionary and have them yield a marker value on which you trigger the break. This would address @Sam Mason's point about the efficiency of hash lookups, as well as saving a separate check in the main loop.
dictionary = {'AUG': 'M',
              'GAA': 'E',
              'ACC': 'T',
              'UAA': '*',
              'UGA': '*',
              'UAG': '*',
              }
protein = ''
for start in range(0, last_codon_start, 3):
    codon = new_seq[start:start+3]
    print(codon)
    amino_acid = dictionary.get(codon,'X')
    if amino_acid == '*':
        break
    protein += amino_acid
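To try this idea in isolation, here is a small self-contained sketch. The function name translate_from_start and the constant CODON_TABLE are hypothetical, and the table is deliberately partial (a real one has 64 entries). It assumes the input already begins at the start codon, as new_seq does in the question.

```python
# Partial codon table for illustration only; '*' marks a stop codon.
CODON_TABLE = {
    'AUG': 'M', 'GAA': 'E', 'ACC': 'T',
    'UAA': '*', 'UGA': '*', 'UAG': '*',
}

def translate_from_start(new_seq):
    """Translate codon by codon until a stop codon or the end of the sequence."""
    protein = ''
    # len(new_seq) - 2 is the last index where a complete codon can begin,
    # so a trailing 1- or 2-base remainder is never looked up.
    for start in range(0, len(new_seq) - 2, 3):
        amino_acid = CODON_TABLE.get(new_seq[start:start + 3], 'X')  # 'X' = unknown codon
        if amino_acid == '*':
            break
        protein += amino_acid
    return protein

print(translate_from_start('AUGGAAUGAACC'))  # ME
print(translate_from_start('AUGGAAUAAACC'))  # ME
```

One dictionary lookup per codon now covers both the translation and the stop test, so no separate comparison is needed.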
One more thing: the for loop could be simplified slightly by using the textwrap module from the Python standard library.
from textwrap import wrap
...
...
for codon in wrap(new_seq, 3):
    print(codon)
    # etc.
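One difference from the range-based loop is worth checking before switching: when the length is not a multiple of three, wrap still returns the leftover bases as a shorter final chunk. The sketch below shows this and filters the chunk out. The 11-base example sequence is made up for illustration, and it assumes the sequence contains no whitespace, since wrap treats whitespace as a word separator.

```python
from textwrap import wrap

new_seq = 'AUGGAAUGAAC'          # 11 bases: three codons plus 2 leftover bases

chunks = wrap(new_seq, 3)
print(chunks)                    # ['AUG', 'GAA', 'UGA', 'AC']

# Keep only complete codons so the leftover 'AC' is never looked up.
codons = [c for c in chunks if len(c) == 3]
print(codons)                    # ['AUG', 'GAA', 'UGA']
```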
I think it would be more readable to reverse the logic of the inner if that checks for stop codons:
if codon == 'UAA' or codon == 'UGA' or codon == 'UAG':
However, it would be more efficient to do the equivalent by storing all the possibilities in a set, which makes checking for membership both simpler and faster.
Here's what I mean (note that I also took the creation of the constants out of the loop):
START_CODONS = {'AUG': 'M',
                'GAA': 'E',
                'ACC': 'T'}
STOP_CODONS = {'UAA', 'UGA', 'UAG'}
a_seq = 'AAAAUGGAAUGAACC'
kmer_size = 3
for start in range (0, len(a_seq)-kmer_size+1, 1):
    kmer = a_seq[start: start+kmer_size]
    if kmer == 'AUG':
        start_codon = a_seq.index(kmer)
        new_seq = a_seq[start_codon:]
        last_codon_start = len(new_seq) - 2
        protein = ''
        for start in range(0, last_codon_start, 3):
            codon = new_seq[start: start+3]
            print(codon)
            # if codon == 'UAA' or codon == 'UGA' or codon == 'UAG':
            if codon in STOP_CODONS:
                break
            else:
                amino_acid = START_CODONS.get(codon, 'X')
                protein += amino_acid
        print('protein:', protein)
        break
Output:
AUG
GAA
UGA
protein: ME
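A side observation, offered as a sketch rather than as part of the answer above: the outer loop stops at the first 'AUG' and then calls a_seq.index(kmer), which also returns the first occurrence, so a single str.find call appears to locate the same position. The -1 handling below is an added assumption about what should happen when no start codon exists.

```python
a_seq = 'AAAAUGGAAUGAACC'

start_codon = a_seq.find('AUG')      # index of the first 'AUG', or -1 if absent
if start_codon == -1:
    new_seq = ''                     # assumption: nothing to translate without a start codon
else:
    new_seq = a_seq[start_codon:]

print(start_codon, new_seq)          # 3 AUGGAAUGAACC
```

The inner translation loop can then run on new_seq exactly as before.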