
Rosalind: translating RNA into protein in Python

Here is my solution to the Rosalind problem "Translating RNA into Protein".

def prot(rna):
  for i in xrange(3, (5*len(rna))//4+1, 4):
    rna=rna[:i]+','+rna[i:]
  rnaList=rna.split(',')
  bases=['U','C','A','G']
  codons = [a+b+c for a in bases for b in bases for c in bases]
  amino_acids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
  codon_table = dict(zip(codons, amino_acids))
  peptide=[]
  for i in range (len (rnaList)):
    if codon_table[rnaList[i]]=='*':
      break
    peptide+=[codon_table[rnaList[i]]]
  output=''
  for i in peptide:
    output+=str(i)
  return output

If I run prot('AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA'), I get the correct output 'MAMAPRTEINSTRING'. However, if the RNA sequence (the input string) is hundreds of nucleotides (characters) long, I get an error:

 Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "<stdin>", line 11, in prot
 KeyError: 'CUGGAAACGCAGCCGACAUUCGCUGAAGUGUAG'

Can you point me where I went wrong?

Given that you have a KeyError, the problem must be in one of your attempts to access codon_table[rnaList[i]]. You are assuming each item in rnaList is three characters long, but evidently, at some point, that stops being true and one of the items is 'CUGGAAACGCAGCCGACAUUCGCUGAAGUGUAG'.

This happens because the range bound (5*len(rna))//4+1 is evaluated once, from the original length of rna, while every rna = rna[:i]+','+rna[i:] makes the string one character longer. Separating every codon needs a last comma at index 4*len(rna)//3 - 5, and for any rna where len(rna) > 60 the loop stops before it gets there, so the last item in the list is longer than three characters. If there is a stop codon before you reach that item it isn't a problem, but if you reach it you get the KeyError.
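To see the cutoff concretely, here is a minimal sketch that isolates the comma-insertion step from the question. The helper name split_with_commas is made up for illustration, and range stands in for xrange so that it runs on Python 3.

```python
def split_with_commas(rna):
    """Reproduce the comma-insertion loop from the question."""
    # The upper bound is computed once, from the ORIGINAL length of rna,
    # even though rna grows by one character on every pass.
    for i in range(3, (5 * len(rna)) // 4 + 1, 4):
        rna = rna[:i] + "," + rna[i:]
    return rna.split(",")


for n in (60, 63, 120):
    chunks = split_with_commas("A" * n)
    # Print the input length, the number of chunks, and the last chunk's length.
    print(n, len(chunks), len(chunks[-1]))
# 60 20 3
# 63 20 6
# 120 38 9
```

At 60 nucleotides every chunk is a codon; from 63 onwards the last chunk swallows the codons the loop never reached, and it keeps growing with the input.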

I suggest you rewrite the start of your function, e.g. using the grouper recipe from the itertools documentation:

from itertools import zip_longest  # named izip_longest on Python 2

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

def prot(rna):
    # Each tuple from grouper holds three characters; join them into a codon.
    rnaList = ["".join(t) for t in grouper(rna, 3)]
    ...

Note also that you can use

peptide.append(codon_table[rnaList[i]])

and

return "".join(peptide)

to simplify your code.
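Putting these suggestions together, the whole function might look roughly like the sketch below. It is written for Python 3, and two things in it are my assumptions rather than part of the original code: the table is built at module level, and fillvalue="" is passed so that an incomplete trailing codon can be detected and ignored.

```python
from itertools import zip_longest  # named izip_longest on Python 2

bases = "UCAG"
codons = [a + b + c for a in bases for b in bases for c in bases]
amino_acids = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
codon_table = dict(zip(codons, amino_acids))


def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


def prot(rna):
    peptide = []
    # Assumption: pad with "" so a partial final group joins to a short string.
    for triplet in grouper(rna, 3, fillvalue=""):
        codon = "".join(triplet)
        if len(codon) < 3:
            break  # assumption: silently ignore an incomplete trailing codon
        amino = codon_table[codon]
        if amino == "*":
            break  # stop codon ends translation
        peptide.append(amino)
    return "".join(peptide)


print(prot("AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA"))
# MAMAPRTEINSTRING
```

Because the codons are produced by grouping rather than by inserting separators, the length of the input no longer matters.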

This does not answer your question, but note that you could solve this very succinctly using Biopython:

from Bio.Seq import Seq

def rna2prot(rna):
    # Bio.Alphabet was removed in Biopython 1.78, so a plain Seq is used here.
    # to_stop=True ends translation at the first stop codon.
    return str(Seq(rna).translate(to_stop=True))

For example:

>>> print(rna2prot('AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA'))
MAMAPRTEINSTRING

Your code for breaking the rna into 3-char blocks is a bit nasty; you spend a lot of time breaking and rebuilding strings to no real purpose.

Building the codon_table only needs to be done once, not every time your function is run.

Here is a simplified version:

from itertools import product, takewhile

# Build the lookup table once, at import time.
bases = "UCAG"
codons = ("".join(trio) for trio in product(bases, repeat=3))
amino_acids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
codon_table = dict(zip(codons, amino_acids))

def prot(rna):
    # Slice out complete codons only; a trailing partial codon is skipped.
    rna_codons = [rna[i:i+3] for i in range(0, len(rna) - 2, 3)]
    # Translate lazily and stop at the first stop codon ("*").
    aminos = takewhile(
        lambda amino: amino != "*",
        (codon_table[codon] for codon in rna_codons)
    )
    return "".join(aminos)
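The len(rna) - 2 bound in the slicing step is what guarantees that every item is exactly three characters, whatever the input length. A small sketch of just that step, with split_codons as a hypothetical helper name that is not part of the answer:

```python
def split_codons(rna):
    """Return the complete codons of rna, dropping any trailing remainder."""
    # The last start index allowed is len(rna) - 3, so every slice is full length.
    return [rna[i:i + 3] for i in range(0, len(rna) - 2, 3)]


print(split_codons("AUGGCCU"))        # ['AUG', 'GCC']
print(len(split_codons("A" * 63)))    # 21
```

The 63-nucleotide input is the shortest multiple of three that trips the comma-insertion approach from the question; here it simply yields 21 codons.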
