Protein to RNA codons

Question

I need to write code that takes a protein FASTA file and a protein sequence identifier and counts all the possible RNA combinations for that sequence, with the condition that the total number of combinations should be less than 5000.

I started by making an RNA codon dictionary, then wrote a function that puts the sequences from the FASTA file into a list, and then tried to build combinations from that list. I get an error and can't work out where the problem is. If anyone can check the code and tell me what's wrong, I would be grateful.

import itertools

from Bio import SeqIO

RNA_codon_table = {
'A': ('GCU', 'GCC', 'GCA', 'GCG'),
'C': ('UGU', 'UGC'),
'D': ('GAU', 'GAC'),
'E': ('GAA', 'GAG'),
'F': ('UUU', 'UUC'),
'G': ('GGU', 'GGC', 'GGA', 'GGG'),
'H': ('CAU', 'CAC'),
'I': ('AUU', 'AUC', 'AUA'),
'K': ('AAA', 'AAG'),
'L': ('UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG'),
'M': ('AUG',),
'N': ('AAU', 'AAC'),
'P': ('CCU', 'CCC', 'CCA', 'CCG'),
'Q': ('CAA', 'CAG'),
'R': ('CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'),
'S': ('UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC'),
'T': ('ACU', 'ACC', 'ACA', 'ACG'),
'V': ('GUU', 'GUC', 'GUA', 'GUG'),
'W': ('UGG',),
'Y': ('UAU', 'UAC'),}
 
def protein_fasta (protein_file):
   protein_sequence = []
   protein = SeqIO.parse(protein_file, format = 'fasta')
   for Seqrecord in protein: 
      protein_sequence.append(Seqrecord.seq)
   print (protein_sequence)


for seq in protein_sequence:
     codons = [ list(RNA_codon_table[key]) for key in protein_sequence ]
print(list(itertools.product(codons)))

I'm sorry, I don't know how to attach a FASTA file, but this is the sequence inside:

>seq_compl complete sequence
IEEATHMTPCYELHGLRWVQIQDYAINVMQCL

This is the error I get:

 KeyError                                  Traceback (most recent call last)
<ipython-input-65-3dd46947c505> in <module>
----> 1 all_combinations ('short_protein.fasta')

<ipython-input-64-45a50fffc1d9> in all_combinations(protein_file)
      5        protein_sequence.append(Seqrecord.seq)
      6 
----> 7    codons = [ list(RNA_codon_table[key]) for key in protein_sequence ]
      8    print(list(itertools.product(codons)))

<ipython-input-64-45a50fffc1d9> in <listcomp>(.0)
      5        protein_sequence.append(Seqrecord.seq)
      6 
----> 7    codons = [ list(RNA_codon_table[key]) for key in protein_sequence ]
      8    print(list(itertools.product(codons)))

 KeyError: Seq('IEEATHMTPCYELHGLRWVQIQDYAINVMQCL')

Answer 1

Based on your example, the protein_sequence variable is only defined in the local scope of the protein_fasta function.

You need to return the list from the function and assign the result to a variable before you can iterate over it.

For example, switch your print to a return:

def protein_fasta (protein_file):
   protein_sequence = []
   protein = SeqIO.parse(protein_file, format = 'fasta')
   for Seqrecord in protein: 
      protein_sequence.append(Seqrecord.seq)
   return protein_sequence

And make sure to call the function and assign its result:

protein_sequence = protein_fasta(protein_file)

Now you have something you can iterate over.

There is an additional problem with your for loop: you aren't doing anything with seq. The comprehension iterates over protein_sequence, so each key is a whole Seq object rather than a single amino acid, which is what raises the KeyError. Presumably protein_sequence should be swapped for seq there. I've also removed the list() wrapped around the RNA_codon_table lookup, as I don't think it's needed here, and unpacked codons with * so that itertools.product picks one codon per position:

for seq in protein_sequence:
    codons = [ RNA_codon_table[key] for key in seq ]
    print(list(itertools.product(*codons)))
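
To make the two fixes concrete, here is a minimal sketch that runs without Biopython. It assumes a plain string behaves like a Seq for this purpose (iterating over either gives single-letter residues); codon_table is a three-entry subset of the question's table and records is a stand-in for the list returned by protein_fasta, both introduced only for illustration.

```python
from itertools import product

# Small subset of the codon table from the question, enough for the demo.
codon_table = {
    'M': ('AUG',),
    'W': ('UGG',),
    'C': ('UGU', 'UGC'),
}

# Stand-in (assumption) for the list of sequences returned by protein_fasta.
records = ["MWC"]

# Iterating over the list yields whole sequences, which are not keys of the table.
try:
    [codon_table[key] for key in records]
except KeyError as err:
    print("KeyError:", err)

# Iterating over one sequence yields single residues, which are keys.
codons = [codon_table[residue] for residue in records[0]]

# Without unpacking, product sees one iterable and yields 1-tuples of codon tuples.
without_unpacking = list(product(codons))

# With unpacking, product picks one codon per position.
with_unpacking = ["".join(combo) for combo in product(*codons)]
print(with_unpacking)
```

The unpacked version gives the two RNA strings for this three-residue peptide, while the version without * just wraps each codon tuple in another tuple.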

Answer 2

Your protein string produces tens of trillions of combinations (about 3.76 × 10^13), so generate them lazily instead of building a list:

from itertools import product,islice
def protGen(proteins):
   for codons in product(*(RNA_codon_table[P] for P in proteins)):
       yield "".join(codons)

Counting combinations:

proteins = "IEEATHMTPCYELHGLRWVQIQDYAINVMQCL"
    
count = 1
for P in proteins: count *= len(RNA_codon_table[P])

print(count) # 37,572,373,905,408 combinations
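
The same count can be written with math.prod (Python 3.8+). This is only a sketch: CODON_COUNTS and count_rna_sequences are names made up for illustration, and the table simply records the length of each tuple in RNA_codon_table so the block runs on its own.

```python
from math import prod

# Number of codons per amino acid, i.e. the tuple lengths in RNA_codon_table.
CODON_COUNTS = {
    **dict.fromkeys("MW", 1),
    **dict.fromkeys("CDEFHKNQY", 2),
    **dict.fromkeys("I", 3),
    **dict.fromkeys("AGPTV", 4),
    **dict.fromkeys("LRS", 6),
}

def count_rna_sequences(protein):
    """Multiply the number of codon choices at each position."""
    return prod(CODON_COUNTS[residue] for residue in protein)

protein = "IEEATHMTPCYELHGLRWVQIQDYAINVMQCL"
total = count_rna_sequences(protein)
print(f"{total:,}")
```

Because the count is just a product of small integers, it is cheap to compute even when listing the sequences themselves would be impossible.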

Printing the first 500 sequences (only the first lines of the output are shown):

for protSeq in islice(protGen(proteins),500): # first 500
    print(protSeq)

AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGUUUA
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGUUUG
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGUCUU
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGUCUC
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGUCUA
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGUCUG
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGCUUA
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGCUUG
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGCCUU
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGCCUC
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGCCUA
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGCCUG
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAGUGUUUA
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAGUGUUUG
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAGUGUCUU
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAGUGUCUC
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAGUGUCUA
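
For the "fewer than 5000 combinations" condition in the question, one possible approach is to compute the count first and only enumerate when it is below the limit. The sketch below assumes that reading of the requirement; rna_sequences_under_limit is a hypothetical helper, and CODON_TABLE is a four-entry subset of the question's table used to keep the example short.

```python
from itertools import product
from math import prod

# Subset of the question's codon table, enough for a short demo peptide.
CODON_TABLE = {
    'M': ('AUG',),
    'W': ('UGG',),
    'C': ('UGU', 'UGC'),
    'I': ('AUU', 'AUC', 'AUA'),
}

def rna_sequences_under_limit(protein, table, limit=5000):
    """Return every RNA sequence for `protein`, or None if there are `limit` or more."""
    options = [table[residue] for residue in protein]
    total = prod(len(codons) for codons in options)
    if total >= limit:
        return None  # too many to list
    return ["".join(combo) for combo in product(*options)]

small = rna_sequences_under_limit("MICW", CODON_TABLE)
too_many = rna_sequences_under_limit("I" * 8, CODON_TABLE)  # 3**8 = 6561 combinations
print(len(small), too_many)
```

Checking the count before calling product avoids ever starting an enumeration that cannot finish, which is what would happen with the 32-residue sequence from the question.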
