How to track the position of a start codon (ATG) in a nucleotide sequence after using the translate function of Biopython?

I have a FASTA file containing a number of sequences in the following format:

>BMRat|XM_008846946.1
ATGAAGAACATCACAGAAGCCACCACCTTCATTCTCAAGGGACTCACAGACAATGTGGAACTACAGGTCA
TCCTCTTTTTTCTCTTTCTAGCGATTTATCTCTTCACTCTCATAGGAAATTTAGGACTTATTATTTTAGT
TATTGGGGATTCAAAACTCCACAACCCTATGTACTGTTTTCTGAGTGTATTGTCTTCTGTAGATGCCTGC
TATTCCTCAGACATCACCCCGAATATGTTAGTAGGCTTCCTGTCAAAAAACAAAGGCATTTCTCTCCATG
GATGTGCAACACAGTTGTTTCTCGCTGTTACTTTTGGAACCACAGAATGCTTTCTGTTGGCGGCAATGGC
TTATGACCGCTATGTAGCCATCCATGACCCACTTCTCTATGCAGTGAGCATGTCACCAAGGATCTATGTG
CCGCTCATCATTGCTTCCTATGCTGGTGGAATTCTGCATGCGATTATCCACACCGTGGCCACCTTCAGCC
TGTCCTTCTGTGGATCTAATGAAATCAGTCATATATTCTGTGACATCCCTCCTCTGCTGGCTATTTCTTG
TTCTGACACTTACATCAATGAGCTCCTGTTGTTCTTCTTTGTGAGCTCCATAGAAATAGTCACTATCCTC
ATCATCCTGGTCTCTTATGGTTTCATCCTTATGGCCATTCTGAAGATGAATTCAGCTGAAGGGAGGAGAA
AAGTCTTCTCTGCATGTGGGTCTCACCTAACTGGAGTGTCCATTTTCTATGGGACAAGCCTTTTCATGTA
TGTGAGACCAAGCTCCAACTATTCCTTGGCACATGACATGGTAGTGTCGACATTTTATACCATTGTGATT
CCCATGCTGAACCCTGTCATCTACAGTCTGAGGAACAAAGATGTGAAAGAGGCAATGAGAAGATTTTTGA
AGAAAAATTTTCAGAAACTTTAA

The code below, written with Biopython (http://biopython.org/wiki/Seq), finds, for each sequence in the FASTA file, the longest amino acid sequence that starts with methionine and ends at a stop codon.

The function is find_largest_polypeptide_in_DNA. It translates the DNA sequence into amino acids in each of the 3 forward reading frames and stores in the variable allPossibilities the segments that start with M (methionine) and end at a stop codon. It then compares the lengths of these candidates and returns the protein sequence of the longest one.

from Bio.Seq import Seq

def find_largest_polypeptide_in_DNA(seq, translationTable=1):
    allPossibilities = []
    # Translate each of the three forward reading frames.
    for frame in range(3):
        trans = str(seq[frame:].translate(translationTable))
        # Split at stop codons; keep each segment from its first M onward.
        framePossibilitiesF = [i[i.find("M"):] for i in trans.split("*") if "M" in i]
        allPossibilities += framePossibilitiesF
    allPossibilitiesLengths = [len(i) for i in allPossibilities]

    if len(allPossibilitiesLengths) == 0:
        raise Exception("no candidate ORFs")

    proteinAsString = allPossibilities[allPossibilitiesLengths.index(max(allPossibilitiesLengths))]

    # Biopython 1.78 removed Bio.Alphabet, so no alphabet argument is passed.
    return Seq(proteinAsString)

It works perfectly, but now I also want the DNA sequence that corresponds to the protein sequence returned by the function. I need to add some lines to the function so that it returns both sequences, but I don't really know how. I don't know whether it's possible to record the position of each methionine found by i.find("M") and then use that position to locate it in the nucleotide sequence.

Thanks.

I think it would be easiest to write a new function following similar principles. Your idea "to track the position of each Methionine of the i.find('M')" is basically what's done below. The difficulty in doing this with the code you're starting with is that the sequences get chopped up by the split('*'), so the DNA starting position is the reading frame offset plus three times the number of codons in all segments preceding the sequence of concern (counting the stop codons removed by the split). Per your clarification, I added an enclosing loop to iterate across forward and backward directions.
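To make the offset arithmetic concrete, here is a minimal sketch in plain Python that keeps the split('*') approach and carries the positions along. It assumes `trans` is the protein string already obtained by translating `seq[frame:]`, with `*` marking stop codons; the helper name `orf_nucleotide_spans` is hypothetical and is not part of Biopython.

```python
def orf_nucleotide_spans(trans, frame):
    """Yield (protein, start, end) for each M...* stretch in one translated frame.

    start and end are 0-based, end-exclusive nucleotide coordinates in the
    DNA that was translated; the span includes the stop codon.
    """
    aa_offset = 0  # amino acids consumed by earlier segments, stops included
    segments = trans.split("*")
    # The last segment is not followed by a stop codon, so it is skipped.
    for segment in segments[:-1]:
        m = segment.find("M")
        if m != -1:
            protein = segment[m:]
            start = frame + 3 * (aa_offset + m)
            end = start + 3 * (len(protein) + 1)  # +1 codon for the stop
            yield protein, start, end
        aa_offset += len(segment) + 1  # +1 for the '*' removed by split


spans = list(orf_nucleotide_spans("KMA*LMQR*MK", frame=1))
```

Here `spans` is `[("MA", 4, 13), ("MQR", 16, 28)]`: the first M is the second amino acid of frame 1, so it starts at nucleotide 1 + 3*1 = 4, and the trailing "MK" is dropped because no stop codon follows it.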

def find_largest_polypeptide_in_DNA(seq, translationTable=1):
    # Set the record to start with, then try to beat it.
    longest_DNA = ''
    longest_amino_acid_sequence = 0  # length in codons, stop codon included

    for direction in [-1, 1]:
        # Note: seq[::-1] only reverses the sequence. To scan the opposite
        # strand instead, use seq.reverse_complement() for direction -1.
        forward_DNA = seq[::direction]
        # Check all three reading frames in this direction.
        for frame in range(3):
            trans = str(forward_DNA[frame:].translate(translationTable))
            cut_codons = 0  # codons already removed from the front of trans
            while 'M' in trans:
                codons_before_Met = trans.find('M')
                cut_codons += codons_before_Met
                trans = trans[codons_before_Met:]
                if '*' in trans:
                    # ORF length in codons, including the stop codon.
                    length = trans.find('*') + 1
                    if length > longest_amino_acid_sequence:
                        longest_amino_acid_sequence = length
                        first_bp = frame + 3*cut_codons
                        last_bp = first_bp + 3*length  # exclusive slice end
                        longest_DNA = str(forward_DNA[first_bp:last_bp])
                    # Skip past this ORF and keep the codon count in sync.
                    cut_codons += length
                    trans = trans[length:]
                else:
                    # No stop codon: the ORF runs past the end of the
                    # sequence, so it is ignored.
                    trans = ''
    return longest_DNA
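Since the question asks for the position as well as the sequence, an alternative is to walk the codons on the DNA string directly, so the start coordinate never has to be reconstructed from the protein. The sketch below is one possible way to do that, not the method used above. It is plain Python, covers only the forward strand, and assumes the standard genetic code (table 1) stop codons; `longest_orf_with_position` and `STOP_CODONS` are hypothetical names.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard table (table 1)


def longest_orf_with_position(dna):
    """Return (start, end, orf_dna) for the longest ATG...stop ORF.

    Coordinates are 0-based and end-exclusive on the forward strand, and
    the stop codon is included. Returns None if no complete ORF exists.
    """
    dna = dna.upper()
    best = None
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                end = pos + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end, dna[start:end])
                start = None
    return best


result = longest_orf_with_position("CATGAAATTTTAGCC")
```

For this toy input `result` is `(1, 13, "ATGAAATTTTAG")`. With a Biopython record you could pass `str(record.seq)`, then obtain the matching protein with `Seq(orf_dna).translate(table=1, to_stop=True)`. To cover the other strand, run the same function on `str(record.seq.reverse_complement())`, keeping in mind that the coordinates then refer to the reverse-complemented string.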
