
How do I check if a sequence is a protein sequence or not?

Given a random sequence, how can I check whether that sequence is a protein or not?

from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
my_prot = Seq("TGEKPYVCQECGKAFNCSSYLSKHQR")
my_prot

my_prot.alphabet  # How do I make the check here?

If your Seq object has an assigned alphabet, you can check if that alphabet is a protein alphabet:

# Requires Biopython older than 1.78, where Bio.Alphabet still exists.
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC, ProteinAlphabet
my_prot = Seq("TGEKPYVCQECGKAFNCSSYLSKHQR", alphabet=IUPAC.IUPACProtein())

print(isinstance(my_prot.alphabet, ProteinAlphabet))

However, if the alphabet isn't known, you'll have to use some heuristics to guess whether or not it's a protein sequence. This could be as simple as checking whether the sequence consists entirely of "ACG[TU]" (the nucleotide letters) or whether it uses other letter codes.

But this isn't perfect. For instance, the sequence "ATCG" could be alanine, threonine, cysteine, glycine (i.e. a protein), or it could be adenine, thymine, cytosine, guanine (DNA). Similarly, "ACG" could be a protein, RNA, or DNA. From the letters alone, it's technically impossible to be sure that a sequence is DNA and not a protein sequence. However, if you have a SeqRecord or other context for the Seq, you may be able to check if it's a protein sequence.
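The letter-based heuristic described above could be sketched roughly as follows. This is only an illustration: the function name `guess_sequence_type` and the three letter sets are assumptions made for this example, not part of Biopython. It deliberately returns "ambiguous" for cases like "ATCG" that fit both alphabets.

```python
# Hypothetical letter sets for this sketch (not Biopython objects).
DNA_LETTERS = set("ACGTN")
RNA_LETTERS = set("ACGUN")
PROTEIN_LETTERS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def guess_sequence_type(seq):
    """Guess the sequence type from its letters alone (a heuristic only)."""
    letters = set(str(seq).upper())
    if not letters:
        return "unknown"
    fits_nucleotide = letters <= DNA_LETTERS or letters <= RNA_LETTERS
    fits_protein = letters <= PROTEIN_LETTERS
    if fits_nucleotide and fits_protein:
        return "ambiguous"  # e.g. "ATCG" is valid as DNA and as a peptide
    if fits_protein:
        return "protein"
    if fits_nucleotide:
        return "nucleotide"
    return "unknown"


print(guess_sequence_type("TGEKPYVCQECGKAFNCSSYLSKHQR"))  # protein
print(guess_sequence_type("ATCG"))                        # ambiguous
print(guess_sequence_type("ACGU"))                        # nucleotide
```

An "ambiguous" result is the honest answer for short sequences made only of A, C, G, T or N; in that case some outside context is needed to decide.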

Apparently Biopython removed Bio.Alphabet.

Copying from https://www.biostars.org/p/102/

You can use:


import re

from Bio.Seq import Seq


def validate(seq, alphabet='dna'):
    # Case-insensitive patterns that match only the letters of each alphabet.
    alphabets = {'dna': re.compile('^[acgtn]*$', re.I),
                 'protein': re.compile('^[acdefghiklmnpqrstvwy]*$', re.I)}

    # True if every character of seq belongs to the chosen alphabet.
    return alphabets[alphabet].search(seq) is not None


dataz = 'AAAAAAACCCCCCCCCCCCCCDDDDDDRRRRRRRREERRRRGGG'
pippo = Seq(dataz)
print(pippo, type(pippo))
print(validate(str(pippo), 'dna'))
print(validate(str(pippo), 'protein'))

dataz = 'atg'
pippo = Seq(dataz)
print(pippo, type(pippo))
print(validate(str(pippo), 'dna'))
print(validate(str(pippo), 'protein'))

Output:

AAAAAAACCCCCCCCCCCCCCDDDDDDRRRRRRRREERRRRGGG <class 'Bio.Seq.Seq'>
False
True
atg <class 'Bio.Seq.Seq'>
True
True
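Note that 'atg' passes both checks, so the regex alone cannot settle the ambiguous cases. With Bio.Alphabet gone, the context that newer Biopython releases offer is the "molecule_type" entry in a SeqRecord's annotations dictionary, which parsers for some file formats fill in and which may be missing otherwise. A minimal sketch of reading it, assuming a hypothetical helper named `is_protein_record` and using a stand-in object so the example runs without Biopython installed:

```python
from types import SimpleNamespace


def is_protein_record(record):
    """Return True or False if the record declares a molecule type, else None."""
    annotations = getattr(record, "annotations", None) or {}
    molecule_type = annotations.get("molecule_type")
    if molecule_type is None:
        return None  # nothing declared; fall back to a letter-based guess
    return molecule_type.lower() == "protein"


# Stand-in for a Bio.SeqRecord.SeqRecord (an assumption made for this sketch).
fake_record = SimpleNamespace(annotations={"molecule_type": "protein"})
print(is_protein_record(fake_record))  # True
```

A real SeqRecord can be passed in the same way, since the helper only looks at the annotations attribute. A None result means the record carries no declared type.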
