How do I check if a sequence is a protein sequence or not?
Given a random sequence, how can I check whether that sequence is a protein or not?
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
my_prot = Seq("TGEKPYVCQECGKAFNCSSYLSKHQR")
my_prot
my_prot.alphabet  # How can I make the check here?
If your Seq object has an assigned alphabet, you can check if that alphabet is a protein alphabet:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC, ProteinAlphabet
my_prot = Seq("TGEKPYVCQECGKAFNCSSYLSKHQR", alphabet=IUPAC.IUPACProtein())
print(isinstance(my_prot.alphabet, ProteinAlphabet))
However, if the alphabet is not known, you'll have to employ some heuristics to guess whether or not it's a protein sequence. This could be as easy as checking whether the sequence consists entirely of the nucleotide letters A, T, C and G (or U), or whether it uses other letter codes.
But this isn't perfect. For instance, the sequence "ATCG" could be alanine, threonine, cysteine, glycine (i.e. a protein), or it could be adenine, thymine, cytosine, guanine (DNA). Similarly, "ACG" could be a protein, RNA, or DNA. From the letters alone, it is impossible to be sure that such a sequence is DNA and not a protein sequence. However, if you have a SeqRecord or other context for the Seq, you may be able to check whether it's a protein sequence.
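The letter-based heuristic described above could be sketched roughly as follows. This is only an illustration, not part of Biopython: the function name guess_molecule_type and the two letter sets are assumptions made here, and the protein set is limited to the 20 standard amino acid codes (so U is treated as a nucleotide-only letter).

```python
# Hypothetical helper, not a Biopython function.
# Assumption: nucleotides are A, C, G, T, U plus N for "any base".
NUCLEOTIDE_LETTERS = set("ACGTUN")
# Assumption: only the 20 standard amino acid one-letter codes count.
PROTEIN_LETTERS = set("ACDEFGHIKLMNPQRSTVWY")


def guess_molecule_type(sequence):
    """Return 'protein', 'nucleotide', 'ambiguous' or 'unknown'."""
    letters = set(str(sequence).upper())
    if not letters:
        return "unknown"
    if letters <= NUCLEOTIDE_LETTERS:
        # A, C, G, T and N are also amino acid codes, so only a U
        # rules out a (standard) protein reading.
        return "nucleotide" if "U" in letters else "ambiguous"
    if letters <= PROTEIN_LETTERS:
        return "protein"
    return "unknown"


print(guess_molecule_type("TGEKPYVCQECGKAFNCSSYLSKHQR"))
print(guess_molecule_type("ATCG"))
print(guess_molecule_type("ACGU"))
```

Returning "ambiguous" instead of forcing a choice keeps the "ATCG" case honest: the caller still has to decide from context.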
Apparently Biopython removed Bio.Alphabet.
Copying from https://www.biostars.org/p/102/
You can use:
import re
from Bio.Seq import Seq

def validate(seq, alphabet='dna'):
    alphabets = {'dna': re.compile('^[acgtn]*$', re.I),
                 'protein': re.compile('^[acdefghiklmnpqrstvwy]*$', re.I)}
    if alphabets[alphabet].search(seq) is not None:
        return True
    else:
        return False

dataz = 'AAAAAAACCCCCCCCCCCCCCDDDDDDRRRRRRRREERRRRGGG'
pippo = Seq(dataz)
print(pippo, type(pippo))
print(validate(str(pippo), 'dna'))
print(validate(str(pippo), 'protein'))
dataz = 'atg'
pippo = Seq(dataz)
print(pippo, type(pippo))
print(validate(str(pippo), 'dna'))
print(validate(str(pippo), 'protein'))
Output:
AAAAAAACCCCCCCCCCCCCCDDDDDDRRRRRRRREERRRRGGG <class 'Bio.Seq.Seq'>
False
True
atg <class 'Bio.Seq.Seq'>
True
True
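As a complement to the regex check, Biopython releases without Bio.Alphabet (1.78 and later) keep the molecule type in a SeqRecord's annotations dictionary under the "molecule_type" key, with values such as "DNA", "RNA" or "protein". A minimal sketch of reading it is below. The helper name is_protein_record is an assumption, and the objects are plain stand-ins built with SimpleNamespace so the snippet runs without Biopython installed; a real SeqRecord would be passed in the same way.

```python
from types import SimpleNamespace


def is_protein_record(record):
    """Hypothetical helper: True if the record is annotated as protein."""
    # A record without annotations, or without the key, is not treated
    # as protein; the label is only as reliable as its source.
    annotations = getattr(record, "annotations", None) or {}
    return annotations.get("molecule_type") == "protein"


# Stand-ins for SeqRecord objects (assumption: only .annotations matters here).
protein_like = SimpleNamespace(annotations={"molecule_type": "protein"})
dna_like = SimpleNamespace(annotations={"molecule_type": "DNA"})
unlabeled = SimpleNamespace(annotations={})

print(is_protein_record(protein_like))
print(is_protein_record(dna_like))
print(is_protein_record(unlabeled))
```

Not every record carries this annotation, so falling back to a letter-based check such as validate() may still be needed.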