Retrieving and parsing protein sequences from GenBank using Entrez in BioPython

As will soon be obvious, I am new to Python and coding in general. I have a list of Gene IDs stored in a text file, and I want to use the Entrez functions to search the GenBank database and retrieve the protein sequences corresponding to those IDs. Ideally the end product would be a FASTA file, since I am really only interested in the sequences at this point. Using the Biopython tutorial ( http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec15 ), I came up with this:

from Bio import Entrez
from Bio import SeqIO
Entrez.email = "me@mysite.com"
id_list = set(open('test.txt', 'rU'))
handle = Entrez.efetch(db="protein", id=id_list, rettype="fasta", retmode="text")   
for seq_record in SeqIO.parse(handle, "fasta"):
    print ">" + seq_record.id, seq_record.description
    print seq_record.seq
handle.close()

But when I run it, I get this error:

File "C:/Python27/Scripts/entrez_files.py", line 5, in <module>
  handle = Entrez.efetch(db="protein", id=id_list, rettype="fasta", retmode="text")
File "C:\Python27\lib\site-packages\Bio\Entrez\__init__.py", line 145, in efetch
  if ids.count(",") >= 200:
AttributeError: 'set' object has no attribute 'count'

I get a similar error every time I use rettype = 'fasta'. When I use rettype = 'gb' I don't get this error, but I really want to end up with a FASTA file. Does anybody have any suggestions? Thank you!

EDIT: Sorry, I neglected to show what the input file looks like. In a perfect world the code would accept an input format like this:

gi|285016822|ref|YP_003374533.1|
gi|285018887|ref|YP_003376598.1|
gi|285016823|ref|YP_003374534.1|
gi|285016824|ref|YP_003374535.1| 
....

But I have also tried a simplified version with only the Gene IDs (GIs), like this:

285016822 
285018887 
285016823
285016824...

As you can see in efetch's source code (and in your traceback, which fails at ids.count(",")), whatever you pass as the id parameter must have a count method. Usually this will be a string with a single ID or a Python list with all the IDs. You are using a set, presumably to eliminate repeated values, and a set has no count method, so you can convert it to a list like this:

id_list = set(open('test.txt', 'rU'))
id_list = list(id_list) 
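
Note that each line read from the file still carries its trailing newline, and the pipe-delimited form from your edit is not a bare ID. Below is a minimal Python 3 sketch of one way to handle both, not a definitive implementation. The helper names read_ids and fetch_fasta and the output file name are hypothetical; only Entrez.efetch and its db, id, rettype and retmode arguments come from your code. It assumes the GI number after "gi|" is the identifier you want to query with, which is worth checking against your records.

```python
import re


def read_ids(lines):
    """Return unique IDs from an iterable of lines, in first-seen order.

    Accepts bare GI numbers ("285016822") or pipe-delimited headers
    ("gi|285016822|ref|YP_003374533.1|"); for the latter, the number
    after "gi|" is kept (an assumption about which ID is wanted).
    """
    seen = set()
    ids = []
    for line in lines:
        token = line.strip()          # drop newlines and stray spaces
        if not token:
            continue                  # skip blank lines
        match = re.match(r"gi\|(\d+)\|", token)
        if match:
            token = match.group(1)
        if token not in seen:         # de-duplicate without losing order
            seen.add(token)
            ids.append(token)
    return ids


def fetch_fasta(ids, email, out_path):
    """Fetch the records as FASTA text and write them to out_path."""
    # Imported here so read_ids() works even without Biopython installed.
    from Bio import Entrez

    Entrez.email = email
    # A single comma-separated string has a count method, so the
    # set-versus-list question never comes up.
    handle = Entrez.efetch(db="protein", id=",".join(ids),
                           rettype="fasta", retmode="text")
    try:
        with open(out_path, "w") as out:
            out.write(handle.read())
    finally:
        handle.close()
```

With these in place, something like fetch_fasta(read_ids(open("test.txt")), "me@mysite.com", "proteins.fasta") should write the FASTA text straight to disk, with no need to parse and re-print each record.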
