
Retrieving whole genome GenBank files for some organism using BioJava or Biopython

Does anyone have an idea how to automatically search for and parse gbk files from the NCBI FTP site using either Biopython or BioJava? I have looked for suitable utilities in BioJava and have not found any. I have also tried Biopython, and here is my code:

from Bio import Entrez

Entrez.email = "test@yahoo.com"
Entrez.tool = "MyLocalScript"
# Search the nucleotide database for every record assigned to the organism
handle = Entrez.esearch(db="nucleotide", term="Mycobacterium avium[Orgn]")
record = Entrez.read(handle)
handle.close()
print(record)
print(record["Count"])
id_L = record["IdList"]
print(id_L)
print(len(id_L))

However, there are only 3 Mycobacterium avium genomes (whole genome sequences, fully annotated), yet the count I am getting is 59897.

Can anyone tell me how to perform the search in either BioJava or Biopython? Otherwise I will have to automate this process from scratch.

Thank you.

The way we do it is by passing the ID explicitly to the EFetch interface:

Entrez.efetch(db="nucleotide", id=<ACCESSION ID HERE>, rettype="gb", retmode="text")
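To make that call concrete, here is a minimal sketch that fetches one record and parses it with Bio.SeqIO. The function names efetch_kwargs and fetch_genbank_record are made up for illustration, and you have to supply the accession and your own e-mail address. Biopython is imported inside the function only so that the small helper stays usable without it.

```python
def efetch_kwargs(accession):
    """Keyword arguments asking EFetch for one GenBank flat file as text."""
    return {"db": "nucleotide", "id": accession, "rettype": "gb", "retmode": "text"}


def fetch_genbank_record(accession, email):
    """Download one GenBank record from NCBI and parse it into a SeqRecord."""
    # Imported lazily (an illustration choice); requires Biopython.
    from Bio import Entrez, SeqIO

    Entrez.email = email  # NCBI asks for a contact address on every request
    handle = Entrez.efetch(**efetch_kwargs(accession))
    try:
        # SeqIO.read expects exactly one record in the handle
        return SeqIO.read(handle, "genbank")
    finally:
        handle.close()


# Hypothetical usage (needs network access and a real accession):
#   record = fetch_genbank_record(my_accession, "you@example.org")
#   print(record.id, len(record.features))
```

For very large records, rettype="gbwithparts" may be needed to get the full sequence rather than a CONTIG-style master record; treat that as something to check against the EFetch documentation.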

A search term as broad as the one you used returns far too many matches, and you end up retrieving all of them. The same term matches 48 different BioProjects, see here:

http://www.ncbi.nlm.nih.gov/bioproject/?term=Mycobacterium+avium

From experience, the most accurate way to get what you want is to use the accession ID.

If you want to search NCBI for this information dynamically and in an automated way, you can search by name through the ESearch interface, which is called in much the same way as EFetch. That gives you a list of accession IDs, which you can then pass to EFetch to retrieve the nucleotide records (or whatever other information you need).

http://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_ESearch_

The Entrez E-Utilities are very flexible, although you will have to filter the results yourself to obtain only the data you need.
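The search-then-fetch workflow could look roughly like the sketch below. The title filter in build_term is an assumption about how complete genomes are labelled and will probably need tuning for your organism. All three function names are hypothetical, and idtype="acc" is used on the understanding that ESearch then returns accession.version strings instead of GI numbers.

```python
def build_term(organism):
    """Entrez query limited to one organism and to 'complete genome' titles."""
    return f'{organism}[Orgn] AND "complete genome"[Title]'


def search_accessions(organism, email, retmax=100):
    """Return the identifiers ESearch reports for build_term(organism)."""
    from Bio import Entrez  # lazy import; requires Biopython

    Entrez.email = email
    handle = Entrez.esearch(db="nucleotide", term=build_term(organism),
                            retmax=retmax, idtype="acc")
    try:
        result = Entrez.read(handle)
    finally:
        handle.close()
    return list(result["IdList"])


def fetch_records(accessions, email):
    """Yield one SeqRecord per accession from a single batched EFetch call."""
    from Bio import Entrez, SeqIO  # lazy import; requires Biopython

    Entrez.email = email
    handle = Entrez.efetch(db="nucleotide", id=",".join(accessions),
                           rettype="gb", retmode="text")
    try:
        yield from SeqIO.parse(handle, "genbank")
    finally:
        handle.close()


# Hypothetical usage (needs network access):
#   ids = search_accessions("Mycobacterium avium", "you@example.org")
#   for rec in fetch_records(ids, "you@example.org"):
#       print(rec.id, rec.description)
```

Checking the count and the descriptions before downloading anything is a cheap way to confirm that the filter really narrows 59897 hits down to the handful of genomes you expect.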

However, if you are going to do further analysis with this data, and you neither need the very latest version of the sequences nor need to fetch different types of data dynamically, it may be better to simply download what you need from the NCBI FTP site and process/filter it locally. That might be faster than querying Entrez, which in my opinion is a little slow when queried in batch.
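A sketch of the FTP route, under some loudly stated assumptions: that the species directory on the NCBI genomes site contains a tab-separated assembly_summary.txt with assembly_level and ftp_path columns, and that each assembly directory holds a file named <directory name>_genomic.gbff.gz. The SUMMARY_URL value and both function names are illustrative guesses, not something taken from the thread, so check them against the FTP site before relying on them.

```python
import gzip

# Assumed location of the per-species summary table; verify before use.
SUMMARY_URL = ("https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/"
               "Mycobacterium_avium/assembly_summary.txt")


def complete_genome_urls(summary_text):
    """Return GenBank flat-file URLs for rows marked 'Complete Genome'."""
    lines = summary_text.splitlines()
    # Column names are taken from the last comment line before the data rows.
    header = [ln for ln in lines if ln.startswith("#")][-1]
    columns = header.lstrip("#").strip().split("\t")
    urls = []
    for ln in lines:
        if not ln or ln.startswith("#"):
            continue
        row = dict(zip(columns, ln.split("\t")))
        if row.get("assembly_level") == "Complete Genome":
            base = row["ftp_path"].rstrip("/")
            name = base.rsplit("/", 1)[-1]
            # Assumed naming convention for the annotated flat file
            urls.append(f"{base}/{name}_genomic.gbff.gz")
    return urls


def parse_gbff_gz(path):
    """Yield SeqRecords from a locally saved, gzip-compressed GenBank file."""
    from Bio import SeqIO  # lazy import; requires Biopython

    with gzip.open(path, "rt") as handle:
        yield from SeqIO.parse(handle, "genbank")


# Hypothetical usage (needs network access):
#   from urllib.request import urlopen, urlretrieve
#   text = urlopen(SUMMARY_URL).read().decode()
#   for url in complete_genome_urls(text):
#       local, _ = urlretrieve(url)
#       for rec in parse_gbff_gz(local):
#           print(rec.id, len(rec.seq))
```

Filtering the summary table first means only the few complete genomes are downloaded, and everything after that runs locally without touching Entrez.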
