
Retrieving whole genome genbank files for some organism using Biojava or Biopython

Does anyone know how to automatically search for and parse GenBank (.gbk) files from the NCBI FTP site using either Biopython or BioJava? I have looked for suitable utilities in BioJava and have not found any. I have also tried Biopython; here is my code:

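from Bio import Entrez

# NCBI asks for a contact address and a tool name with every request
Entrez.email = "test@yahoo.com"
Entrez.tool = "MyLocalScript"

# Search the nucleotide database for every record assigned to the organism
handle = Entrez.esearch(db="nucleotide", term="Mycobacterium avium[Orgn]")
record = Entrez.read(handle)
handle.close()

print(record)
print(record["Count"])   # total number of matching records
id_L = record["IdList"]  # only the first 20 IDs by default (retmax=20)
print(id_L)
print(len(id_L))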

However, there are only 3 Mycobacterium avium genomes that are fully sequenced and annotated, yet the count I get back is 59897.

Can anyone tell me how to perform this search in either BioJava or Biopython? Otherwise I will have to automate the process from scratch.

Thank you.

The way we do it is to request the record explicitly by its ID through the EFetch interface:

Entrez.efetch(db="nucleotide", id=<ACCESSION ID HERE>, rettype="gb", retmode="text")

A search term as broad as the one you used returns too many matches, all of which you would be downloading. For example, the same search term matches 48 different BioProjects:

http://www.ncbi.nlm.nih.gov/bioproject/?term=Mycobacterium+avium

From experience, the most accurate way to get what you want is to use the ACCESSION ID.

If you want to search NCBI dynamically and automatically, you can search by name through the ESearch interface, which is called in the same way as EFetch. This gives you a list of record IDs, which you can then pass to EFetch to retrieve the nucleotide records (or any other information you need).
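A rough sketch of that ESearch-then-EFetch flow in Biopython might look like the following. The helper names (build_term, fetch_complete_genomes) are made up for illustration. The extra query restrictions ("complete genome[Title]" and "refseq[filter]") are an assumption about how one might narrow the results, so they will probably need adjusting for your organism.

```python
def build_term(organism):
    """Compose an Entrez query narrowed to complete RefSeq genomes.

    The two extra restrictions are assumptions, not something the
    original search used; tune them to the records you actually want.
    """
    return f'"{organism}"[Orgn] AND complete genome[Title] AND refseq[filter]'


def fetch_complete_genomes(organism, email, out_path, retmax=20):
    """Search by organism name, then download the hits in GenBank format.

    Returns the number of records requested from EFetch.
    """
    # Imported here so build_term() can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = email

    # Step 1: ESearch returns the IDs of the matching records.
    handle = Entrez.esearch(db="nucleotide", term=build_term(organism),
                            retmax=retmax)
    result = Entrez.read(handle)
    handle.close()

    ids = result["IdList"]
    if not ids:
        return 0

    # Step 2: EFetch downloads those records as GenBank flat-file text.
    handle = Entrez.efetch(db="nucleotide", id=",".join(ids),
                           rettype="gb", retmode="text")
    with open(out_path, "w") as out:
        out.write(handle.read())
    handle.close()
    return len(ids)


# Example call (performs network requests against NCBI):
# fetch_complete_genomes("Mycobacterium avium", "you@example.org", "avium.gbk")
```

It is worth printing result["Count"] for a new term before fetching anything, to confirm that the query is as narrow as you expect.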

http://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_ESearch_

The Entrez E-Utilities are very flexible, although it is true that you will need to filter the results to only obtain the data you need.

However, if you plan to do further analysis on this data and do not need the very latest version of the sequences or dynamic fetching of different data types, it may be better to download what you need from the FTP site and process and filter it locally. That may be faster than querying Entrez, which in my opinion is a little slow for batch queries.
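For the local route, a minimal filtering sketch could look like this. It assumes the GenBank files are already downloaded into one directory with a .gbk extension, and that "complete genome" in the DEFINITION line is a good enough marker for a finished genome. Both are assumptions, and the function names are hypothetical. It only uses the standard library, so it works as a cheap pre-filter before any full parsing.

```python
from pathlib import Path


def read_definition(path):
    """Return the DEFINITION text of the first record in a GenBank flat file."""
    parts = []
    in_definition = False
    with open(path) as handle:
        for line in handle:
            if line.startswith("DEFINITION"):
                in_definition = True
                parts.append(line[len("DEFINITION"):].strip())
            elif in_definition:
                if line.startswith(" "):
                    # Indented lines continue a wrapped DEFINITION.
                    parts.append(line.strip())
                else:
                    # The next keyword (e.g. ACCESSION) ends the field.
                    break
    return " ".join(parts)


def complete_genomes(directory, pattern="*.gbk"):
    """List the files whose first record is described as a complete genome."""
    hits = []
    for path in sorted(Path(directory).glob(pattern)):
        if "complete genome" in read_definition(path).lower():
            hits.append(path)
    return hits
```

The files that pass the filter can then be parsed in full with Bio.SeqIO.parse(path, "genbank").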

