
Downloading DNA sequence data in R using entrez_fetch: cannot retrieve query

I'm trying to download DNA sequence data from NCBI using entrez_fetch. With the following code, I search for the IDs of the sequences I need with entrez_search, and then attempt to download the sequence data in FASTA format:

library(rentrez)
#Search for sequence ids
search <- entrez_search(db = "biosample", 
                        term = "Escherichia coli[Organism] AND geo_loc_name=USA:WA[attr]",
                        retmax = 9999, use_history = T)

search$ids
length(search$ids)
search$web_history

#Download sequence data
ecoli_fasta <- entrez_fetch(db = "nuccore",
                            web_history = search$web_history,
                            rettype = "fasta")

When I do this, I get the following error:

Error: HTTP failure: 400
Cannot+retrieve+query+from+history

I don't understand what this means and Googling hasn't led me to an answer.

I tried a different package (ape) and its read.GenBank function as an alternative, but that method only managed to download about 1000 of the 12000 sequences I need. I would like to use entrez_fetch if possible. Does anyone have any insight?

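This may be a start. Note that the search below runs against nuccore, the same database that entrez_fetch reads from, so the web history and the fetch refer to the same database.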

Also be aware that queries to genome databases can return massive amounts of data, so be sure to limit your queries.
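One likely reason for the error in the question is that the web history came from a search in biosample but was then passed to a fetch on nuccore, and a history entry is tied to the database that produced it. Below is a minimal sketch of bridging the two with entrez_link. The helper name link_history is hypothetical, and whether NCBI returns a biosample-to-nuccore link for these particular records is an assumption that should be checked by inspecting names(links$web_histories).

```r
# Hypothetical helper: convert a web history from one database into a
# web history for a linked database using entrez_link.
# Requires the rentrez package; nothing is sent to NCBI until it is called.
link_history <- function(web_history, from = "biosample", to = "nuccore") {
  links <- rentrez::entrez_link(dbfrom = from, db = to,
                                web_history = web_history,
                                cmd = "neighbor_history")
  # With cmd = "neighbor_history" the linked IDs stay on the NCBI server
  # and the result carries one web history per link type.
  if (length(links$web_histories) == 0) {
    stop("No links from ", from, " to ", to, " were returned")
  }
  # Assumption: the first entry is the link of interest; check
  # names(links$web_histories) if more than one is returned.
  links$web_histories[[1]]
}

# Usage (needs network access):
# search <- rentrez::entrez_search(db = "biosample",
#   term = "Escherichia coli[Organism] AND geo_loc_name=USA:WA[attr]",
#   use_history = TRUE)
# nuc_history <- link_history(search$web_history)
# fasta <- rentrez::entrez_fetch(db = "nuccore", web_history = nuc_history,
#                                rettype = "fasta", retmax = 100)
```

The returned history can then be used with entrez_fetch on nuccore in the same way as a history from a direct nuccore search.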

Build search web history

library(rentrez)

search <- entrez_search(db="nuccore", 
                        term="Escherichia coli[Organism]", 
                        use_history = TRUE)

Use web history to fetch data

cat(entrez_fetch(db="nuccore", 
  web_history=search$web_history, rettype="fasta",  retstart=24, retmax=100))
>pdb|7QQ3|I Chain I, 23S ribosomal RNA
NGTTAAGCGACTAAGCGTACACGGTGGATGCCCTGGCAGTCAGAGGCGATGAAGGACGTGCTAATCTGCG
ATAAGCGTCGGTAAGGTGATATGAACCGTTATAACCGGCGATTTCCGAATGGGGAAACCCAGTGTGTTTC
GACACACTATCATTAACTGAATCCATAGGTTAATGAGGCGAACCGGGGGAACTGAAACATCTAAGTACCC
CGAGGAAAAGAAATCAACCGAGATTCCCCCAGTAGCGGCGAGCGAACGGGGAGCAGCCCAGAGCCTGAAT
CAGTGTGTGTGTTAGTGGAAGCGTCTGGAAAGGCGCGCGATACAGGGTGACAGCCCCGTACACAAAAATG
CACATGCTGTGAGCTCGATGAGTAGGGCGGGACACGTGGTATCCTGTCTGAATATGGGGGGACCATCCTC
CAAGGCTAAATACTCCTGACTGACCGATAGTGAACCAGTACCGTGAGGGAAAGGCGAAAAGAACCCCGGC
...

Use a loop to page through the sequences in batches, e.g.

# retstart is zero-based, so start at 0 to include the first record
for(i in seq(0, 200, 100)){
  cat(entrez_fetch(db="nuccore", 
    web_history=search$web_history, rettype="fasta",  retstart=i, retmax=100))
}
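For larger downloads it is usually more practical to append each batch to a file than to print it. The sketch below illustrates that idea. The helper names batch_starts and fetch_fasta_batches, the default batch size, and the pause between requests are all assumptions made for illustration, not part of rentrez.

```r
# Zero-based start offsets for paging through `total` records.
batch_starts <- function(total, batch_size) {
  if (total <= 0) return(numeric(0))
  seq(0, total - 1, by = batch_size)
}

# Hypothetical helper: fetch FASTA records in batches and append them
# to `file`. Requires the rentrez package and network access when called.
fetch_fasta_batches <- function(web_history, total, file,
                                batch_size = 100, pause = 0.5) {
  for (start in batch_starts(total, batch_size)) {
    chunk <- rentrez::entrez_fetch(db = "nuccore",
                                   web_history = web_history,
                                   rettype = "fasta",
                                   retstart = start,
                                   retmax = batch_size)
    cat(chunk, file = file, append = TRUE)
    Sys.sleep(pause)  # brief pause between requests to NCBI
  }
  invisible(file)
}

# Usage (needs network access); `total` is capped here on purpose:
# fetch_fasta_batches(search$web_history, total = 300, file = "ecoli.fasta")
```

search$count reports how many records the search matched, so it is worth checking that number before passing it as total, since an unrestricted organism query can match a very large number of records.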
