Extract specific Fasta data from NCBI web page

  1. I'm trying to extract just the FASTA record for "NM_213035.1", but the output contains everything on the page except the FASTA sequence.
  2. When I inspect the page, the FASTA sequence sits inside a <pre> tag.

the code:

import bs4
import sys
from bs4 import BeautifulSoup
import requests

url = requests.get("https://www.ncbi.nlm.nih.gov/nuccore/{FASTA}?report=fasta".format(FASTA="NM_213035.1"))
url.raise_for_status()
ncbi = bs4.BeautifulSoup(url.text, "html.parser")

filename = ncbi.title.text
with open(filename, 'w+') as f:
    for i in ncbi.select('p'):
        f.write(i.getText())

the output:

Warning: The NCBI web site requires JavaScript to function. more... Download features.Download gene features.NCBI Reference Sequence: NM_213035.1

GenBank Graphics

Whole sequence

Selected region

from:

to:

Show reverse complement

Show gap features Your browsing activity is empty.Activity recording is turned off. Turn recording back on

National Center for Biotechnology Information, US National Library of Medicine

8600 Rockville Pike, Bethesda MD, 20894 USA

You are not using the correct URL to fetch FASTA files via the REST API. As @Ghoti pointed out, the correct URLs are described here: https://www.ncbi.nlm.nih.gov/books/NBK25497/

For your specific problem this would be:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NM_213035.1&rettype=fasta&retmode=text
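As a rough sketch of how that URL could be used from Python: the snippet below builds the same efetch query and downloads the plain-text response with the standard library (`requests.get(url).text` should work just as well). The helper names `build_efetch_url`, `fetch_fasta` and `split_fasta` are made up for this illustration and are not part of any NCBI client; `split_fasta` also assumes the response holds a single record.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, db="nuccore"):
    """Return the efetch URL that asks for a record as plain-text FASTA."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH_URL + "?" + urlencode(params)


def fetch_fasta(accession, db="nuccore"):
    """Download the FASTA text for one accession (performs a network request)."""
    with urlopen(build_efetch_url(accession, db), timeout=30) as response:
        return response.read().decode("utf-8")


def split_fasta(text):
    """Split a single-record FASTA string into (header, sequence)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("input does not look like a FASTA record")
    # The first line is the header; the remaining lines are sequence chunks.
    return lines[0].lstrip(">"), "".join(lines[1:])


# Example usage (needs network access):
#   header, sequence = split_fasta(fetch_fasta("NM_213035.1"))
#   with open("NM_213035.1.fasta", "w") as f:
#       f.write(">" + header + "\n" + sequence + "\n")
```

Because the response is already plain text, no HTML parsing with BeautifulSoup is needed.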

If you are using Python, you could use Biotite for this task, a package I am developing: https://www.biotite-python.org/apidoc/biotite.database.entrez.fetch.html
