
BeautifulSoup doesn't find all spans or children

I am trying to access the sequence on this webpage:
https://www.ncbi.nlm.nih.gov/protein/EGW15053.1?report=fasta

The sequence is stored under <div class="seq gbff">. Each line is stored in a span like this:

<span class="ff_line" id="gi_344258949_1"> *line 1 of sequence* </span>

When I try to search for the spans containing the sequence, BeautifulSoup returns None. I get the same problem when I try to look at the children or contents of the div above the spans.

Here is the code:

import requests
import re
from bs4 import BeautifulSoup
# Create a variable with the url
url = 'https://www.ncbi.nlm.nih.gov/protein/EGW15053.1?report=fasta'

# Use requests to get the contents
r = requests.get(url)

# Get the text of the contents
html_content = r.text

# Convert the html content into a beautiful soup object
soup = BeautifulSoup(html_content, 'html.parser')


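# Look for the div that should hold the sequence and print its children
# (attrs must be a dict, and find() returns a single tag rather than a list)
div = soup.find('div', attrs={'class': 'seq gbff'})
for each in div.children:
    print(each)

# Look for the spans that should hold the individual sequence lines
soup.find_all('span', attrs={'class': 'ff_line'})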

Neither method works and I'd greatly appreciate any help :D

This page uses JavaScript to load the data. requests does not run JavaScript, so the HTML it downloads does not contain the spans yet.

Using DevTools in Chrome/Firefox I found the URL below, and its response contains all the <span> elements:

https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=344258949&db=protein&report=fasta&extrafeat=0&fmt_mask=0&retmode=html&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000
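As a rough sketch, the URL found in DevTools can be requested directly and parsed with the same BeautifulSoup approach as in the question. This assumes the response really does contain the <span class="ff_line"> elements described in the question. The names VIEWER_URL, extract_sequence_lines and fetch_sequence_lines are made up for this example.

```python
import requests
from bs4 import BeautifulSoup

# URL copied from DevTools for this one record (id=344258949).
VIEWER_URL = (
    "https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi"
    "?id=344258949&db=protein&report=fasta&extrafeat=0&fmt_mask=0"
    "&retmode=html&withmarkup=on&tool=portal&log$=seqview"
    "&maxdownloadsize=1000000"
)


def extract_sequence_lines(html):
    """Return the text of every <span class="ff_line"> found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    spans = soup.find_all("span", class_="ff_line")
    return [span.get_text(strip=True) for span in spans]


def fetch_sequence_lines(url=VIEWER_URL):
    """Download the viewer response and pull the sequence lines out of it."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return extract_sequence_lines(response.text)
```

Calling fetch_sequence_lines() should then give a list of strings, one per span, which can be joined with "".join(...) if the whole sequence is needed as one string.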

Now the hard part: you have to find this URL in the page's HTML, because different pages use different arguments in the URL. Alternatively, compare a few URLs and work out the pattern so you can generate the URL yourself.
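If comparing a few URLs shows that only the id argument changes, the URL could be generated with something like the sketch below. That is an assumption, not something verified here: the other arguments are simply copied from the URL captured in DevTools, and build_viewer_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

VIEWER_ENDPOINT = "https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi"


def build_viewer_url(record_id, retmode="html"):
    """Build a viewer URL for one record.

    Assumes every argument except `id` (and optionally `retmode`)
    stays the same as in the URL captured in DevTools.
    """
    params = {
        "id": record_id,
        "db": "protein",
        "report": "fasta",
        "extrafeat": 0,
        "fmt_mask": 0,
        "retmode": retmode,
        "withmarkup": "on",
        "tool": "portal",
        "log$": "seqview",
        "maxdownloadsize": 1000000,
    }
    # urlencode percent-encodes the "$" in "log$" as %24,
    # which is the standard encoding for that character.
    return VIEWER_ENDPOINT + "?" + urlencode(params)
```

For the page in the question this would be build_viewer_url("344258949"). Finding the right id for other pages still has to be done separately.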


EDIT: If you change retmode=html to retmode=xml in the URL, you get the data as XML. With retmode=text you get plain text without HTML tags. retmode=json doesn't work.
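With retmode=text there is no HTML to parse, so BeautifulSoup isn't needed at all. The sketch below assumes the text response is a single FASTA record (a header line starting with ">" followed by sequence lines), which seems likely with report=fasta but is not confirmed here. parse_fasta is a made-up helper name.

```python
def parse_fasta(text):
    """Split the text of a single FASTA record into (header, sequence)."""
    # Drop blank lines and surrounding whitespace.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a FASTA header line starting with '>'")
    header = lines[0][1:]           # header without the leading ">"
    sequence = "".join(lines[1:])   # sequence lines joined into one string
    return header, sequence
```

The text to pass in would come from something like requests.get(url).text, with retmode=text in the URL. This handles only one record per response; a file with several records would need to be split on the ">" lines first.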

